
> Casually querying an S3 bucket with either of those is easily a half day's work for any meaningful size of workload.

With ClickHouse and DuckDB you just type: select * from (S3 path and access credentials)
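
To make that concrete, here's roughly what each looks like (bucket path and credentials below are placeholders, not a real setup):

    -- DuckDB: read Parquet straight off S3 via the httpfs extension
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region = 'us-east-1';
    SET s3_access_key_id = 'AKIA...';
    SET s3_secret_access_key = '...';
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');

    -- ClickHouse: the same idea via the s3 table function
    SELECT *
    FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet',
            'AKIA...', '...', 'Parquet');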

Let's see what it takes on AWS Athena:

1) "Before you run your first query, you need to set up a query result location in Amazon S3."

2) What the fuck is an AwsDataCatalog? Why can't I just point it to a file/path?

3) What the fuck is AWS Glue Crawler? Oh shit another service I have to work with.

4) Oops. Glue can't access my bucket, now I have to create a new IAM role for it.

5) Okay, great I can finally run a query!

6) It wrote my results to a text file on S3. Awesome. I feel so goddamn efficient. I'm so fucking glad you suggested this Rube Goldberg disaster.

And if I don't have elevated access to our AWS account, it's not even that simple; instead there are Slack messages, emails, and Jira tickets to get DevOps to provision me and/or set it up for me (incorrectly the first time, of course). A week later I finally get a text file on a fucking S3 bucket. Awesome.



If someone were following the setup steps to execute a query in a serious setting (as opposed to reviewing them just for the sake of writing a sarcastic forum comment), they might find that CREATE TABLE .. MSCK REPAIR TABLE quickly becomes finger memory, and that the only real usability complaint about Athena is that its interactive query editor only allows 10 tabs to be open at a time.
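
For reference, once the result location and IAM permissions exist, the whole ritual is roughly this (table, columns, and bucket below are made up):

    -- Athena/Hive-style DDL: declare the schema and point it at the prefix
    CREATE EXTERNAL TABLE events (
      user_id    string,
      event_type string,
      ts         timestamp
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/';

    -- pick up the existing dt=... partitions under that prefix
    MSCK REPAIR TABLE events;

    SELECT count(*) FROM events WHERE dt = '2024-01-01';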


Nope. It's a pain in the ass and easily the least ergonomic thing I've ever encountered. Parquet already has a schema, yet AWS Athena forces you either to provide the schema for your Parquet file yourself or to have Glue "discover" it instead.

I'm sure your happy little setup works great for you, but to imply or suggest that it's somehow an easy solution is misleading to the point of outright lies.


Jesus, why are you so angry at the prospect of having to execute a CREATE TABLE statement? Are you okay?


Because someone needs to be angry. The dev world is full of complacent fucks who are okay with this constant backslide of building perpetually worse software.


TIL PTSD from using AWS is real.


Why bother doing any of that? Just set your bucket up with open access and dump your files anywhere on S3, in whatever format, compression, and schema you want. Don't bother with prefixes or a naming convention, since that's a whole thing. Searching for what you're looking for is now really easy. Make sure you do a recursive list at the root of the bucket so you know what you're working with. It might take a while to respond; if so, just recursively list again and again. Pick the files you want, download them to your laptop, and voila! Once you've finished munging your results, drop them back in the bucket and head home. You're done!

Okay, sorry for the sarcasm, but what does everyone want? Everyone makes fun of the DataSwamp (DataLake, Lakehouse, etc.), but that's what you're gonna get if you don't put any effort into learning your data storage strategy or any discipline into maintaining and securing it.

If someone wants you to have access to this data, then have them take a few minutes to set you up with an external table. It's probably as easy as `CREATE EXTERNAL TABLE ...`.
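
If the warehouse happens to be Snowflake, for instance, the owner's side is roughly this sketch (stage/table/role names are made up, and the exact options will vary):

    -- point a stage at the bucket prefix, expose it as an external table, grant it out
    CREATE STAGE raw_events_stage
      URL = 's3://my-bucket/events/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

    CREATE EXTERNAL TABLE raw_events
      LOCATION = @raw_events_stage
      FILE_FORMAT = (TYPE = PARQUET);

    GRANT SELECT ON EXTERNAL TABLE raw_events TO ROLE analyst;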

After all, Data Engineering is just a fancy name for Software Engineering. ;)


Using a blob storage platform that charges you every time you read, write, or access your data is a shit strategy for both cost and speed. The entire premise is flawed.

Use it for backup, sharing, and if you must, for at-rest data. But don’t use it for operational data.


Right, I don't think anyone is suggesting to use this as an operational data store.

Use it for infrequent/read-only/analytical/training access patterns. Set your bucket to an infrequent-access storage class, partition your data, and build out a catalogue so you're doing as little listing/scanning/GETting as possible.

Use an operational database for operational interaction patterns. Unload/ELT historical/statistical/etc. data out to your data warehouse (so the analytical workloads aren't bringing down your operational database) as either native or external tables (or a hybrid of both). Cost and speed against this kind of data is going to be way better than most other options, mostly because this is columnar data with analytical workloads running against it.
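
As a rough sketch of that unload step (Redshift-flavored, every name below is made up):

    -- push 90+ day old rows out to S3 as partitioned Parquet,
    -- then query them through an external table instead of the operational DB
    UNLOAD ('SELECT * FROM orders WHERE created_at < CURRENT_DATE - 90')
    TO 's3://my-warehouse-bucket/orders_history/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
    FORMAT AS PARQUET
    PARTITION BY (order_date);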


> Right, i don't think anyone is suggesting to use this as an operational data store.

Databricks, Snowflake, AWS, Azure, GCP, and numerous cloud-scale database providers are 100% suggesting people do precisely that (often without users even realizing it, e.g. Snowflake). It's either critical to their business models or at least some added juice when people pay AWS/GCP $5 per TB scanned. That's why these shit tools and this mentality keep showing up.


K, I'm seeing the crux of your contention. It seems to revolve around either using cloud services in general, or specifically using S3 (or any blob storage) as your data substrate. That's fair.

Out of curiosity, what are you suggesting one should use to store and access vast amounts of data in a cheap and efficient manner?

Regarding using something like Snowflake as your operational database, I'm not sure anyone would do that. Transactional workloads would grind to a halt, and that's called out all the time in their documentation. The closest thing to such a suggestion probably won't be seen until Snowflake releases their Hybrid (OLTP) tables.



