
> Casually querying an S3 bucket with either of those is easily a half day's work for any meaningful size of workload.

With ClickHouse and DuckDB you just type: select * from (S3 path and access credentials)
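
To make that concrete, here's roughly what each looks like (bucket path and credentials below are placeholders, not a real setup):

    -- DuckDB: read Parquet straight off S3 via the httpfs extension
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region = 'us-east-1';
    SET s3_access_key_id = 'AKIA...';
    SET s3_secret_access_key = '...';
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');

    -- ClickHouse: the same idea via the s3 table function
    SELECT *
    FROM s3('https://my-bucket.s3.amazonaws.com/events/*.parquet',
            'AKIA...', '...', 'Parquet');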

Let's see what it takes on AWS Athena:

1) "Before you run your first query, you need to set up a query result location in Amazon S3."

2) What the fuck is an AwsDataCatalog? Why can't I just point it to a file/path?

3) What the fuck is AWS Glue Crawler? Oh shit another service I have to work with.

4) Oops. Glue can't access my bucket, now I have to create a new IAM role for it.

5) Okay, great I can finally run a query!

6) It wrote my results to a text file on S3. Awesome. I feel so goddamn efficient. I'm so fucking glad you suggested this Rube Goldberg disaster.

And if I don't have elevated access to our AWS account, it's not even that simple; instead there are Slack messages, emails, and Jira tickets to get DevOps to provision me and/or set it up for me (incorrectly the first time, of course). A week later I finally get a text file on a fucking S3 bucket. Awesome.



If someone were following the setup steps to execute a query in a serious setting (as opposed to reviewing them just for the sake of writing a sarcastic forum comment), they might find that CREATE TABLE .. MSCK REPAIR TABLE quickly becomes finger memory, and that the only real usability complaint about Athena is that its interactive query editor only allows 10 tabs to be open at a time.
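
For reference, once the result location and IAM permissions exist, the whole ritual is roughly this (table, columns, and bucket below are made up):

    -- Athena/Hive-style DDL: declare the schema and point it at the prefix
    CREATE EXTERNAL TABLE events (
      user_id    string,
      event_type string,
      ts         timestamp
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/';

    -- pick up the existing dt=... partitions under that prefix
    MSCK REPAIR TABLE events;

    SELECT count(*) FROM events WHERE dt = '2024-01-01';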


Nope. It's a pain in the ass and easily the least ergonomic thing I've ever encountered. Parquet already has a schema, yet AWS Athena forces you either to provide the schema for your Parquet file yourself or to have Glue "discover" it instead.

I'm sure your happy little setup works great for you, but to imply or suggest that it's somehow an easy solution is misleading to the point of outright lies.


Jesus, why are you so angry at the prospect of having to execute a CREATE TABLE statement? Are you okay?


Because someone needs to be angry. The dev world is full of complacent fucks who are okay with this constant backslide of building perpetually worse software.


TIL PTSD from using AWS is real.


Why bother doing any of that? Just set your bucket up with open access and dump your files anywhere on S3, in whatever format, compression, and schema you want. Don't bother with prefixes or a naming convention, since that's a whole thing. Searching for what you're looking for is now really easy. Make sure you do a recursive list at the root of the bucket so you know what you're working with. It might take a while to respond; if so, just recursively list again and again. Pick the files you want, download them to your laptop, and voila! Once you've finished munging your results, drop them back in the bucket and head home. You're done!

Okay, sorry for the sarcasm, but what does everyone want? Everyone makes fun of the DataSwamp (DataLake, Lakehouse, etc.), but that's what you're gonna get if you don't put any effort into learning your data storage strategy or any discipline into maintaining and securing it.

If someone wants you to have access to this data, then have them take a few minutes to set you up with an external table. It's probably as easy as `CREATE EXTERNAL TABLE ...`.
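
If the warehouse happens to be Snowflake, for instance, the owner's side is roughly this sketch (stage/table/role names are made up, and the exact options will vary):

    -- point a stage at the bucket prefix, expose it as an external table, grant it out
    CREATE STAGE raw_events_stage
      URL = 's3://my-bucket/events/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

    CREATE EXTERNAL TABLE raw_events
      LOCATION = @raw_events_stage
      FILE_FORMAT = (TYPE = PARQUET);

    GRANT SELECT ON EXTERNAL TABLE raw_events TO ROLE analyst;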

After all, Data Engineering is just a fancy name for Software Engineering. ;)


Using a blob storage platform that charges you every time you read, write, or access your data is a shit strategy for both cost and speed. The entire premise is flawed.

Use it for backup, sharing, and if you must, for at-rest data. But don’t use it for operational data.


Right, I don't think anyone is suggesting to use this as an operational data store.

Use it for infrequent/read-only/analytical/training access patterns. Set your bucket to an infrequent-access storage class, partition your data, and build out a catalogue so you're doing as little listing/scanning/GETting as possible.

Use an operational database for operational interaction patterns. Unload/ELT historical/statistical/etc. data out to your data warehouse (so the analytical workloads aren't bringing down your operational database) as either native or external tables (or a hybrid of both). Cost and speed against this kind of data is going to be way better than most other options, mostly because this is columnar data with analytical workloads running against it.
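
As a rough sketch of that unload step (Redshift-flavored, every name below is made up):

    -- push 90+ day old rows out to S3 as partitioned Parquet,
    -- then query them through an external table instead of the operational DB
    UNLOAD ('SELECT * FROM orders WHERE created_at < CURRENT_DATE - 90')
    TO 's3://my-warehouse-bucket/orders_history/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
    FORMAT AS PARQUET
    PARTITION BY (order_date);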


> Right, i don't think anyone is suggesting to use this as an operational data store.

Databricks, Snowflake, AWS, Azure, GCP, and numerous cloud-scale database providers are 100% suggesting people do precisely that (often without users even realizing it, e.g. Snowflake). It's either critical to their business models or at least some added juice when people pay AWS/GCP $5 per TB scanned. That's why these shit tools and this mentality keep showing up.


K, I'm seeing the crux of your contention. It seems to revolve around either using cloud services in general, or specifically using S3 (or any blob storage) as your data substrate. That's fair.

Out of curiosity, what are you suggesting one should use to store and access vast amounts of data in a cheap and efficient manner?

Regarding using something like Snowflake as your operational database, I'm not sure anyone would do that. Transactional workloads would grind to a halt, and that's called out all the time in their documentation. The closest thing to such a suggestion probably won't be seen until Snowflake releases their Hybrid (OLTP) tables.



