I was really surprised that DuckDB choked on 500 GB. That's maybe a week's worth of data.
The partitioning of the Parquet files might be an issue, as not all of the data is neatly partitioned by date. We have trades with different execution dates, clearance dates, and other date values that we need to query on.
It doesn’t usually choke on 500 GB of data. I query 600 GB of Parquet (equivalent to a few TBs of CSV?) daily. It’s not the size of the data, it’s the format.
If date partitioning doesn’t work, just find another chunking key. The important thing is to get the data into Parquet; CSV is just hugely inefficient.
Or spin up a larger compute instance with more memory. I have 256 GB on mine.
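To make the Parquet suggestion concrete, here's a minimal sketch using DuckDB's Python API; the paths and column names are placeholders, not your schema:

    import duckdb

    con = duckdb.connect()

    # One-time conversion: parse the CSVs once, write compressed columnar Parquet.
    # 'trades/*.csv', 'symbol' and 'notional' are hypothetical.
    con.execute("""
        COPY (SELECT * FROM read_csv_auto('trades/*.csv'))
        TO 'trades.parquet' (FORMAT PARQUET)
    """)

    # Later queries read only the columns they touch instead of
    # re-parsing every byte of text.
    print(con.execute("""
        SELECT symbol, sum(notional) AS total_notional
        FROM read_parquet('trades.parquet')
        GROUP BY symbol
    """).fetchall())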
I once tried running an Apache Spark job on a distributed 8-machine cluster against a data lake of 300 GB of TSVs. There was one join in it. It timed out after 8 hours. I realized why: Spark had to do many full table scans of the TSVs, and that is just hugely inefficient. CSV formats are OK for straight-up reads, but any time you have to do analytics operations like aggregates or joins at scale, you're in for a world of pain.
DuckDB has better CSV handling than Spark but a large dataset in a poor format will stymie any engine.
Did Athena on CSV work for you? I've used Athena and it struggles with CSV at scale too.
Btw, I'm not suggesting you use Spark. I'm saying that even Spark didn't work on large TSV datasets (it only takes a JOIN or GROUP BY to kill query performance). CSV as a storage format is simply the wrong one for analytics.
Partitioning is irreversible, but coming up with a thoughtful scheme isn't that hard. You just need to hash something. Even something as simple as an FNV hash on some meaningful field is sufficient. In one of my datasets, I chunk by week, then by FNV hash modulo 50 chunks, so it looks like this:
/yearwk=202501/chunk=24/000.parquet
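If it helps, here's a rough sketch of producing that kind of layout with DuckDB's Python API. I'm using DuckDB's built-in hash() rather than FNV specifically, and received_ts / trade_id are made-up column names:

    import duckdb

    con = duckdb.connect()

    # Derive a year+week bucket and a hash-based chunk, then write
    # hive-partitioned Parquet: yearwk=.../chunk=.../*.parquet
    # Assumes received_ts parses as a timestamp; hash() is DuckDB's
    # built-in hash, standing in for FNV.
    con.execute("""
        COPY (
            SELECT *,
                   strftime(received_ts, '%Y%W') AS yearwk,
                   hash(trade_id) % 50           AS chunk
            FROM read_csv_auto('trades/*.csv')
        )
        TO 'trades_parquet' (FORMAT PARQUET, PARTITION_BY (yearwk, chunk))
    """)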
Ask an LLM to suggest a partitioning scheme, or think of one yourself.
CSV is the mistake; the move here is to get out of CSV. Partitioning is secondary: it's only used here for chunking the Parquet, nothing else. You are not locked into anything.
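On the multiple-dates concern: even if the files are chunked by received date, a query on execution date still only reads the columns it needs, and DuckDB can skip row groups whose min/max statistics rule them out; how much gets skipped depends on how the other dates correlate with the chunking key. A hypothetical query against the layout above (column names are made up):

    import duckdb

    # Filter on a date that is NOT the partition key. Column pruning still
    # applies, and Parquet row-group statistics can skip irrelevant data.
    con = duckdb.connect()
    rows = con.execute("""
        SELECT trade_id, execution_date, notional
        FROM read_parquet('trades_parquet/**/*.parquet', hive_partitioning = true)
        WHERE execution_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
    """).fetchall()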
Yes, on Athena we process much larger CSV files, but the cost is crazy. We also have ORC and Parquet files for other datasets, which we process with EMR Spark. I really want to get off those distributed analytics engines whenever possible.
I have to think about partitioning; Spark and Athena both had issues with partitioning by received date. They end up scanning way too much data.