How big are the data sets? I've been trying to get duckdb to work in our company...

wenc · 2025-05-04T22:58:31 1746399511

CSV is a pretty bad format any engine will choke on it. It basically requires a full table scan to get at any data.

You need to convert it into Parquet or some columnar format that lets engines do predicate pushdowns and fast scans. Each parquet file stores statistics about the data it contains so engines can quickly decide if it’s worth reading the file or skipping it altogether.

nojito · 2025-05-04T13:40:40 1746366040

CSV are a poor format to access from S3.

Should convert them to parquet then access and analytics becomes cheap and fast.

aynyc · 2025-05-04T15:04:38 1746371078

I agree. That's how our data is produced. We constantly generate real time data into CSV. As far as I can tell, I can't append to parquet file.

wenc · 2025-05-04T22:56:39 1746399399

Parquet files are already built for append only. Just add a new file.

This is a new paradigm for folks who aren’t in big data — the conventional approach usually involves doing a row INSERT. In big data, appending simply means adding a new file - the database engine will immediately recognize its presence. This is why “select * from ‘*.parquet’” will always operate on the latest dataset.

aynyc · 2025-05-05T01:51:08 1746409868

Wait, so I create a new file for every message?

wenc · 2025-05-05T02:50:28 1746413428

Typically small data is batched. While theoretically you could, I wouldn't create 1 file per row (there would be too many files and your filesystem would struggle). But maybe you can batch 1 day's worth of data (or whatever partitioning works for your data) and write to 1 parquet file?

For example, my data is usually batched by yearwk (year + week no), so my directory structure looks like this:

  /data/yearwk=202501/000.parquet
  /data/yearwk=202502/000.parquet

This is also called the Hive directory structure. When I query, I just do:

  select * from '/data/**/*.parquet';

This is a paradigm shift from standard database thinking for handling truly big data. It's append-only by file.

500GB in CSVs doesn't sound that big though. I'm guessing when you convert to Parquet (a 1-liner in DuckDB, below) it might end up being 50GBs or so.

  COPY (FROM '/data/*.csv') TO 'my.parquet' (FORMAT PARQUET);

aynyc · 2025-05-05T12:49:35 1746449375

I was really surprised duckdb choked on 500GB. That's maybe a week's worth of data.

The partitioning of partquet files might be an issue as not all data are neatly partitioned by date. We have trades with different execution dates, clearance dates and other date values that we need query on.

wenc · 2025-05-05T16:13:40 1746461620

It doesn’t usually choke on 500 gb of data. I query 600 gb (equivalent to a few TBs of CSVs?) of parquets daily. It’s not the size of the data. It’s the type of data.

If date partitioning doesn’t work, just find another chunking key. The key is to get it into parquet format. CSV is just hugely inefficient.

Or spin up a larger compute instance with more memory. I have 256gb on mine.

I tried running an Apache Spark job (8 machine cluster) on a data lake of 300 Gb of TSVs once. This was a distributed cluster. There was one join in it. It timed out after 8 hours. I realized why — Spark had to do many full table scans of the TSVs and it was just so inefficient. CSV formats are ok for straight up reads, but any time you have to do analytics operations like aggregate or join them at scale, you’re in for a world of pain.

DuckDB has better CSV handling than Spark but a large dataset in a poor format will stymie any engine.

aynyc · 2025-05-06T01:05:54 1746493554

We have a spark cluster too. Then switch to Athena. I just dislike the cost structure.

The problem with disk based partition is keys are difficult to manage properly.

wenc · 2025-05-06T02:38:01 1746499081

Did Athena on CSV work for you? I've used Athena and it struggles with CSV at scale too.

Btw I'm not suggesting to use Spark. I'm saying that even Spark didn't work on large TSV datasets (it only takes a JOIN or GROUP BY to kill the query performance). The CSV data storage format is simply the wrong one for analytics.

Partitioning is irreversible, but coming up with a thoughtful scheme isn't that hard. You just need to hash something. Even something as simple as a HNV hash on some meaningful field is sufficient. In one of my datasets, I chunk it by week, then by HNV modulo 50 chunks, so it looks like this:

/yearwk=202501/chunk=24/000.parquet

Ask an LLM to suggest partioning scheme or think of one.

CSV is the mistake. The move here is to get out of CSV. Partitioning is secondary -- partitioning here is only used for chunking the Parquet, nothing else. You are not locked into anything.

aynyc · 2025-05-06T11:15:52 1746530152

Yes, on Athena, we process much larger CSV files. But the cost is too crazy. We also have ORC and Parquet files for other dataset which we process with EMR Spark. I really want to get off those distributed analytic engines whenever possible.

I have to think about partition, Spark/Athena both had issues with partitioning by received date. They are scanning way too much data.

snake_doc · 2025-05-04T13:33:20 1746365600

Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always duckdb fault when there can be a clear I/O bottleneck…

aynyc · 2025-05-04T15:02:37 1746370957

No, the EC2 instance doesn't have 500GB of data. Does DuckDB require that? I actually downloaded the data from S3 to local EBS and still choked.

broner · 2025-05-04T21:47:56 1746395276

Works fine for me on TB+ datasets. Maybe you were doing in-memory rather than persistent database and running out of RAM? https://duckdb.org/docs/stable/clients/cli/overview.html#in-...

aynyc · 2025-05-05T01:52:34 1746409954

Wait, do you insert the data from S3 into duckdb? I was just doing select from file.

broner · 2025-05-06T00:22:17 1746490937

Nope, just reading from S3. Check this out: https://duckdb.org/2024/07/09/memory-management.html

fastasucan · 2025-05-05T18:02:40 1746468160

Maybe its your terminal that chockes because it tries to display to much data? 500GB should be no problem.

higeorge13 · 2025-05-04T14:02:03 1746367323

Could you test with clickhouse-local? It always works better for me.

aynyc · 2025-05-04T15:03:07 1746370987

No, clickhouse is not considered for some other reason. But I think I might revisit it sometime in the future.