
With the information in your example... I created a Parquet file with 50 million rows, random data, same data types. The Parquet file is 391 MB on disk (NTFS).

The query will complete, but it takes approximately 3.5 to 4 minutes and needs up to 14 GB of memory (4 cores, Win10, 32 GB RAM).

You can see below the memory usage in MB throughout the query, sampled at 15-second intervals.

  duckdb      321.01 -> Start Query
  duckdb     6302.12
  duckdb    13918.04
  duckdb    10963.74
  duckdb     8586.76
  duckdb     7613.86
  duckdb     6749.53
  duckdb     5990.96
  duckdb     5293.35
  duckdb     4205.53
  duckdb     3153.59
  duckdb     1482.86
  duckdb      386.29 -> End Query
So yes, there are some opportunities for optimization here :-)
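A sampling loop like the one behind those numbers can be sketched in a few lines. The original figures were taken externally against the duckdb process on Windows; this stand-in is Linux-only (it reads `/proc/self/status`) and samples the current process, so it is just an illustration of the technique:

```python
import time

def rss_mb():
    """Current resident set size of this process in MB (Linux /proc)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    return 0.0

def sample_memory(interval_s, duration_s):
    """Collect RSS samples every interval_s seconds for duration_s seconds."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(rss_mb())
        time.sleep(interval_s)
    return samples
```

Running it with `interval_s=15` alongside a long query yields a trace like the table above.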


Thanks for the benchmarks! :)

Indeed, 14 GB seems really high for a 400 MB Parquet file; that's a 35x multiple of the base file size.

Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I believe quite a lot of optimisations are indeed still possible.


It’s also the aggregation operation. If there are many unique groups, it can take a lot of memory.

Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because the data fits in memory doesn't mean the operation will, and as I said, 8 GB is very limited memory, so it will entail spilling to disk.

https://duckdb.org/2024/03/29/external-aggregation.html




