With the information in your example, I created a Parquet file with 50 million rows of random data, using the same data types. The Parquet file is 391 MB on disk (NTFS).
The query will complete, but it takes approx. 3.5 to 4 min and needs up to 14 GB of memory (4 cores, Win10, 32 GB RAM).
You can see memory usage (in MB) throughout the query below, sampled at 15-second intervals.
Indeed, 14 GB seems really high for a 400 MB Parquet file: that's a 35x multiple on the base file size.
Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I believe quite a lot of optimisations are still possible.
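As a back-of-the-envelope check of the figures reported in this thread (391 MB file, 50 million rows, ~14 GB peak memory), the per-row footprint can be compared directly; this is plain arithmetic on the numbers above, not a measurement:

```python
# Figures taken from the thread: 391 MB Parquet file, 50M rows, ~14 GB peak memory.
file_mb = 391
rows = 50_000_000
peak_gb = 14

multiple = peak_gb * 1000 / file_mb        # peak memory vs. file size on disk
disk_bytes_per_row = file_mb * 1e6 / rows  # compressed bytes per row on disk
mem_bytes_per_row = peak_gb * 1e9 / rows   # bytes per row at peak memory

print(f"{multiple:.0f}x the file size")    # ~36x
print(f"{disk_bytes_per_row:.1f} B/row on disk vs {mem_bytes_per_row:.0f} B/row in memory")
```

So each ~8-byte compressed row balloons to roughly 280 bytes at peak, which is why the multiple looks so dramatic even before accounting for decompression.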
It’s also the aggregation operation: if there are many unique groups, it can take a lot of memory.
Newer DuckDB versions handle out-of-core operations better. But in general, just because the data fits in memory doesn’t mean the operation will, and as I said, 8 GB is very limited memory, so it will entail spilling to disk.
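For reference, a minimal sketch of capping DuckDB's working memory so large operators spill to disk rather than growing unbounded; the temp directory path is a placeholder, and exact spilling behaviour depends on the DuckDB version:

```sql
-- Cap DuckDB's working memory; operators exceeding it spill to disk.
SET memory_limit = '8GB';

-- Directory for spilled intermediate data (placeholder path).
SET temp_directory = '/tmp/duckdb_spill';
```

With a limit like this in place, the aggregation should trade memory for disk I/O and run slower instead of consuming 14 GB of RAM.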
So yes, there are some opportunities for optimization here :-)