
With the information in your example... I created a Parquet file with 50 million rows, random data, same data types. The Parquet file is 391 MB on disk (NTFS).

The query will complete, but it takes approximately 3.5 to 4 minutes and needs up to 14 GB of memory (4 cores, Win10, 32 GB RAM).

You can see below the memory usage in MB throughout the query, sampled at 15-second intervals.

  duckdb      321.01 -> Start Query
  duckdb     6302.12
  duckdb    13918.04
  duckdb    10963.74
  duckdb     8586.76
  duckdb     7613.86
  duckdb     6749.53
  duckdb     5990.96
  duckdb     5293.35
  duckdb     4205.53
  duckdb     3153.59
  duckdb     1482.86
  duckdb      386.29 -> End Query
So yes, there are some opportunities for optimization here :-)
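A sampling loop like the one behind those numbers can be sketched in a few lines. The original figures were taken externally against the duckdb process on Windows; this stand-in is Linux-only (it reads `/proc/self/status`) and samples the current process, so it is just an illustration of the technique:

```python
import time

def rss_mb():
    """Current resident set size of this process in MB (Linux /proc)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    return 0.0

def sample_memory(interval_s, duration_s):
    """Collect RSS samples every interval_s seconds for duration_s seconds."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(rss_mb())
        time.sleep(interval_s)
    return samples
```

Running it with `interval_s=15` alongside a long query yields a trace like the table above.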


Thanks for the benchmarks! :)

Indeed, 14 GB seems really high for a 400 MB Parquet file; that's a 35x multiple of the base file size.

Of course, the data is compressed on disk, but even the uncompressed data isn't that large, so I believe quite a lot of optimisations are indeed still possible.


It’s also the aggregation operation. If there are many unique groups, it can take a lot of memory.

Newer DuckDB versions are able to handle out-of-core operations better. But in general, just because the data fits in memory doesn't mean the operation will, and as I said, 8 GB is very limited memory, so it will entail spilling to disk.

https://duckdb.org/2024/03/29/external-aggregation.html




