
Yes, and actually, it is simpler than expected. The cached segments are written into the local filesystem, and therefore we get both the cache on SSD and the page cache in memory "for free". As ClickHouse is OLAP, the read/written segments are not that small (more than 1 MB is expected), so it works fairly well (the same would be trickier for OLTP).
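
A minimal sketch of that read-through idea (not the actual ClickHouse cache code; the cache directory and the remote_store helper are made up for illustration):

    # Read-through segment cache on the local filesystem. The OS page cache keeps
    # hot segments in memory automatically, so repeated reads of the same segment
    # often never touch the SSD at all.
    import os

    CACHE_DIR = "/var/cache/segments"          # hypothetical cache location

    def read_segment(remote_store, key):
        path = os.path.join(CACHE_DIR, key.replace("/", "_"))
        if os.path.exists(path):               # cache hit: read from SSD, or straight
            with open(path, "rb") as f:        # from the OS page cache if it's hot
                return f.read()
        data = remote_store.get(key)           # cache miss: fetch from object storage
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:             # write-through to the local filesystem
            f.write(data)
        os.rename(tmp, path)                   # atomic publish of the cached segment
        return data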

It does require some care, though. For example, the linear read/write performance of the local SSDs on typical c5d/m5d/r5d instances is only around 4 GB/sec, which is faster than a 25 Gbit network but slower than a 40 Gbit network. You get a 25 Gbit network on the larger regular instances in AWS; otherwise you need the "n" (network-optimized) instance types.

S3 read performance is higher: we achieved 50 Gbit/sec from a single server, and going beyond that requires either more CPU or incompressible data (the data is usually compressed "too well" for a 100 Gbit network to become the bottleneck). But S3 is a tricky beast, and it won't give good performance unless you read in parallel, with carefully selected ranges, from multiple files. S3 latencies are also higher: expect around 50 ms unless you are being throttled (and you will be throttled), which is worse than both local SSDs and every type of EBS. S3 is cheap and saves the cost of cross-AZ traffic... unless you make too many requests. So, a lot of potential trouble and a lot of things to improve.
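
For illustration, parallel ranged reads look roughly like this with boto3 (the bucket, key, and range size are placeholders; a real reader also has to pick ranges that align with the data it actually needs):

    # Issue many concurrent HTTP range requests against one object. A single
    # sequential GET stream will not come close to 50 Gbit/sec.
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-bucket", "table/part-0000.bin"    # placeholders
    RANGE_SIZE = 8 * 1024 * 1024                        # ~8 MB per request

    def fetch_range(offset):
        resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range=f"bytes={offset}-{offset + RANGE_SIZE - 1}")
        return offset, resp["Body"].read()

    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    with ThreadPoolExecutor(max_workers=32) as pool:
        chunks = dict(pool.map(fetch_range, range(0, size, RANGE_SIZE)))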



You won’t be throttled if you partition your data “correctly” and have a reasonable amount of data in your bucket. That’s difficult to do, but more than possible.


We already partition our data correctly, as the AWS solution architects recommend, but there are limits on:

- total throughput (100 Gbit/sec);

- the number of requests.
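
For a sense of scale, a rough calculation (the 8 MB range size per GET is an assumption; ~5,500 GET/s is the documented per-prefix request limit):

    # With large ranged reads, the 100 Gbit/sec throughput ceiling is reached long
    # before the per-prefix request limit, so adding prefixes doesn't help past it.
    throughput_gbit = 100
    bytes_per_sec = throughput_gbit / 8 * 1e9       # 12.5 GB/s
    range_size = 8 * 1024 * 1024                    # 8 MB per GET (assumed)
    print(bytes_per_sec / range_size)               # ~1,490 requests/s, well under 5,500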

For example, at this moment I'm running an experiment: creating 10, 100, and 500 ClickHouse servers and reading data from S3, either from a MergeTree table or from a set of files.
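
Very roughly, each instance in such a test issues queries like the following (a sketch using clickhouse-driver; the bucket path, file format, and table name are placeholders, not the actual benchmark setup):

    from clickhouse_driver import Client

    client = Client("localhost")

    # Read a set of files directly via the s3 table function...
    rows_files = client.execute(
        "SELECT count() FROM s3("
        "'https://my-bucket.s3.amazonaws.com/data/*.parquet', 'Parquet')"
    )

    # ...or read from a MergeTree table whose data is stored on S3.
    rows_mt = client.execute("SELECT count() FROM default.hits")
    print(rows_files, rows_mt)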

Ten instances saturate the 100 Gbit/sec of bandwidth, and there is no further improvement beyond that.

JFYI, 100 Gbit/sec is less than what one PCIe slot with a few M.2 SSDs can give.
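
Rough arithmetic behind that comparison (the per-drive figure is an approximate PCIe 4.0 x4 number, not a measurement):

    network_gbs = 100 / 8                 # 100 Gbit/s network = 12.5 GB/s
    nvme_gbs = 7.0                        # one PCIe 4.0 x4 M.2 SSD, ~7 GB/s reads
    print(2 * nvme_gbs > network_gbs)     # True: two such SSDs already exceed it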


Ahh yes, sorry, you are running it on a single instance. That’s capped, but the aggregate throughput across instances can be a lot higher.

There are request limits, but these are per partition. There is also an undocumented hard cap on LIST requests per second, which is much lower than the one for GET object.


No, I'm creating 10, 100, and 500 different EC2 instances and measuring the aggregate throughput.


Worth noting that ClickHouse is a distributed MPP DBMS; it can scale up to thousands of bare-metal servers and process terabytes (not terabits) of data per second.

It also works on a laptop, or even without installation.


100 instances of m5.8xlarge, 500 instances of m5.2xlarge


Something must be wrong; at my previous job we were able to achieve a much higher aggregate throughput. Are you spreading them across AZs, and are you using a VPC endpoint?

We did use multiple VPCs; that might make a difference.


Yes, the current experiment was run from a single VPC. That should explain the difference.


Why not randomise VPCs between instances? It’s fairly easy to do with a fixed pool and random selection.
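
Something along these lines, assuming the VPCs and one subnet per VPC already exist (a sketch with boto3; all IDs are placeholders):

    # Launch each instance into a subnet picked at random from a fixed pool,
    # with the subnets spread across the pre-created VPCs.
    import random
    import boto3

    ec2 = boto3.client("ec2")
    SUBNET_POOL = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]  # one per VPC

    def launch(ami, instance_type):
        return ec2.run_instances(
            ImageId=ami,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            SubnetId=random.choice(SUBNET_POOL),   # spread instances across VPCs
        )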





