So do I understand the landing page correctly: It is possible to run Clickhouse ...

mikeshi42 · on Sept 19, 2023

You can definitely run Clickhouse directly on S3 [1] - though we don't run _just_ on S3 for performance reasons but instead use a layered disk strategy.

A few of the weaknesses of S3 are:

1. API calls are expensive, while storage in S3 is cheap, writing/reading into it is expensive. Using only S3 for storage will incur lots of API calls as Clickhouse will work on merging objects together (which require downloading the files again from S3 and uploading a merged part) continuously in the background. And searching on recent data on S3 can incur high costs as well, if you're constantly needing to do so (ex. alert rules)

2. Latency and bandwidth of S3 are limited, SSDs are an order of magnitude faster to respond to IO requests, and also on-device SSDs typically have higher bandwidth available. This typically is a bottleneck for reads, but typically not a concern for writes. This can be mitigated by scaling out network-optimized instances, but is just another thing to keep in mind.

3. We've seen some weird behavior on skip indices that can negatively impact performance in S3 specifically, but haven't been able to identify exactly why yet. I don't recall if that's the only weirdness we see happen in S3, but it's one that sticks out right now.

Depending on your scale and latency requirements - writing directly to S3 or a simple layered disk + S3 strategy might work well for your case. Though we've found scaling S3 to work at the latencies/scales our customers typically ask for require a bit of work (as with scaling any infra tool for production workloads).

[1] https://clickhouse.com/docs/en/integrations/s3