ClickHouse, a column-oriented DBMS to generate analytical reports in real time

lykr0n · on Sept 30, 2018

ClickHouse has some weird quirks when you think of it as a SQL database, but it's astounding to use. It's faster than one would think, it can do some really cool data modeling, and it provides a wealth of features for the average user out of the box.

The most important thing, and the thing that makes it attractive to me, is that it is almost stupidly simple to set up and get running. It's quite simple (once you wrap your head around it) to do sharding or replication and scale up. The ZooKeeper stuff takes a bit more effort, but most of that is due to ZooKeeper and not ClickHouse.

gary__ · on Oct 1, 2018

A look through the slides below does highlight some of its differences from a standard SQL database.

https://www.slideshare.net/Altinity/migration-to-clickhouse-...

It's a year on now, so perhaps things have changed.

tadkar · on Oct 1, 2018

Second the “stupidly simple to set up and get running”. My company works with billion-row datasets on client sites where we get super locked-down accounts. ClickHouse is a single binary that you can run with no actual “install” needed.

Also echo the comments in the rest of the discussion about it being blazing fast. On our beefier machines we get query throughput in the hundreds of millions of rows per second when the data is not in cache.

drej · on Oct 1, 2018

I remember playing with it a while ago and here are my 2c:

- It's ridiculously fast, like, I didn't know where the performance was coming from.

- Getting it up and running was a bit clunky (Docker saved me), but I hear it's better now.

- It has non-standard (I mean, everyone is non-standard, but this is way off ANSI) and case-sensitive (???) SQL syntax. This annoyed me again and again.

- It seemed (and still seems) like a project that lives and dies with one developer - and no matter how brilliant he may be, I'm not willing to invest in a technology that has this risk and is so hard to migrate off of (because of the non-standard SQL).

I'm sad about the last two points, because the database is rather brilliant otherwise.

PeterZaitsev · on Oct 1, 2018

If you need advanced SQL support, ClickHouse is not there (yet), but if you need high performance for relatively basic queries, ClickHouse is great.

It is developed mostly by Yandex staff, but there is at least one company, https://www.altinity.com/, which offers commercial support, consulting, and training for ClickHouse.

dschuler · on Oct 1, 2018

By one developer, do you mean Yandex, or that most commits are made by a couple of users? Being backed by a large company (the Russian Google, apparently) that has an independent revenue stream seems like a large plus, but maybe not enough to cancel out the risk.

I'm wary of investing effort into a potentially unsupported project as well, but I wonder if ClickHouse only seems "out there" because we're not aware of the Russian tech ecosystem (at least I'm not).

People don't seem concerned about building anything with Firebase, but Google doesn't have a good track record when it comes to changing its mind about priorities or service pricing.

What would you recommend instead for a column-oriented db that you can self-host (commercial or open source)?

drej · on Oct 1, 2018

It's in the Yandex namespace and it is used by said company, which is a huge plus. But if you looked at the development history just a while ago, it was highly dependent on Alexey.

It reminds me of Grumpy (https://github.com/google/grumpy), which was released by Google, but was later basically abandoned when the lead left Google.

That being said, the situation is better than the last time I checked; there are a handful of somewhat active developers: https://github.com/yandex/ClickHouse/graphs/contributors

kermatt · on Oct 1, 2018

> It seemed (and still seems) like a project that lives and dies with one developer

One major contributor (who may be project lead at Yandex?) and a lot of active contributors: https://github.com/yandex/ClickHouse/graphs/contributors

Grue3 · on Oct 1, 2018

>It seemed (and still seems) like a project that lives and dies with one developer

This project is developed by Yandex and they have a team working on it.

dang · on Sept 30, 2018

Discussion from a couple years ago: https://news.ycombinator.com/item?id=11908254

georgewfraser · on Oct 1, 2018

The basic techniques for implementing a fast column-store data warehouse have been well-known for 10 years. There are several excellent commercial and open-source implementations of these techniques:

    - BigQuery
    - Snowflake
    - Redshift
    - Presto

ClickHouse is not one of them. It doesn't have:

    - Transactions
    - Distributed joins
    - Separate compute from storage
    - UPDATE
    - User management

I don't mean to be a jerk, I'm just trying to save people some time. Columnar DBs are well-trod territory and ClickHouse is way behind.

ehfeng · on Oct 1, 2018

I wouldn't call Redshift "excellent", nor would I call ClickHouse "way behind". ClickHouse was the best choice for my last employer's use case (https://twitter.com/zeeg/status/987009550501928960), after many other solutions were tested and benchmarked.

Just because a tool doesn't have a specific feature checklist doesn't mean you should categorically rule it out, particularly if you don't have experience using/running/deploying it.

manigandham · on Oct 1, 2018

Redshift doesn't separate compute from storage either unless you're using Spectrum. Presto isn't a database at all and can read from many data stores. The rest are all cloud-hosted with lots of moving parts. MemSQL, Vertica, Actian, Greenplum, and SQL Server are better comparisons.

ClickHouse is a column-oriented db and actually one of the most advanced, focusing on performance at all costs with lots of table storage engines that provide flexibility for your exact use-case. It also supports distributed joins and deletes but has some limitations they are working on.

It can definitely use better tooling and compatibility though, but that's the tradeoff the core team made, and it seems to be working well for the companies that can afford the time and talent.

georgewfraser · on Oct 1, 2018

My point is there’s just no advantage to ClickHouse. The things that make it fast are in every column store. There are other options that do everything it does and more.
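For readers new to the term, here is a toy sketch of the one property every column store shares: each column is stored contiguously, so an aggregation reads only the columns it needs. The table, the column names, and both helper functions are made up for illustration and are not taken from any particular engine.

```python
# Hypothetical sample data: three rows with three fields each.
rows = [
    {"user": "ann", "country": "de", "bytes": 120},
    {"user": "bob", "country": "us", "bytes": 300},
    {"user": "cat", "country": "de", "bytes": 80},
]

# Column layout: one contiguous list per column instead of one record per row.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_bytes_row_layout(table):
    """Sum one field by walking every full row record."""
    return sum(r["bytes"] for r in table)


def total_bytes_column_layout(cols):
    """Sum one field by reading only that column's list."""
    return sum(cols["bytes"])


row_total = total_bytes_row_layout(rows)
col_total = total_bytes_column_layout(columns)
print(row_total, col_total)
```

Both functions return the same total; the difference is how much data each has to touch, which is what matters once a table has thousands of columns.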

manigandham · on Oct 1, 2018

Speed is the advantage. It's very fast for a self-hosted system and probably only beaten by BigQuery for throughput.

You're right though that most people don't need it and can get 90% of the speed with better usability with the other options.

kermatt · on Oct 1, 2018

By this definition we should all be running Oracle.

PeterZaitsev · on Oct 1, 2018

There is quite a difference between theoretical technologies and a stable, high-performance implementation. The majority of things ClickHouse does are very well known; it just does them well.

Here is an example performance comparison we did at Percona: https://www.percona.com/blog/2017/02/13/clickhouse-new-opens...

bretthoerner · on Oct 1, 2018

ClickHouse stable has both UPDATE and DELETE.

ehfeng · on Oct 1, 2018

Hi Brett

theshadowknows · on Oct 1, 2018

We’re looking into Snowflake at my current job. Have you used it before?

manigandham · on Oct 1, 2018

Snowflake is great for smaller data volumes, but slow at supersized datasets. It has the best UX out of all the cloud-hosted options though, and the best JSON handling. Interesting billing model based on running time that's a good fit for focused data querying sessions using BI tools where you can scale up while in use and turn off after.

Tables are stored as S3 data, making sharing, cloning, and snapshot queries easy. Major missing features are streaming (although they have an auto-loading option from S3 files) and stored procedures. Transactions are also tricky, and client drivers are sometimes neglected because everything is a wrapper around their HTTP/JSON interface.

dikei · on Oct 1, 2018

I wouldn't dismiss ClickHouse so quickly. There's no one-size-fits-all solution for data warehousing; everything has its own quirks.

mamcx · on Oct 1, 2018

> The basic techniques for implementing a fast column-store

Pointers to what they are?

manigandham · on Oct 1, 2018

Recent post: https://news.ycombinator.com/item?id=18076547

PeterZaitsev · on Oct 1, 2018

ClickHouse indeed does not "separate compute from storage", but that is an architectural decision, not a feature gap. Running ClickHouse with directly attached storage and built-in replication can be super fast and cost-efficient. It works best for stable workloads.

tuananh · on Oct 2, 2018

Cloudflare is using ClickHouse. That does say something.

https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

dorfsmay · on Oct 1, 2018

Can it run on a cluster? Or single server only?

Can you update specific rows? How fast are updates?

How does it compare to MonetDB?

sin7 · on Oct 1, 2018

You can run it on a cluster or a single server. It's pretty easy to set up either way.

No updates. Fast inserts.

You can only join two tables at a time, but the joins can be chained to deal with this limitation.
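The chaining idea above can be sketched in plain Python: given a join that only handles two inputs, fold it over a list of tables so that each result feeds the next join. This is an illustration of the idea only, not ClickHouse syntax; `join_two`, `join_chain`, and the sample tables are hypothetical.

```python
from functools import reduce


def join_two(left, right, key):
    """Inner-join two lists of dict rows on `key` using a simple hash index."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def join_chain(tables, key):
    """Join any number of tables by applying the two-table join repeatedly."""
    return reduce(lambda acc, table: join_two(acc, table, key), tables)


# Hypothetical sample tables sharing an "id" column.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
visits = [{"id": 1, "hits": 10}, {"id": 2, "hits": 3}]
regions = [{"id": 1, "region": "eu"}]

joined = join_chain([users, visits, regions], "id")
print(joined)
```

In SQL terms, each intermediate result would play the role of a subquery that becomes the left side of the next two-table join.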

I tried Monet. It wasn't very stable for me. I didn't stick with it long enough to judge it. ClickHouse has the backing of Yandex. I think that makes a huge difference.

I have used ClickHouse for the past year. I've thrown 3,000-column by 120-million-row tables at it. It worked where PostgreSQL came to a halt. Different use cases really.

It fits my use case perfectly: large amounts of data with no updates and tons of aggregations. It's lightning fast.

manigandham · on Oct 1, 2018

It does support updates and deletes now, but they are still limited.

eldargab · on Oct 1, 2018

MonetDB is a sort of drop-in replacement for a regular database, with all the expected features and good compatibility.

On the other hand, ClickHouse will be incompatible with most existing tools, and it's better to learn its limitations and workaround techniques well in advance. But once you dump substantial amounts of time series data into it, you'll find it 10+ times faster and 2-3 times smaller than Monet.

bicubic · on Oct 1, 2018

Is there a way to handle scenarios where the data does update? Like backfill of latent events within a 1-2 day window.

sin7 · on Oct 3, 2018

You can copy the stable data to another table, then insert the updated data. The database has excellent compression and copying from one table to another is super fast.
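A minimal sketch of that copy-then-reinsert approach, with Python lists standing in for tables. The `day` column, the cutoff date, and `rebuild_with_backfill` are assumptions made for illustration; in a real setup the copy would be done by the database itself.

```python
from datetime import date


def rebuild_with_backfill(rows, corrected_rows, cutoff):
    """Build a new table: copy rows older than `cutoff` unchanged,
    then append the corrected rows for the window on or after `cutoff`."""
    stable = [r for r in rows if r["day"] < cutoff]
    return stable + list(corrected_rows)


# Hypothetical existing table; the latest day is still receiving late events.
events = [
    {"day": date(2018, 9, 28), "event": "a", "count": 5},
    {"day": date(2018, 9, 30), "event": "a", "count": 1},
]

# Recomputed rows for the 1-2 day window that changed.
corrected = [{"day": date(2018, 9, 30), "event": "a", "count": 4}]

new_events = rebuild_with_backfill(events, corrected, cutoff=date(2018, 9, 29))
print(new_events)
```

The old table is never modified in place, so this works without row-level updates; the cost is one bulk copy of the stable rows.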