Capacitor, BigQuery’s next-generation columnar storage format (cloud.google.com)
133 points by fhoffa on April 26, 2016 | 14 comments



This simple idea has helped me a lot on past projects.

The performance gain from columnar storage comes from the compression ratio. By ordering similar attributes together, you greatly reduce the entropy between rows, which in turn leads to higher compression and better performance.

The trick is to wisely choose which column you sort all the rows on.

At my previous company, we used Redshift to store 1 billion rows, and the simple change of sorting the table by user_id reduced the whole table size by 50%, that is half a TB of disk storage. The improvement was nothing short of jaw-dropping. I think Google here just turns this trick into a more systematic method, which is really neat.

The point is: in a columnar storage system, take ordering into account. Try an ordering that you feel could maximize the redundancy between rows; usually it is the primary id that is most representative of the underlying data. You don't need a fancy system like this one to leverage this powerful idea; it applies to all columnar systems.
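A toy sketch of the effect in plain Python with zlib (nothing to do with Redshift or Capacitor internals, just the entropy argument above):

    import random
    import zlib

    # A column of user_ids drawn from a small domain, stored in row order vs. sorted.
    user_ids = [random.randint(0, 999) for _ in range(100_000)]

    unsorted_bytes = b"".join(uid.to_bytes(4, "little") for uid in user_ids)
    sorted_bytes = b"".join(uid.to_bytes(4, "little") for uid in sorted(user_ids))

    print("unsorted:", len(zlib.compress(unsorted_bytes)))
    print("sorted:  ", len(zlib.compress(sorted_bytes)))  # much smaller: long runs of equal values

Same values, same count; only the ordering changed, and the compressor does the rest.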


Sorting, especially if your data has some kind of natural sort order, can increase compression (in columnar stores), decrease index size (bitmap or fancy bitmap indexes), or lower query execution time (clustered vs. random reads).

Here's one paper: Sorting improves word-aligned bitmap indexes, http://arxiv.org/abs/0901.3751
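A small illustration of the bitmap point (a hypothetical toy run counter in Python, not the word-aligned scheme from the paper): once the data is sorted, the bitmap for any given value collapses into a handful of runs, so a run-length-style index needs far fewer entries.

    import random
    from itertools import groupby

    values = [random.choice("abcde") for _ in range(100_000)]

    def run_count(bitmap):
        # Number of runs a run-length encoder would have to store.
        return sum(1 for _ in groupby(bitmap))

    bitmap_unsorted = [v == "c" for v in values]
    bitmap_sorted = [v == "c" for v in sorted(values)]

    print("runs, unsorted:", run_count(bitmap_unsorted))  # tens of thousands
    print("runs, sorted:  ", run_count(bitmap_sorted))    # at most 3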


> doesn’t end here. BigQuery has background processes that constantly look at all the stored data and check if it can be optimized even further. Perhaps initially data was loaded in small chunks, and without seeing all the data, some decisions were not globally optimal. Or perhaps some parameters of the system have changed, and there are new opportunities for storage restructuring. Or perhaps, Capacitor models got more trained and tuned, and it is possible to enhance existing data. Whatever the case might be, when the system detects an opportunity to improve storage, it kickstarts data conversion tasks. These tasks do not compete with queries for resources, they run completely in parallel, and don’t degrade query performance. Once the new, optimized storage is complete, it atomically replaces old storage data — without interfering with running queries. Old data will be garbage-collected later.

I wonder if they could share more details on how this is handled.


There are a few ways to do it. It is not that difficult in principle: you need to collect selectivity statistics for both writes and queries, and have a storage engine that is flexible enough to rewrite layouts on the fly. The mechanics are pretty simple, since rewriting a shard can be viewed as a trivial subset of splitting or replicating a shard under load. Some closed-source databases also do this to one extent or another, adapting layout to load.

You can do adaptive layout rewriting at either the page or shard level, depending on the design. There are advantages and disadvantages to both models. Some designs can do layout conversion in place without the need for garbage collection but it is much trickier to do correctly.
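A minimal sketch of the "build a new layout, swap it in atomically, garbage-collect the old one" pattern the article describes (hypothetical names, not BigQuery's actual machinery):

    import threading

    class Shard:
        """Holds a reference to the current (immutable) storage layout."""

        def __init__(self, storage):
            self._storage = storage
            self._swap_lock = threading.Lock()  # serializes rewriters, never readers

        def read(self, query):
            # Readers use whatever layout is current; a swap mid-query is safe
            # because the old layout object stays alive until they release it.
            return self._storage.scan(query)

        def background_reoptimize(self, build_better_layout):
            # Off the query path: build an improved layout from the current one,
            # then atomically replace the reference. The old layout is collected
            # once the last in-flight reader drops it.
            new_storage = build_better_layout(self._storage)
            with self._swap_lock:
                self._storage = new_storage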


Slightly off-topic, but it's great to see that Mosha of MDX and SQL Server Analysis Services fame (data warehousing from MS) is now part of Google's BigQuery. His thorough blog posts full of technical details are a blessing even after so many years.


I'm surprised they didn't give a justification as to why they couldn't just adopt Parquet[0].

[0] https://parquet.apache.org/


It is clearly stated in the article:

  When Google published the Dremel paper in 2010, it explained how this structure is preserved within column store.
  ...
  The definition and repetition levels encoding is so efficient for semistructured data that other open source columnar formats, such as Parquet, also adopted this technique.
Parquet is an open-source reimplementation of the columnar storage format described in Google's 2010 Dremel paper. Capacitor is Google's next-generation columnar storage format.
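To make the quoted technique a bit more concrete, here is a toy encoder for the simplest possible case, a single top-level repeated field (real encoders such as Parquet's handle arbitrarily nested schemas):

    # Dremel-style repetition/definition levels for one repeated string field `tags`.
    def encode_tags_column(records):
        values, rep_levels, def_levels = [], [], []
        for record in records:
            tags = record.get("tags") or []
            if not tags:
                # Field absent: emit a placeholder so the record boundary is preserved.
                values.append(None); rep_levels.append(0); def_levels.append(0)
                continue
            for i, tag in enumerate(tags):
                values.append(tag)
                rep_levels.append(0 if i == 0 else 1)  # 0 = starts a new record, 1 = repeats within `tags`
                def_levels.append(1)                   # the repeated field is defined
        return values, rep_levels, def_levels

    records = [{"tags": ["a", "b"]}, {}, {"tags": ["c"]}]
    print(encode_tags_column(records))
    # (['a', 'b', None, 'c'], [0, 1, 0, 0], [1, 1, 0, 1])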


That answer isn't very clear to me, but thanks for pulling the relevant bits.


BigQuery is faster than anything else I've seen.

Why switch to what everyone else is using then?

(See http://tech.marksblogg.com/billion-nyc-taxi-rides-bigquery.h... vs all other benchmarks for the same dataset Mark got)

Disclaimer: I'm Felipe Hoffa and I work for Google (https://twitter.com/felipehoffa). But you can try BigQuery in the next 5 minutes and check the speed claims :).


Thanks for the reply :) I'll have to play around with BigQuery and see for myself!

> BigQuery is faster than anything else I've seen.

> Why switch to what everyone else is using then?

This really depends on how much faster. For a marginal drop in performance, many would think it's worthwhile to stick with an established format. That said, I'm willing to believe the performance delta for BigQuery is worth it :)


Parquet is the open source implementation of BigQuery's first-generation columnar storage format. It is based on the record shredding and assembly techniques described in the Dremel paper (Dremel is Google's internal name for BigQuery). What they are talking about here is the next-generation (second-generation) storage format.


Not sure why all the downvotes (if OP is asking why they don't support Parquet as an input format). If Google supports Avro, why not Parquet? JSON isn't exactly efficient, if efficiency is the point.

We dump the output of spark jobs into BQ for exploration and having to produce JSON in addition to parquet is an irritating (and expensive) overhead.
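For context, the double write looks roughly like this in PySpark (bucket paths and the source table are placeholders): the canonical output stays in Parquet, plus a second, larger newline-delimited JSON copy produced only so BigQuery can load it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("export-for-bq").getOrCreate()
    df = spark.read.parquet("gs://bucket/warehouse/events/")

    # Canonical output for everything else downstream.
    df.write.mode("overwrite").parquet("gs://bucket/output/events-parquet/")

    # Extra pass purely for BigQuery ingestion
    # (bq load --source_format=NEWLINE_DELIMITED_JSON).
    df.write.mode("overwrite").json("gs://bucket/output/events-json-for-bq/")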


The post is about the [adaptive] format BQ uses internally for fast queries. Your issue is with getting BQ to ingest data in Parquet format. BQ should support Parquet ingestion, but even then it would rewrite the data on the fly into whatever internal format gives faster query performance.


I agree.



