PostgreSQL B-Tree index deduplication (rustprooflabs.com)
186 points by petergeoghegan on Sept 7, 2020 | 25 comments



This is a great change, but I wonder, will it apply retroactively to indices from old databases, or will the indices have to be rebuilt? Like, I understand that just upgrading the version won't reduce the sizes, but will reading and writing to the indices gradually reduce the size?


You will need to REINDEX after upgrading using pg_upgrade. Otherwise deduplication won't be applied in the old indexes at all.

These days it's possible to use the CONCURRENTLY option with REINDEX, which is much less disruptive.
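For example (a rough sketch; the index and table names here are placeholders, not from the article):

    -- Rebuild one index without blocking writes (PostgreSQL 12+):
    REINDEX INDEX CONCURRENTLY idx_orders_status;

    -- Or rebuild every index on a table:
    REINDEX TABLE CONCURRENTLY orders;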


In the second paragraph, the author links a couple of other articles that drill down some more. I almost missed those myself.


Cool, I’d love to know how this affects query performance, rather than just index size?


Like SQL Server's row-level and page-level compression, you get more rows per page, so you need to read fewer pages to process the query. From cold, this means less disk or network IO loading the pages from storage into RAM. It also means you can fit more into RAM (and potentially CPU cache when spinning over small enough chunks of data), so you can do the same with less RAM, or reduce IO and be faster with the same RAM allocation (as pages will not need to be evicted as often to make room for others).

The trade-off is a bit of CPU time every time the page is referenced. Not just when it is updated or read from storage, but potentially every time it is accessed, even just for reading, because the pages held in RAM are in the same format as on disk, so the decompression needs to happen on read as well (unless postgres keeps a cache of pre-decompressed pages to help with this, but my gut suggests that would not be worth the complication; where it would help significantly you would just not use this technique anyway).

In a great many scenarios, I would say most, this trade-off is worth it because of the relative cost of IO and CPU time needed to save some of it.

It might be significantly detrimental to performance in situations where enough of your data fits into RAM anyway and you are performing CPU-heavy analysis upon it.

Side benefits include reducing the size of your backups and reducing the time needed to take them, so there are other potential infrastructure and administration savings to be had too.


There is no overhead for reads, since deduplication doesn't compress the data as such. Rather, deduplication creates a representation that removes duplicates but is otherwise identical to the existing representation.


It's hard to give a simple answer, but it's certainly possible for it to help performance significantly. Both for queries, and garbage collection by VACUUM. It's especially helpful with writes + low cardinality indexes.
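For illustration, deduplication is controlled per index by the deduplicate_items storage parameter in 13, and it's on by default; the table and column names below are made up:

    -- Low-cardinality column, the classic case where deduplication pays off:
    CREATE INDEX idx_events_status ON events (status);

    -- It can be turned off per index if you ever need to:
    CREATE INDEX idx_events_status_nodedup ON events (status)
        WITH (deduplicate_items = off);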


Roughly speaking: smaller index size = greater performance.

The index also has to be read from disk (if not already in memory), which is faster the smaller it is. But also, once in memory, smaller size makes key-lookups or traversals faster.

How much faster? It really depends on your index.
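One quick way to see what you're working with (names are placeholders; pg_relation_size, pg_indexes_size and pg_size_pretty are the standard size functions):

    -- Check the on-disk size of a specific index:
    SELECT pg_size_pretty(pg_relation_size('idx_orders_status'));

    -- Or all indexes on a table at once:
    SELECT pg_size_pretty(pg_indexes_size('orders'));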


B-Tree deduplication can also help to control duplicate index tuples caused by MVCC "version churn" -- including in unique indexes. See the paragraph that begins with "B-Tree indexes are not directly aware that under MVCC..." here: https://www.postgresql.org/docs/devel/btree-implementation.h...


If it helps either make tables small enough for RAM, or what used to be RAM small enough for cache, then it could help a LOT.


Postgres has a write amplification problem - when you update or insert a row, it has to check all the indexes to see what was touched (unlike MySQL.) Uber made this famous in a blog post where they migrated off postgres because of it.

So with Postgres I try to keep the number of indexes per table down to around 5, but with MySQL InnoDB, 20 indexes is not a problem for many use cases.

The other little-known Postgres problem is that by default, installing it sets up replication that writes a local file in a busy manner. This can cause problems with fsck or on low-end SSD/flash storage resulting in corruption that causes postgres restarts to fail hard, so disable that.


I've no idea what the part about replication to a local file is. Postgres doesn't set up any replication by default.

And if you're getting corruption in your database on low-end storage, that storage is simply broken. The database can't do anything if your storage system lies to it, or writes different data than what you give it. It is possible to configure Postgres in a way that leads to data corruption when e.g. the power goes out, but that is an option with a big warning that you should never use outside some specific circumstances. The Postgres defaults don't corrupt your database if your hardware works correctly (absent some pretty rare bugs in Postgres itself).
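For anyone wondering, the setting being alluded to is presumably fsync (my guess, not something stated above); turning it off is the classic documented way to trade crash safety for speed. A minimal postgresql.conf sketch:

    # Safe defaults, assuming the storage doesn't lie about completed writes:
    fsync = on
    full_page_writes = on

    # The "big warning" option: only for data you can afford to lose
    # fsync = off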


I’m interested to find out why the community feel this answer is incorrect. I don’t know enough about database internals to know! Would be good if someone could help me to become more informed about this. I can’t personally see a way that indexes wouldn’t be touched when inserts happen in MySQL but who knows?


Previously on HN:

Why Uber Engineering Switched from Postgres to MySQL (2016): https://news.ycombinator.com/item?id=17280239

A PostgreSQL response to Uber [pdf]: https://news.ycombinator.com/item?id=14222721

One important point is that Uber wrote this running on Postgres 9.2. The latest Postgres release is now 12.4


> One important point is that Uber wrote this running on Postgres 9.2. The latest Postgres release is now 12.4

Just a minor nitpick: Postgres changed the version numbering scheme with version 10. So 9.4 was a major release, but 12.4 marks a minor (bugfix) release.

If you want to name minor releases it would be e.g. 9.4.26 vs. 12.4. If you want to compare major versions it would be 9.4 vs. 12.


Basically the first number didn't really have any meaning, so it was removed. The major versions (which I believe are released once a year) are:

9.2 (released in 2012)

9.3

9.4

9.5

9.6

10

11

12

13 (about to be released this year?)

That's still a lot of versions in between; each of them adds new functionality and requires converting the database to upgrade.
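For reference, the conversion step is typically pg_upgrade; the version numbers and paths below are placeholders:

    pg_upgrade \
      --old-bindir=/usr/lib/postgresql/9.6/bin \
      --new-bindir=/usr/lib/postgresql/13/bin \
      --old-datadir=/var/lib/postgresql/9.6/main \
      --new-datadir=/var/lib/postgresql/13/main \
      --link

(And, per the comment further up, a REINDEX afterwards if you want the new on-disk index format.)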


Thanks, I just always trust Postgres more from my experiences. Reading some of the above I’m fairly unconvinced by Uber’s reasoning...


I'd say some of the limitations hit by Uber were real, and were a bit of a PITA even before that. Some were improved significantly since then, others still need to be considered when designing application accessing PostgreSQL.

I'd say it was a nice example of what happens when you design an application based on how you imagine an ideal database would work, not realizing that there's a lot of complexity and there are limitations. And then when you eventually have to face reality, you blame the database for having to make trade-offs and refuse to mitigate the trouble by changing the application.


All jobs that I had where MySQL was used had some kind of issues. The first time, there was a database that one day just died and refused to start. When investigating, it turned out that some tables got corrupted (fortunately, the scan tool that comes with MySQL was able to fix it, but I have also encountered databases which were FUBARed). After checking the machine, it had over 100 days of uptime (so it wasn't an abrupt shutdown), and the disk also had plenty of free space, so the corruption didn't happen because of the disk being full. So MySQL just broke itself out of nowhere.

Another time (I think it was MariaDB) the database got corrupted every few days. First we thought it was hardware, so we replaced disks, then the entire machine. But it continued to happen. Eventually it turned out it was a bug: with a specific pattern written to the disk, when MariaDB read that block back it incorrectly thought the block was encrypted and discarded it. Updating to a version that fixed it resolved the issue, but that's a really nasty bug that caused data corruption.

Postgresql on the other hand, in my experience, was always nice even in bad situations. When we had a network issue (due to a failed switch), once the problem was fixed the database reconnected and resumed replication; MySQL in a similar situation required being told to reconnect, and sometimes even setting up replication again (which requires restoring a backup on the slave and reconfiguring replication again, which took a few hours).

In another instance we had the main postgresql instance set up with replication to a DR site. The database data was stored on a filesystem provided by NetApp. A consultant had a task to deprecate filesystems that were no longer used, so he actually put them offline for 30 days before deleting them. Around 20 days in, we noticed that the master was running out of space. Turns out the space was being used by logs that hadn't yet been shipped to the standby. Once the filesystem at the DR site was put back online, everything resolved all by itself.


I haven't down-voted the post, but I'd say it's a bit misleading.

Firstly, it's not quite true that "Uber migrated away because of write amplification", because they named multiple other reasons contributing to that decision. (And yes, not all of those reasons were quite valid, but that's a different discussion.)

Secondly, it's patently false that Postgres sets up local replication writing into a local file. That's plain bogus.


To add to others, writing a file over and over doesn't do anything on pretty much any solid state storage device you would use for a database. You would have to look at something like a bargain basement SD card to find something that doesn't have wear leveling.


With deduplication enabled, how does it differ from a GIN index now?


GIN is optimized for things like full text search and jsonb indexing. GIN's internal B-Tree/entry tree consists of datums that are decomposed from the data type you're actually indexing rather than the data type itself (e.g. when indexing a column that's an array of ints, the entry tree is actually plain integers).

You can now use B-Tree to get something that's roughly comparable to GIN + contrib/btree_gin, though only when indexing a single column -- multi-column indexing in GIN stores each datum from the same row in totally different parts of the tree. This is totally different from B-Tree, which stores all columns that are indexed for the same row together.
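A rough sketch of that comparison, with made-up table/column names (btree_gin is the contrib module mentioned above):

    -- Plain B-Tree index on a single integer column:
    CREATE INDEX tags_tag_btree ON tags USING btree (tag_id);

    -- Roughly comparable GIN index on the same column:
    CREATE EXTENSION IF NOT EXISTS btree_gin;
    CREATE INDEX tags_tag_gin ON tags USING gin (tag_id);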

B-Tree deduplication is only capable of making the index about 3x smaller (though larger reductions are possible when indexing large ), whereas GIN can sometimes manage about 10x because it also compresses TID lists. This on-disk size difference is only apparent when there are a huge number of duplicates. GIN now has a higher on-disk overhead when there are only 3 or 4 duplicates per item.

In summary, B-Tree is a performance all-rounder that has good, consistent read and write performance. GIN is more optimized for writes, and for specialized applications like FTS.


> The big difference is that duplicate index entries can still occur in B-tree indexes. GIN indexes have to integrate new entries into the posting list every time, which makes data modifications less efficient (this is mitigated by the pending list). In contrast, PostgreSQL deduplicates B-tree entries only when it would otherwise have to split the index page. So it has to do the extra work only occasionally, and only when it would have to do extra work anyway. This extra work is balanced by the reduced need for page splits and the resulting smaller index size.

https://www.cybertec-postgresql.com/en/b-tree-index-deduplic...


Interesting question.

Considering how different the gin and btree indexes are, I'd say it's very different. GIN essentially builds a posting list for each value in the column, and compresses that (which might be seen as a kind of a bitmap index). OTOH btree deduplication simply compares index keys that end up on the same leaf page. So there's no "global" index compression, it's much more localized.

But I suppose Peter and Anastasia have more thoughts on this ...



