Understanding database indexes in PostgreSQL (mastermind.dev)
288 points by mrpotato on May 17, 2023 | 18 comments



That was more detailed than I expected; a lot of posts on this topic tend to be more superficial. I suspect the BRIN index might need a bit of a stronger disclaimer: as far as I understand you really want to use that only for ordered data, and in those cases it is exceptionally good at its job, but it is a lot worse if that condition is not met. The post mentions this a bit, but with very soft language.

I disagree a bit with "Don’t index columns if the table has little data", mostly because it doesn't matter in those cases. If the table is tiny the index is also very cheap (unless it's something really weird like a tiny table that is written at a very high frequency). And "little data" is just not specific enough for people to make decisions unless they already have a very good intuition on when the query planner would use such an index.

A rather important part that isn't mentioned about multi-column indexes is which kinds of queries can use them. That is probably not obvious if you have never read about them in detail, but it's really important to know when defining them.
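For B-tree indexes, the short version is that a query has to constrain a leading prefix of the index columns to use the index efficiently. A rough illustration, with made-up table and column names:

    CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

    -- Can use the index:
    SELECT * FROM orders WHERE customer_id = 42;
    SELECT * FROM orders WHERE customer_id = 42 AND order_date = '2023-05-17';

    -- Generally cannot use it efficiently (leading column is not constrained):
    SELECT * FROM orders WHERE order_date = '2023-05-17';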


> I suspect the BRIN index might need a bit of a stronger disclaimer, as far as I understand you really want to use that only for ordered data,

BRIN is generally useful for datasets that have high local correlation. Ordered datasets have that, but it is not unique per se to ordered datasets. The summary type (operator class) you specify when creating the index is what defines which kind of correlation you need:

Minmax (the default for most indexable types in BRIN) needs the values within a page range to be closer to each other than to the values in other ranges (a sorted table has this, but if you moved the ranges around, it would still hold). In the future, this may even be used to support top-k sorts.

Minmax-multi has requirements similar to minmax, but has the ability to absorb some outliers in the ranges without immediately losing precision.

Bloom works well for equality checks and benefits most when only a few of the valid values are stored in each page range.

For instance, using the bloom operator class, you can use BRIN to exclude large ranges of the table at a fraction of the cost of scanning those ranges manually: it can quickly determine whether the tuples in a table range might contain results with both date Y and user X, while storing only O(size of table) summary data instead of the O(number of tuples) usually required for indexes.
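For example (PostgreSQL 14+ for the multi and bloom variants; the table and column names here are made up, but the operator class names are real):

    -- Default minmax summaries:
    CREATE INDEX events_created_brin ON events
        USING brin (created_at timestamptz_minmax_ops);

    -- Minmax-multi, tolerates some outliers per range:
    CREATE INDEX events_created_brin_multi ON events
        USING brin (created_at timestamptz_minmax_multi_ops);

    -- Bloom, for equality checks (assuming user_id is a bigint column):
    CREATE INDEX events_user_brin_bloom ON events
        USING brin (user_id int8_bloom_ops);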


I use multi-column indexes for things like, “find me the most recent version of this template for this customer”.

I really wish pg had a way to do partial indexes with limits so I could create a partial index that stores, for example, only the most recent version of something (I find this comes up a lot).


Ah, I've had this same use case: I store many versions of a resource from an external API, but I typically only want the most recent version. I've used the following techniques:

Select the ID with the max sync_token. Easiest for Postgres to optimize, assuming you have a primary key of (id, sync_token).

    SELECT *
    FROM external_api
    WHERE id = :my_id 
      AND sync_token = (SELECT max(sync_token) FROM external_api WHERE id = :my_id)
Define a view using DISTINCT ON. Convenient for ad-hoc querying. Postgres usually figures out it can use the primary key to avoid a full-table scan.

    SELECT DISTINCT ON (id) *
    FROM external_api
    WHERE id = :my_id 
    ORDER BY id, sync_token DESC
For tricky predicates, I use a trigger to track the most recent resource in a separate table. This is a hacky version of incremental view maintenance. [1]

[1]: https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...
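A minimal sketch of that trigger approach, assuming the external_api schema from above (insert-only; deletes would need extra handling):

    CREATE TABLE external_api_latest (
        id int PRIMARY KEY,
        sync_token int NOT NULL
    );

    CREATE FUNCTION track_latest() RETURNS trigger AS $$
    BEGIN
        INSERT INTO external_api_latest (id, sync_token)
        VALUES (NEW.id, NEW.sync_token)
        ON CONFLICT (id) DO UPDATE
            SET sync_token = EXCLUDED.sync_token
            WHERE external_api_latest.sync_token < EXCLUDED.sync_token;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER external_api_latest_trg
        AFTER INSERT ON external_api
        FOR EACH ROW EXECUTE FUNCTION track_latest();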


Yup exactly. I also keep an eye on the progress of incremental views because there are a lot of use cases. Where possible I avoid triggers, but I’ve also used that solution too.

I’ve found that Postgres often gives the best results if you lateral join to the version table. That way it starts with the primary rows you need and it just hammers the index once for each version row.
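Roughly like this (table and column names made up; assumes an index on (customer_id, version)):

    SELECT c.id, t.*
    FROM customers c
    CROSS JOIN LATERAL (
        SELECT *
        FROM template_versions tv
        WHERE tv.customer_id = c.id
        ORDER BY tv.version DESC
        LIMIT 1
    ) t;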


> Postgres often gives the best results if you lateral join to the version table

Agreed.


  > SELECT *
  > FROM external_api
  > WHERE id = :my_id 
  >   AND sync_token = (SELECT max(sync_token) FROM external_api WHERE id = :my_id)
Does this have an advantage over

  SELECT * FROM external_api
    WHERE id = :my_id
    ORDER BY sync_token DESC
    LIMIT 1
? Assuming the index is on (id, sync_token).


For my specific setup and a single row lookup, the ORDER BY ... LIMIT 1 is faster (0.1 ms vs. 1.2 ms).

The ORDER BY ... LIMIT 1 query is effectively the same as the DISTINCT ON query, but DISTINCT ON can return more than one resource.


Interesting, how did you achieve that with other databases? An enhancement opportunity for PostgreSQL?

For a similar-sounding problem, I used a view with row_number() OVER (PARTITION BY ... ORDER BY ... DESC), so that row number 1 was always the most recent row within the partition columns and older rows were 2, 3, etc. I could then query for row number = 1 to get the most recent row (or 2 for the 2nd, etc.). For the most recent only, I had a view with that row number = 1 condition built in, and I used that most frequently to access the data.

To get an index, the view could be materialized, but then it needs to be refreshed, and my experience was that this had more overhead than just using the regular view.
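A sketch of what that view looks like, reusing the made-up external_api schema from upthread:

    CREATE VIEW external_api_ranked AS
    SELECT *,
           row_number() OVER (PARTITION BY id
                              ORDER BY sync_token DESC) AS rn
    FROM external_api;

    -- Most recent version per id:
    SELECT * FROM external_api_ranked WHERE rn = 1;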


I haven't used them myself, but Postgres supports partitioning. If you have large amounts of data, you could partition by a date range.

https://www.postgresql.org/docs/current/ddl-partitioning.htm...
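A minimal sketch of declarative range partitioning by date (names made up):

    CREATE TABLE events (
        id bigint,
        created_at date NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2023 PARTITION OF events
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');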


> for example, only the most recent version of something

Inherently cost prohibitive. Maintaining the index after a delete is an O(n) operation.

If you would like to take on some of that burden yourself, you can add a `latest` bool flag and make a partial index on it.
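Roughly (assuming the versioned-table setup from upthread; keeping the flag correct on insert/delete is the part you take on yourself):

    ALTER TABLE external_api
        ADD COLUMN latest boolean NOT NULL DEFAULT false;

    -- At most one latest row per id, and fast lookups of it:
    CREATE UNIQUE INDEX external_api_latest_idx
        ON external_api (id)
        WHERE latest;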


Also, the order of columns matters in multi-column indices (although some index types can technically use any column), so the first column in the list should be something like a partition key, or something that slices down the data a lot.


And this is even true for some more exotic indexes where you might not expect it, like exclusion constraints with GiST. It's much more expensive to maintain with a range as the first column and a highly selective id as the second column than the other way around (you want the highly selective column first).
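For example, a GiST exclusion constraint with the selective column first (needs the btree_gist extension for equality on a plain column; names made up):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE bookings (
        room_id int,
        during  tstzrange,
        EXCLUDE USING gist (room_id WITH =, during WITH &&)
    );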


Perhaps PG is different, but for the DB we use, the ordering usually wins out over raw selectivity. Meaning it'll prefer a multi-column index that matches the ORDER BY even if it has a fair bit less selectivity.

Due to this we sometimes have to add multiple indexes with different permutations of the same columns.


I've seen a lot of performance gained from indexing small tables when services don't cache correctly and are constantly going into them to get a record. Just be mindful of the write slowdowns it can cause.


Wow, this is an excellent article.

Two really nice nuggets: how to detect unused and bloated indices.


For NULLable columns, does it generally make sense to make a partial index with "WHERE column IS NOT NULL"?
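That is, something like this (names made up):

    CREATE INDEX users_referrer_idx ON users (referrer_id)
        WHERE referrer_id IS NOT NULL;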


Wow!! This is just an awesome article about indexing.



