Speeding Up PostgreSQL with Partial Indexes

infogulch · on Dec 10, 2014

What about one multiple-column index? Using three separate indexes on three columns is almost never ideal so I don't know why it was considered in the first place (unless to make a contrived "normal" to compare partial indexes against).

My guess would be: nearly the same read performance as the partial index, a bit less than half the storage of three separate indexes. The write performance isn't as "free" as the partial index, but it's a third of three separate indexes and still allows for many more types of queries than the partial index.

drob · on Dec 10, 2014

Separate single-column indexes can be reused more readily. In this scenario, you might have a few different types of "high value" events, and you'd want your indexes to be applicable to more than one of them.

A multi-column index on (a, b, c) can't be used for a query that filters on b and c but not a. More detailed documentation here: http://www.postgresql.org/docs/9.3/static/indexes-multicolum...

infogulch · on Dec 10, 2014

If you have no clue whatsoever about how your table could ever possibly be queried then sure, three separate indexes would probably be the best way. But most tables are not queried on all possible combinations of columns, but instead have regular query patterns.

Adding N 1-column indexes for N columns on every table is at best is a lazy cop-out for actual data and query analysis.

Edit: to address your example specifically, you'd probably be better off having two indexes, one on (a, b, c) and another (b, c) instead of three separate indexes if that covers your query patterns.

drob · on Dec 10, 2014

This is a scenario in which you do know the query set ahead of time, and there's still a considerable performance benefit to using partial indexes over any combination of multi-column indexes. The main insight is that the "high value" events are a tiny slice of the table -- about 0.05% of rows for each type of event.

Let's say you care about ten different types of "high value" events, which reference a total of six columns. I'll assume we can cover them with three multi-column indexes, although it wouldn't be hard to cook up a realistic scenario in which you'd need more. That means, for each INSERT/UPDATE, you need to write to three different indexes.

Given that the event definitions are selective, a single partial index requires a write for about 0.05% of INSERT/UPDATEs. Ten of these will cumulatively require one write on ~0.5% of inserts -- a 600x improvement over the conservative estimate above. That is, the cost of of maintaining the set of multi-column indexes should be much, much higher than the cost of maintaining the set of partial indexes.

As I mentioned in the article, the partial index approach also allows a more flexible set of predicates. What if you want to index for rows with a field that matches a fixed regex?

Edit: Apparently I can't respond to your response -- does HN have a chain length limit?

Yes -- for indexing the single event definition used for the profiling here, a multi-column index would absolutely be preferable to three single-column indexes. I didn't think to include the option because I assumed it wouldn't scale to a case in which you have ten different event types that use an overlapping but not identical set of fields. Definitely would have been a good idea to include a comment to this effect in the post.

infogulch · on Dec 10, 2014

@Your edit. (I don't think there's a nesting-limit, but a minimum delay before you can reply.)

That makes a bit more sense. I would challenge your phrases "realistic scenarios," "assumed it wouldn't scale," and "overlapping but not identical set of fields," but the real problem we're facing in this thread is all our scenarios are hypothetical and simplified. The fact is, indexing decisions are highly dependent on the exact scenario and since we have different contexts the best we could do is banter on about the "most accurate" hypothetical scenario, which is nonsense.

So I guess my conclusions would be "highly selective partial indexes could save you a bunch of write io" and "otherwise, It Depends™ because Indexes Are Hard®". In any case, your article is a decent introduction to partial indexes in general.

drob · on Dec 10, 2014

Yep, looks like a minimum delay. Interesting!

Agreed in general. No one-size-fits-all answers, and schematizing your data well is going to require case-by-case attention and experimentation for the foreseeable future.

drob · on Dec 13, 2014

I'm silly -- turns out I did mention this in the post. See footnote 3.

The multi-column index takes up 755 mb, vs 1026 mb for the set of three single-column indexes.

infogulch · on Dec 10, 2014

I didn't mean to argue against the viability of partial indexes in general, just that comparing them to a very sub-optimal set of full indexes for read performance was not representative of full indexes.

And if you reread my first comment, I fully acknowledge that the write performance of partial indexes would likely still be better than a single full index.

wiredfool · on Dec 10, 2014

But a multicolumn index on (b, c, a) would be effective for a query on (b, c), as well as on (a, b, c).

MSM · on Dec 10, 2014

but not for any combination that doesn't have B leading.

If you're always going to query those fields in that order, that's a perfect way to do things. If you'll be querying it a couple different ways then you're way better off with three separate one column indexes assuming the RDBMS can make use of index intersection.

jasonl99 · on Dec 10, 2014

How about a multiple column AND partial index? (that may have been what you are suggesting). Best of both worlds? I just tried it, and it does work (at least on 9.4). Here's an extremely contrived example with a rails 'users' table

create index test_index on users (name, email) where name > 'A' and email > 'B'

chdir · on Dec 11, 2014

In case of Django, the RunSQL command [1] in migrations makes it very easy to use partial indexes. There are a couple of examples on Stackoverflow. Definitely a low hanging fruit and worth those 10-20 minutes to setup.

[1] https://docs.djangoproject.com/en/dev/ref/migration-operatio...

dsugarman · on Dec 10, 2014

this is a really cool feature I did not know about. just curious, why do you use such crazy json in postgres? why not store the data with relational database standards?

ikawe · on Dec 10, 2014

I've found it handy for storing configuration data.

E.g. user preferences

    preferences = { 
      "notifications": {
        "email": true, 
        "sms": false 
      },
      "stay_signed_in": true,
      "editor_plugin": {
        "indent_level": 2,
        "autosave": false
      }
    }

It's structured, adhoc, and "less likely" to be queried against. Of course you could implement this in normalized tables, but it's a pain, and I don't see any real value in this case.

mcorrand · on Dec 10, 2014

DISCLAIMER: I'm not a strong dba, but I'll try my hand at an answer. Please let me know if I'm talking nonsense.

It would be useful if the data meets two requirements:

-you know that it won't be queried in a relational way, but

-you don't know how the schema might evolve (beyond the fields you use in your partial index, that is) and want to save yourself the hassle of frequent schema migrations.

calpaterson · on Dec 10, 2014

I've been on a project that used postgres' json type as it's main data storage. I wouldn't go down that route again, even now that jsonb is available. My thoughts:

It doesn't really make sense to say that a query won't be relational. Few result sets are ever relational - if they include a JOIN or really any grouping, sorting etc, they are a non-relational projection of your data. That's the whole point of a relational database - relational storage and ad hoc non-relational views of it.

If you don't need to migrate your data when you change the schema (ie: nothing has gone live yet), just truncate the entire database every time you change schema. No need to write migrations and you get to iterate towards a good relational schema.

If you do need to migrate your data when you change the schema, do not under any circumstances choose JSON types. Existing migration tools like Flyway, Alembic, etc only support migrations between relations and you will have to write your own, custom, half-baked and buggy schema migrations. Migrating even relational data is hard and migrating non-relational data is much harder. You won't have the support of postgres' fantastic transactional DDL so it will be hard to do migrations without downtime. Probably the only approach that will work is to migrate data as you touch it.

Even if you decide to go for JSON, make sure you understand the limitations of the JSON type in 9.3 and lower. Equality isn't defined on JSON (you have to coerce to text first) so lots of basic SQL features won't work - for example UNION or EXCEPT. 9.4 has JSONB which is a lot better.

laurencerowe · on Dec 11, 2014

I'm also using the Postgres JSON type in 9.3, but my experience is different, perhaps because I come from the ZODB world and so am not under any illusions that it is a panacea!

I've ended up with a very generalised schema, all content is in JSON. I've not had to change the Postgres schema for over a year and it's easy to upgrade schemas inline while fetching from the app. We did write our own migration tools, but they're only a couple of pages of code.

I don't understand your complaint about lack of transactional DDL when using JSON. It's still there, though probably not as important as most of your upgrades will only touch data, giving you transactional upgrades (should you desire them) in any database.

DB queries of JSON data are definitely slow in 9.3 as the values need to be parsed every query, but the lack of JSON equality has not been a problem. You simply use the double chevron text extraction operators rather than the the json extraction operators in your query, col->foo->>bar rather than col->foo->bar. From there you can cast to any type and join to other tables.

For us, the big advantage of using a generalized schema has been simple support for recording versions of data. With over 30 different content types doing this with a traditional relational schema would have resulted in maybe 90 tables. Our generic schema has six.

I'm really looking forward to 9.4 as it will make ad-hoc queries across JSONB really fast using GIN indexes. I'd love to see the option of creating check / foreign-key constraints on JSON data in the future. Unfortunately it looks like the proposal for this is stalled as the implementation turned out to be rather tricky. http://blog.2ndquadrant.com/postgresql-9-3-development-array...

mcorrand · on Dec 11, 2014

Thank you for clarifying the relational aspect. This is what I meant to say but I couldn't find a concise way to do so.

drob · on Dec 10, 2014

In this example, a relational schema would work fine, and might have been easier to read. We use HSTORE at heap (until JSONB ships!) because we're processing event blobs with thousands of different fields, of which most are irrelevant for most events.

The added flexibility benefit is also nice. The ability to add a new property without migrating a schema has saved a lot of work.

arthursilva · on Dec 10, 2014

type is probably an excellent candidate to get into the schema

losvedir · on Dec 10, 2014

This is a cool feature.

Unfortunately, when I looked into this for our codebase, I found that it doesn't have full support in Rails 3. You can make the index, via a migration, but rails won't include it properly in your `schema.rb` file. So restoring via schema (as opposed to running through all your migrations from scratch) or just trying to understand your database by looking at schema.rb won't give you a full picture.

However, looking at `add_index` in rails 4, it seems to support it!

edit: one thing I did try out was you can change your schema file from `schema.rb` to `schema.sql`, which supposedly fixes it, but I had some issues with that which I don't remember at the moment.

sanderjd · on Dec 10, 2014

There's a simple solution to this:

    config.active_record.schema_format = :sql

In my opinion, this should be the default. I'm really not sure what advantage the schema DSL in schema.rb has over the ability to express the full schema in its native tongue.

Edit: Just saw your edit. This[0] may have been the problem you had with using :sql as the schema format. The counter-intuitive workaround is to output ruby in production. It seems weird, but doesn't hurt anything.

[0]: http://stackoverflow.com/questions/17300341/migrate-not-work...

ikawe · on Dec 10, 2014

> I'm really not sure what advantage the schema DSL in schema.rb has over the ability to express the full schema in its native tongue.

I always end up using the sql format, because sooner or later I do something remotely non-trivial in postgres. But I'll try to answer your implicit question...

schema.rb works just fine for sufficiently simple schema, it's easier to make sense of in version control, and it's backend agnostic, so if someone wants to develop the app on sqlite - that's just fine.

Of course, all these things become a downside when your setup is sufficiently complex - but if you are not using any interesting database features it's fine. In particular, I would almost never recommend developing against a different DB than you are deploying on. Sooner or later you'll be bit by a behavioral difference.

andrewvc · on Dec 11, 2014

To add to this, the schema.rb or structure.sql files rails makes are never 'accurate'. By which I mean, they simply record the state of the database at the time of the last migration. If you are switching between branches, for instance, it will have columns from both branches.

We gitignore schema.rb because it is really useless. It's a performance optimization that causes more trouble than its worth.

mcorrand · on Dec 10, 2014

The problem of syncing the schema to the orm is a tough one to solve indeed. Good to see that rails will get it right.

Django with South has the same issue, so much so that South preventatively overrides the native manage.py syncdb command with its own migrate command which replays all the migrations.

pjungwir · on Dec 11, 2014

Here is a gem I wrote that uses a separate DSL to track indexes (including partial indexes), check constraints, and foreign keys. Rails 4 is starting to make this gem obsolete, but it still offers more than you get from vanilla Rails.

https://github.com/pjungwir/db_leftovers

drob · on Dec 10, 2014

I wonder if this has something to do with differential support from different DBs. I don't think MySQL has an equivalent feature, though I think SQLite does.