PostgreSQL Query Optimization (jinchengli.me)
135 points by craigkerstiens on April 9, 2016 | 12 comments



Maybe a very newbie question but I've always been wondering:

Since the query optimizer can't explore the whole tree of possibilities because of time constraints, would it be possible in theory (and if so, has any RDBMS implemented it) to do something like a prepared statement where you tell the optimizer "take your time and give me the best plan you can"? And even, let's be crazy, have it craft some JIT-ish code heavily optimized just for that one query.

I'm certainly missing a lot of things, but it would make a lot of sense for web services where you have a very limited set of "critical" queries that you may want your RDBMS to optimize as well as it can, given the current data, at service startup.
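
Something like this, say (the "planning effort" part is purely imaginary; only PREPARE/EXECUTE exist today, and the table name is made up):

    -- What exists today: plan once, reuse the plan.
    PREPARE hot_query (int) AS
        SELECT * FROM orders WHERE customer_id = $1;
    EXECUTE hot_query(42);

    -- What I'm imagining (not real syntax): an exhaustive-planning knob, e.g.
    -- PREPARE hot_query (int) WITH (planning = 'exhaustive') AS ...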


To do this, you would have to prepare a plan once and then use it for many queries. The problem with that is that you can't make use of query-specific information when making the plan, and that can be crucial to picking the right plan.

For example, if you have a query like 'select * from products where category = ?', then you basically just have to choose between a table scan and an index lookup. Which is best will depend on how selective the predicate is: if the category in question only covers one in a million products, an index lookup will probably be faster, but if it covers a quarter of them, then a table scan will probably be faster.

If you know from statistics that no category covers more than a small fraction of products, then sure, prepare an index lookup plan for all queries, and vice versa if you know that all categories cover a substantial fraction of all products. But if you have a spread, with some categories being big and some small, then you are best off leaving the selection of the plan to runtime.
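
You can see this by planning the same query with different literals (assuming an index on category; the category values here are made up, and the plans you get will depend on your data and statistics):

    EXPLAIN SELECT * FROM products WHERE category = 'antique sextants';  -- rare: index scan likely
    EXPLAIN SELECT * FROM products WHERE category = 'books';             -- huge: seq scan likely

    -- With a parameter, the planner has to guess or plan per execution:
    PREPARE by_cat (text) AS SELECT * FROM products WHERE category = $1;
    EXPLAIN EXECUTE by_cat('books');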

It's much the same situation as with ahead-of-time vs just-in-time compilation, but with a bigger penalty for picking the wrong plan, and a smaller ratio of time spent planning to time spent executing, both of which push towards deferring the decision.

That said, perhaps a good strategy would be to do what JIT compilers can do with dynamic deoptimisation: pick a plan that looks good, but guard it with assertions about the things which vary at runtime. So, with the example above, if your categories are mostly small, prepare a plan which uses an index, but guard it with a check that the category is in fact small. If a query comes along for a big category, ignore the prepared plan and make a new one from scratch. Then track statistics about the hit rates on the guard conditions, and if it looks like different assumptions would lead to a more useful plan, switch to those. You could even have multiple plans for different conditions.
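
Incidentally, Postgres's plan cache already does a crude version of the guard idea for prepared statements: it plans the first handful of executions with the real parameter values and only falls back to a cached generic plan if that doesn't look meaningfully worse. Reusing the by_cat statement from above (exact behaviour varies by version):

    EXPLAIN EXECUTE by_cat('books');  -- early executions: a "custom" plan for this value
    -- After several EXECUTEs, the plan shown may switch to a generic,
    -- parameter-agnostic one if its estimated cost is close to the custom plans'.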

As for crazily crafting some JIT-ish code heavily optimized just for that one request, people are already working on that (the slides are in Russian, but the abstract is still interesting!):

https://pgconf.ru/en/2016/89652


There's been discussion of such a feature but AFAIK it's not implemented.

If you consider a data warehousing situation where queries may run for multiple minutes (or longer!), adding a second of extra processing to come up with a better plan can be well worth it. The reverse situation is an OLTP workload where the query itself executes in 1ms; there, additional planning time is pointless.

A more complicated (though if you ask me wayyy cooler) approach is to spend X time units finding a decent plan, start executing the query, and then keep planning in the background. If you come across a much better plan and you're not done yet, switch to that plan instead. The MVCC semantics of Postgres would allow for this as the new plan would see the same data "as of" when the query started. Again this isn't implemented yet but has been tossed around.


Postgres doesn't allot (as far as I know) the planner a maximum amount of time. Rather, it picks the algorithm to use based on the number of joins.

The computational complexity of the default algorithm used by the query optimizer depends on the number of joins, and will explode if the number of joins gets high. Thus, if the number of joins exceeds a certain value (the default is 12), it will switch to use the genetic query optimizer (GEQO) [1], which uses genetic algorithms to randomly search towards an optimal plan. GEQO isn't constrained by time, either, but programmed with a maximum number of generations.
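
The relevant knobs, for reference (check your version's docs for exact defaults):

    SHOW geqo_threshold;       -- default 12: queries with at least this many FROM items use GEQO
    SHOW join_collapse_limit;  -- limits how many explicit JOINs the planner will reorder

    SET geqo = off;            -- force the exhaustive planner (can get very slow)
    SET geqo_effort = 10;      -- 1..10; trades planning time for plan quality
    SET geqo_generations = 0;  -- 0 = derive a suitable value from the pool size / effort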

[1] http://www.postgresql.org/docs/9.0/static/geqo.html


Oracle takes this to its logical extreme by allowing pre-queries that sample some of the data, so it can change the query plan on the fly if it's taking too long. It's a very powerful feature, but for many workloads it's not worth it. It's worth noting that better statistics and data sampling are usually more important to producing a great plan than exhaustively exploring, e.g., join order (especially for things like many-to-many joins).


Postgres offers a variety of session-based tunings for their genetic query optimizer. See items such as geqo_effort. Note that Postgres caches query plans for prepared statements, so the planning cost for a well-constructed query will only be paid once(ish).
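
For example, you can raise the effort for one session, or persist it for a role or database (the role/database names here are made up):

    SET geqo_effort = 10;                           -- this session only
    ALTER ROLE reporting_user SET geqo_effort = 9;  -- default for one role
    ALTER DATABASE analytics SET geqo_effort = 9;   -- default for one database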

http://www.postgresql.org/docs/9.2/static/runtime-config-que...


The main benchmarks that these compilers compete on are those from http://www.tpc.org/. They are synthetic datasets which are fairly naively generated. http://www.vldb.org/pvldb/vol9/p204-leis.pdf is a really interesting paper that looks at how some of the heuristics used in those compilers work well on TPC but break down on real datasets. It can give you a good intuition of when you have to be wary of bad plans.


As someone who spends a fair amount of time in databases but also large scale mathematical optimization, it's been a goal of mine to dig into the Postgres code and start hacking on the query optimizer. From a very quick overview, I see tons of potential for huge improvements.

1) The query optimizer doesn't explore all plans, but it also doesn't need to. There are very smart algorithms for exploring branches of micro-optimizations only when they are likely to improve upon an existing plan, and they range from local search to genetic algorithms to branch-and-cut column generation. Outside of a very limited application of Genetic Algorithms which most users won't ever see used, Postgres doesn't do this very smartly.

2) Most production databases have the same set of queries running on them 99% of the time. Queries are optimized on the spot, or pre-optimized but missing important runtime information. Postgres could take a hybrid approach...pre-optimize queries, but explore LP duality models of the optimization to discover where the thresholds are for alternative plans. For example, use one query plan for all values of columnA, except when columnA = "FOO" or columnB > 42.

3) The statistics that are made available to the optimizer are very rudimentary. Multi-column statistics and conditional probability statistics, for example, are missing. They are left out because gathering such statistics can be very expensive. Understanding cardinality can help identify which statistic improvements could be cheaply obtained, and machine learning models and LP duality information could help determine which of the more expensive sets of statistical information would be valuable for the types of queries that tend to be run on the database. I would estimate that conditional probability collections on multi-column indexes alone could result in massive (order of magnitude) improvements over full table scans. (The single-column knobs that do exist today are sketched after this list.)

4) I have not verified this, but I don't think Postgres does any sort of statistical or ML-based analysis on the IO or compute power characteristics of the server. Making this information available could result in huge improvements in query optimization when there is a tradeoff that can be made between IO and Compute (such as how to compress data, where thresholds should be set to activate parallel aggregates and scans, etc.). They could also do some learning on the read-write ratios on a table level, which could also offer some important optimizations.

5) There is room for extensions of the SQL table creation syntax that could make available static information that can both be applied as a constraint and used for query optimization. For example, let's say you have a multi-column key (or even not a key, but just known in advance) where the information is purely hierarchical from column1->column2->column3. If that is the case, certain values of column3 would only ever occur if column2 = 'bar' and column1 = 'foo'. Even if your query doesn't ever touch column1, the query optimizer could possibly choose to not evaluate a predicate on column3 until it knows that column2 = 'bar' and column1 = 'foo'. And as a benefit, you could get a crucial DML runtime error if that ever magically changes on you.
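
On points 3 and 4, for reference, what exists today is a per-column statistics target plus hand-set cost constants (the table/column names and values below are made up):

    -- 3) Statistics are per-column; you can raise the sample target, but there
    --    are no multi-column or conditional statistics.
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;
    SELECT attname, n_distinct, correlation FROM pg_stats WHERE tablename = 'orders';

    -- 4) The IO/CPU cost model is a handful of constants you tune yourself;
    --    nothing is learned from the hardware.
    SET random_page_cost = 1.1;         -- e.g. closer to seq_page_cost on SSDs
    SET effective_cache_size = '16GB';
    SET cpu_tuple_cost = 0.01;          -- the default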

Now, I just need to learn some C and find some time to hack on it :)


As for number 5: Postgres already does that. This feature is called "constraints". Such constraints can be applied to multiple columns and are used to build better query plans (as well as limiting DML operations).
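
For example (table and values made up; whether the planner actually exploits a given CHECK constraint depends on constraint_exclusion and the query):

    CREATE TABLE items (
        column1 text NOT NULL,
        column2 text NOT NULL,
        column3 text NOT NULL,
        -- Encode the hierarchy: 'baz' in column3 only ever occurs under foo/bar.
        CHECK (column3 <> 'baz' OR (column1 = 'foo' AND column2 = 'bar'))
    );

    -- DML that violates the hierarchy now fails the CHECK constraint:
    INSERT INTO items VALUES ('other', 'other', 'baz');

    -- And the planner can use provably-false constraints to skip work:
    SET constraint_exclusion = on;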


For comparison, SQLite's optimizer is described in https://www.sqlite.org/optoverview.html and, linked from it, https://www.sqlite.org/queryplanner-ng.html.

SQLite doesn't want to spend time considering all possible join combinations, accepting the risk that it chooses a plan that is a lot worse than the optimal one.


That's a great overview. This will come in handy for training at work.


And throw more memory into the system.



