It doesn't seem procedural to me, as there is no rebinding. Such a sequence of assignments would look just as at home in a functional language like Lisp, ML, or Haskell as in Python. In procedural languages, idiomatically, you have mutation, in which variables are re-bound to new values, and side effects, in which the external environment and the program can interact with each other in ways that are unconstrained.
I guess the point is you could just stick the results of each step that is currently a CTE into #table as separate queries, and you're only doing the work once to step through the stages?
From a debug point of view that feels more convenient to _me_.
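Something like this, in SQL Server terms (a sketch; table and column names are made up for illustration):

-- Stage 1: what used to be the first CTE
SELECT user_id, SUM(amount) AS total
INTO #stage1
FROM orders
GROUP BY user_id;

-- Stage 2: builds on stage 1; at any point you can SELECT from #stage1 or #stage2 to debug
SELECT s.user_id, s.total, u.region
INTO #stage2
FROM #stage1 s
JOIN users u ON u.user_id = s.user_id;

-- Final result
SELECT region, SUM(total) AS total
FROM #stage2
GROUP BY region;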
I guess it depends: materialising all the steps might be slow, and you might need to do a refactor to get back to a CTE-based query. If you just want to check a few (maybe changing) records because you expect problems there, a final SELECT with an appropriate WHERE is probably faster.
Edit: also, I don’t like to leave a mess behind, and CTEs don’t require cleaning up afterwards.
With CTEs, the optimizer won't execute the queries separately but will create one optimized version. If you materialize them, execution times can be many orders of magnitude higher in cases where you only end up using small parts of the results.
CTEs (common table expressions) are wonderful; they make SQL grokkable and maintainable. But, before some dev goes crazy refactoring all of the SQL in their codebase into maintainable CTEs: always benchmark. CTEs can cause your query planner to optimize incorrectly. In some cases, CTEs can force the query optimizer to choose a plan it otherwise wouldn't choose and be more performant - perhaps up to a certain point. (That point being the size of the tables in question, indexes, etc.)
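For instance, in Postgres a quick way to benchmark is to compare EXPLAIN ANALYZE output for the CTE form and the hand-inlined form directly (a sketch; the orders table is illustrative):

EXPLAIN ANALYZE
WITH recent AS (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT customer_id, count(*) FROM recent GROUP BY customer_id;

-- versus the inlined form, to see whether the planner produced the same plan
EXPLAIN ANALYZE
SELECT customer_id, count(*)
FROM orders
WHERE created_at > now() - interval '7 days'
GROUP BY customer_id;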
I had some code from a vendor that was hanging in SQL Server. I looked at the code, and they had composed everything together using smaller CTE queries, and it took forever to run. They had taken various complex criteria and executed one query for each, then combined the results -- I ended up spending a day refactoring the whole thing into a single SELECT with all the criteria, and it ran instantly.
This was for a nightly data importer and it was filled with queries like that -- I only fixed the one giving us issues -- but I bet I could have reduced the total runtime by 90%.
Indeed, SQL is tricky like that: there are plenty of performance consequences in a language that is meant to be declarative and optimized by a query planner.
Unless you have realistic data in the test database the performance behavior of a query can be unpredictable.
In SQL you're describing the result that you want -- you aren't (and shouldn't be) saying how to achieve that result. The engine goes off and (hopefully) finds the optimal way to create the result you are describing. When you break queries down too far, the engine has more trouble finding the optimal path.
SQL Server can "see through" views and CTEs optimize them combined with a query but depending on the calculations it might have to generate and process intermediate data that it doesn't need. In my case above, the engine was reprocessing the same data set over and over, generating duplicate rows, and combining everything removing the duplicates.
My query was significantly simpler but also, in a way, harder to follow. It wasn't as clear what the intent was. Where they had nice named CTE subqueries, I had to use comments.
I agree but all systems have limitations and SQL engines get better at the job every year.
The result of the query is, say, declaration X. In this case, imagine you can construct a query to formulate X without a CTE. The database goes off and computes X from the data.
But say you start adding CTEs for readability, or because you're building up the query piece by piece and testing it in stages. You might be adding additional declarations A, B, and C and combining them to ultimately get X. In theory, the database engine should be able to see that you're still trying to get X and perform the query the same way as if it were a single query without a CTE. And, in some cases, it can actually do that.
But if A, B, and C are complicated enough then it's just unable to determine that it all equals X in the end. This is a fairly understandable limitation of the engine -- SQL can get pretty complicated. It's also making the assumption that all these CTEs are actually needed to compute the result. It took me a whole day to convert a CTE-based query into a non-CTE-based query and get all the same results in the same way.
Yes, the more complicated it is, the more likely the optimizer is to get confused... but also the greater the likelihood of bugs and the greater the benefit of something that makes your query easier to read.
I have strongly encouraged reckless use of CTEs throughout our product. We use in-memory instances of SQLite to evaluate any queries which would leverage CTEs. These datasets are usually small enough to reside within L1 (and certainly within L2).
“By the time it doesn’t fit in [L2/RAM] you will have a different order of magnitude of engineering team” is one of my favorite adages. Arguably a bit less true with the venture model of growth-at-all-costs showing some cracks, but still very much a good mental model!
In my experience a huge portion of the userbase who loves CTEs are analysts or devs running queries on decently large datasets, and they mostly like them because of readability and don't understand performance possibilities.
I appreciate that tools like dbt allow materialization options as both CTEs and views/tables/etc because being able to pivot between them is super nice.
Note that there are basically no special performance considerations when using CTEs in recent versions of postgres… unless your CTE is recursive or does some weird side-effect (which is unlikely).
This is incorrect. If you use a CTE more than once in a query, Postgres will materialize that data into a temp table that has no indexes. If that CTE gets joined a lot, it will incur many O(N) searches through unindexed data.
I've seen cases where a few GBs of product data get pulled into a CTE this way. Queries take 10s to complete just from processing GBs of unindexed data.
Wow, this is amazing. Just the other day I was staring at a query with multiple CTEs that took 20+ seconds to run. Ended up adding an additional WHERE clause on an indexed column to the slowest CTE, which took the execution time down to 100s of ms.
Just tried the original query, but with explicitly not materialized CTEs and the runtime was basically as fast as the manually pre-filtered solution.
No, but this would make an incredible Postgres extension.
(I’d also love one that lets Postgres use a covering index for primary storage. This can reduce data size and load less data off disk due to locality. MySQL has this by default and SQLite has it via WITHOUT ROWID tables.)
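For reference, the SQLite form mentioned there looks like this (a minimal sketch; the table name is illustrative):

-- The table is clustered on its primary key instead of a hidden rowid,
-- so lookups by code read row data straight out of the primary key b-tree
CREATE TABLE country_codes (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
) WITHOUT ROWID;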
Strange take you have there in assuming they don't understand performance possibilities. Many of the problems that analysts and such work on aren't performance-bound, so focusing on performance would be a net waste of time. "Don't optimize early" and all that. For example, I had to share SQL recently where the absolute most important thing, beyond all else, was to make it readable. CTEs fit the role.
Postgres has materialization options for CTEs. You can ask it to rewrite the CTE as a join via "WITH foobar AS NOT MATERIALIZED (...)". Alternatively you can force materialization by writing "WITH foobar AS MATERIALIZED (...)"
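For example (Postgres 12+; the orders table is hypothetical):

-- Let the planner inline the CTE into the outer query
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;

-- Or force the CTE to be computed once and kept as its own result set
WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;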
Based on other comments, this has been done in Postgres, or at least to some extent. So that's the other thing... pre-emptively writing code in a non-intuitive way for performance reasons often leads to practices sticking around long after the rationale for them stops being true.
I wrote some very elegant, readable SQL to perform a complex query. It was dog slow.
I handed it to a DBA, they ripped out the CTEs and replaced them all with temp tables. The query was an unreadable mess at the end, but boy was it orders of magnitude faster.
Did they do something else other than just popping CTE queries into temp tables? In my experience, _just_ doing that doesn't affect readability much at all.
This is a great idea if you disable materialization (which the author of this post does not mention).
Yes, materialization is fine for the small hardcoded values in the post, but for most other lookup tables it's definitely not fine.
For example, you read this article and think: great, I'll make a lookup CTE that maps various ids so they can be linked across tables.
with ids as (
    select u.user_id, u.token, s.customer_id
    from users u
    join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
    select i.*, ids.user_id
    from impressions i
    join ids using (token)
),
payments as (
    select p.*, ids.user_id
    from payments p
    join ids using (customer_id)
)
select i.date,
       sum(i.views),
       sum(i.clicks),
       sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by i.date;
You think: great, this gets some stats for one user since it's filtered at the end, and with indexes on stripe_customer(user_id), impressions(token), and payments(customer_id), this will be really fast and efficient!
Nope... since ids is referenced more than once in this query, Postgres materializes the ids CTE onto disk with no indexes. So not only does it take up lots of extra space on disk to store all the users in your system, but to build user_impressions and payments you have to do an O(N) search across the ids dataset. There are no indexes on a materialized CTE.
It would be a lot faster to join to users or stripe_customer or both in a loop with indexes than to do an O(N) search through all the users in your system.
This is particularly dangerous because if you remove the payments part, ids only has one reference, so Postgres doesn't materialize the table, and everything is fast.
This can be fixed by writing WITH foobar AS NOT MATERIALIZED in the CTE syntax. IMO it should be a syntax error to not specify AS MATERIALIZED or AS NOT MATERIALIZED. The default has too many potential performance problems, and folks should have to think about whether they want materialization or not.
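Applied to the query above, the only change is on the ids line (Postgres 12+; same hypothetical tables):

with ids as not materialized (
    select u.user_id, u.token, s.customer_id
    from users u
    join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
    select i.*, ids.user_id from impressions i join ids using (token)
),
payments as (
    select p.*, ids.user_id from payments p join ids using (customer_id)
)
select i.date, sum(i.views), sum(i.clicks), sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by i.date;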
side note: I asked chatgpt to rewrite this query to use subqueries and it did a reasonable job.
next step: include a cte-to-subquery translation step as part of your build pipeline. never needed tools like that, but i guess they must exist since it's been such a common issue. using gpt for this is like nuking a mosquito from orbit.
Seems like an interesting idea, but could use a better example, at least for those who aren't yet intermediate level in SQL.
In what world is this
WITH countries (code, name) AS (
    SELECT * FROM (VALUES
        ('us', 'United States'), ('fr', 'France'), ('in', 'India')
    ) AS codes
)
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
easier to read than this
SELECT code,
    CASE code
        WHEN 'us' THEN 'United States'
        WHEN 'fr' THEN 'France'
        WHEN 'in' THEN 'India'
    END AS country
FROM data;
Yea this is a common problem with contrived code samples like these. In my experience the author is right in that real world examples of using CASE can often get out of hand - especially if you need it in more than one place. But the contrived, simple example is clearly easier than the CTE.
It is perhaps not surprising that we end up with this sort of stuff in production code, because the original author only needed one simple CASE statement and then it organically grew from there.
Occasionally, SQL surprises with bits of composability, such as the fact that VALUES lists of literal tuples can be used both with INSERT and in the FROM clause of a SELECT.
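For example, the same tuple list works in both positions (assuming a countries(code, name) table exists for the INSERT half):

-- Literal tuples feeding an INSERT
INSERT INTO countries (code, name)
VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India');

-- The same tuples acting as a derived table in FROM, with column names given in the alias
SELECT c.code, c.name
FROM (VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India')) AS c (code, name);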
Is there any syntactic reason the SELECT needs to be required? If you use a VALUES table literal as a subselect, you have to give it column names with an AS clause (see https://www.postgresql.org/docs/current/sql-values.html). I can imagine a simpler syntax where you'd do WITH VALUES ... AS table_name (column_1_name, ...).
Is there any reason to alias the lookup table as `codes`?
sqlite doesn't have linear regression functions, and doing the math manually is a bit awkward because "b" relies on "m". Instead of duplicating the math to calculate "b", here's how to do it with CTEs:
CREATE TABLE vals (x, y);
INSERT INTO vals VALUES (1, 1), (2, 0.5), (3, 0.4), (4, 0.1), (5, 0);
WITH
m(v) AS (
SELECT ((COUNT(*) * (SUM(x * y))) - (SUM(x) * SUM(y))) / ((COUNT(*) * SUM(POW(x, 2))) - (POW(SUM(x), 2)))
FROM vals
),
b(v) AS (
SELECT (SUM(y) - (m.v * SUM(x))) / COUNT(*)
FROM vals JOIN m
)
SELECT
x AS real_x,
y AS real_y,
m.v AS m,
b.v AS b,
x * m.v + b.v AS interp_y
FROM vals JOIN m JOIN b;
real_x real_y m b interp_y
------ ------ ----- ---- -------------------
1 1 -0.24 1.12 0.88
2 0.5 -0.24 1.12 0.64
3 0.4 -0.24 1.12 0.4
4 0.1 -0.24 1.12 0.16
5 0 -0.24 1.12 -0.0800000000000001
If all you want is "m" and "b" the final query can just be "FROM m JOIN b" and the result will be 1 row.
I've tried to get myself writing a few times but never succeeded, so I don't have a blog. But - I am actually doing parameters like so in the view where I copied the above from:
CREATE TABLE view_confs (key, value);
INSERT INTO view_confs VALUES ('start', 2), ('end', 4);
Now prefix it with another one:
vals_range(x, y) AS (
SELECT x, y
FROM vals
WHERE x >= (SELECT value FROM view_confs WHERE key = 'start')
AND x < (SELECT value FROM view_confs WHERE key = 'end')
),
And swap the rest of the original query to use "vals_range" instead of "vals". Just gotta remember to update view_confs with the new range whenever you want to change the view. I think it should work inside a transaction to avoid multiple threads interfering with each other, but also to be clear: This is a workaround for sqlite not having stored procedures, and not wanting to implement it in code (so I can JOIN to the view in other queries). Better to use stored procedures in other databases than this workaround.
Removing CTEs from your codebase by replacing them with a temp table, then populating it with separate queries that use as few joins as possible, will give you lower locking and performance that's hundreds of times better nearly every time. No problem with readability.
I see this pattern all the time in mssql code, but hardly ever elsewhere. I've always wondered if there is a reason for that. Are you coming from that background? I'm curious if anyone knows why it is so favored there?
I'm not sure I agree re performance btw. I've fixed a lot of slow mssql queries by changing them in the other direction (EDIT: or just adding a join). I do see how temp tables might shorten the time you hold locks---but then you might be subverting your transaction isolation level.
I have also seen the performance boost by using separate temp tables rather than CTEs (in my case in Redshift).
My hypothesis is that while memory use may be the same either way, you don't have the same transactional/locking requirements/overhead/consistency with multiple statements as you do with a single big set of CTEs. And for analytics you rarely need the transactional isolation required by a single big CTE query.
That's because Redshift doesn't support MATERIALIZED CTEs. In mainline Postgres, a MATERIALIZED CTE acts more like a temporary table with the performance characteristics you're looking for.
Are you putting indexes on your temporary tables? Are you including the time it takes to create them?
I can't speak for others, but most of my CTEs tend to be small enough where an index wouldn't help much. Of course toggling MATERIALIZED is an easy way to see whether the equivalent of a temp table would work best or if the planner can successfully include the body of CTE in optimization. A lot harder syntactically to swap out a subquery for a temp table, which is what the MATERIALIZED keyword essentially does under the hood (at least in Postgres).
Hell, on small tables Postgres often won’t do an index scan even when one exists, because it’s faster to do a linear scan than an index read + fetches - or at least the planner thinks so.
It's not necessarily about locking, it's a way to control what order joins get done in, which, at least in MS SQL, is up to the query engine to decide. By writing to a temp table you can override the query engine's plan, often with huge performance gains.
Less well known is that you can get the same effect using a TOP N clause where N is a number larger than the rows you might get back. This presumably avoids the additional I/O penalty of the temp table creation, but I've never validated that. This trick is likely portable across db engines, too.
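A sketch of what that looks like in T-SQL (hypothetical tables; per the comment above, worth checking the actual plan to confirm the effect):

WITH filtered_orders AS (
    -- TOP with a bound far above any realistic row count, used only to nudge
    -- the optimizer into evaluating this block before the outer join
    SELECT TOP (1000000000) o.order_id, o.customer_id, o.amount
    FROM orders o
    WHERE o.status = 'shipped'
)
SELECT c.region, SUM(f.amount) AS total
FROM filtered_orders f
JOIN customers c ON c.customer_id = f.customer_id
GROUP BY c.region;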
It depends on the use of the database. Where there are intense transactional writes (e.g. a case management system, etc.) the CTE approach can easily get locked up, especially if the write queries do a lot of their own lookups. I have come from an MSSQL background, so maybe that is an artifact either of the DB or the way it is used in my own history :) Good question though, and I'd like to hear from someone who has experience in other DBs.
WRT subverting the transaction isolation level, using temp tables is a tacit decision to do so and best for non-vital read queries in my own experience. That is, I used it for lookups of tables that did not _need_ to be completely up to date, or when I knew the data _would_ be up to date.
I find that anyone who praises chaining CTEs has not dealt with how terrible they can become; each sub table being able to take on any dependency in the parent tables is not a feature of going fast.
I've used CTEs fairly extensively so I'm generally familiar with their pitfalls, but I don't quite grok your point here. Can you go into a little more detail about what you mean by "take on any dependency in the parent tables"?
In cases like in the example given, I would also consider using a generated column (based on the CASE expression, in the example). That way the CTE doesn’t need to be repeated on each affected query — any query on the table can directly reference the derived value. Of course, this approach is only applicable for row-based values of a single table, and requires you to be able to modify the schema to add the generated column.
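In Postgres 12+, for example, that could look like this (using the data/code/country names from the article's example; generated columns there must be STORED):

-- Derived column computed from code; any query on data can reference it directly
ALTER TABLE data
    ADD COLUMN country text GENERATED ALWAYS AS (
        CASE code
            WHEN 'us' THEN 'United States'
            WHEN 'fr' THEN 'France'
            WHEN 'in' THEN 'India'
        END
    ) STORED;

SELECT code, country FROM data;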
This also works in MySQL 8.0.19 and above but the syntax of the CTE example needs to be tweaked to use the ROW() constructor:
WITH countries (code, name) AS (
SELECT * FROM (VALUES
ROW('us', 'United States'),
ROW('fr', 'France'),
ROW('in', 'India')
) AS codes
)
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
It creates a lookup table. When you are using the CASE operator, in simple cases it will use the `transform` function under the hood.
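For example, the article's lookup could also be written directly with transform (a sketch; the last argument is the default for unmatched codes):

SELECT
    code,
    transform(code,
              ['us', 'fr', 'in'],
              ['United States', 'France', 'India'],
              'Unknown') AS name
FROM data;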
There are more advanced capabilities:
- The `Join` table engine. It is a pre-warmed state for joining, a hash table kept in memory.
- Dictionaries. Pre-warmed, automatically updated lookup data structures from various sources. For example, you can connect a dictionary of company names from your operational database while keeping only the ids in ClickHouse.
And, just in case, the same CTE works as in SQLite:
WITH countries AS (
SELECT c1 AS code, c2 AS name FROM VALUES(
('us', 'United States'), ('fr', 'France'), ('in', 'India')))
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
Example:
milovidov-desktop :) WITH countries AS (
SELECT c1 AS code, c2 AS name FROM VALUES(
('us', 'United States'), ('fr', 'France'), ('in', 'India')))
SELECT data.code, name FROM (SELECT 'us' AS code) AS data LEFT JOIN countries ON countries.code = data.code;
┌─code─┬─name──────────┐
│ us │ United States │
└──────┴───────────────┘
Common table expression. I don't know exactly where the "common" part comes from, but I'm guessing it's because you can use its alias multiple times in subsequent CTEs and the main query.
Various comments above expand upon it, but pg12 only changed CTEs which are referenced once to default to NOT MATERIALIZED. Multi-referenced CTEs remain materialized by default.
Nope, there’s no performance downside to using CTEs in recent postgres versions, unless the CTE is recursive or has side effects (which would be weird).
> predicate pushdown is a problem on both MSSQL and Postgresql
Not sure you’re completely on target here regarding CTE performance. I don’t have deep insight into Postgresql (but "create temporary view", yeah!). MSSQL does a pretty good job in SQL Server 2019. If the predicate is sargable in some way, pushdown is reliable. If performance is a concern, examining the residual predicates can lead to significant insights, e.g. applying index filtering, which solves obvious problems. Recursion is another story.
CTEs are more likely to be used in data analyst queries because they are an abstraction for composition. It’s not a great abstraction, but it’s better than nothing, which is mostly what you get with SQL.
Apologies I'm late to reply to this, but yes. The way each compiler/optimizer handles CTEs can be dramatically different from RDBMS to RDBMS.
For this specific use case though, my comment was made because CASE statements tend to just be bad in comparison to CTEs. Probably dependent on the engine though; I don't know them all by heart.
Isn't it more readable to make the code snippet with the case when statements into a CTE? I feel like I am misunderstanding the proposed benefits here.
For the sake of completeness, you can accomplish the same using UNION instead of VALUES and subquery instead of CTE.
WITH CTE AS (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
)
SELECT * FROM CTE
… likewise you can also do that in a subquery that you can immediately join:
SELECT * FROM (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
) SUB
INNER JOIN AnotherTable AT
    ON SUB.code = AT.code
In any non-trivial code, SELECT * is a code smell anyway. Thus parsers don’t spend cycles looking for forms that are unlikely to be useful in real code.
CTEs still give me anxiety due to the possibility that they won't be optimized. I know many database engines today optimize for this type of thing, and I'm thankful for it, but there was a time 10+ years ago when CTE expressions could take down databases. I also blame dbt for convincing people that CTEs are good. CTEs are only good if used correctly.
CTEs are just a nicer, more readable way of writing a subquery. Also, CTEs can be recursive. In Postgres, CTEs can be useful when used with the RETURNING clause. If you're using the same CTE in multiple different queries, then consider creating a view instead. I wouldn’t use a CTE for lookup values, create a lookup table instead.
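For example, a data-modifying CTE with RETURNING can move rows between tables in one statement (a sketch with hypothetical orders/orders_archive tables that share columns):

-- Archive old rows in a single statement: the DELETE's RETURNING output feeds the INSERT
WITH archived AS (
    DELETE FROM orders
    WHERE created_at < now() - interval '1 year'
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM archived;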
Worth noting that "Row values were added to SQLite version 3.15.0 (2016-10-14)" - my experience is that older versions of SQLite are a lot more common than I would hope they would be. https://www.sqlite.org/rowvalue.html
It's better when you need a lookup with multiple columns that won't be needed in any other query/ETL job. It might be steps in a funnel or values in a bespoke customer discount query.
Some DBs do allow struct/array fields but joining against a constant CTE is portable and has low overhead.
I like CTEs, but for complicated queries I find it a lot easier to make them run fast by splitting them into individual queries and creating temporary tables with appropriate indexes.
Well, if you have multiple queries, you'd still have to write the CTE multiple times. You could also use views for the same thing, and those are much more reusable.
Yeah. In many cases I've had to use CTEs like this in BI tools. When you're experimenting with datasets for dashboards it's much faster to work with CTEs than to try to make production DB changes.
I agree, I'm often using CTEs to organize sub-result-sets because asking for a production database change / view change requires a ticket, and it's just way faster to iterate in a few CTEs in a single query.
If, eventually, a few CTE distillations become common enough, then yeah, that's grounds for a request to basically shove that CTE into a view so other people can use it.
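At that point the promotion is pretty mechanical (a sketch with hypothetical names):

-- The CTE body, promoted to a view other queries and dashboards can share
CREATE VIEW active_customers AS
SELECT customer_id, email
FROM customers
WHERE last_order_at > now() - interval '90 days';

-- Callers then join or filter it like any table
SELECT * FROM active_customers WHERE email LIKE '%@example.com';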