It doesn't seem procedural to me, as there is no rebinding. Such a sequence of assignments would look just as at home in a functional language like Lisp, ML, or Haskell as in Python. In procedural languages, idiomatically, you have mutation, in which variables are re-bound to new values, and side effects, in which the external environment and the program can interact with each other in ways that are unconstrained.
I guess the point is you could just stick the results of each step that is currently a CTE into #table as separate queries, and you're only doing the work once to step through the stages?
From a debug point of view that feels more convenient to _me_.
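Something like this, in SQL Server terms (a sketch; table and column names are made up for illustration):

-- Stage 1: what used to be the first CTE
SELECT user_id, SUM(amount) AS total
INTO #stage1
FROM orders
GROUP BY user_id;

-- Stage 2: builds on stage 1; at any point you can SELECT from #stage1 or #stage2 to debug
SELECT s.user_id, s.total, u.region
INTO #stage2
FROM #stage1 s
JOIN users u ON u.user_id = s.user_id;

-- Final result
SELECT region, SUM(total) AS total
FROM #stage2
GROUP BY region;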
I guess it depends: materialising all the steps might be slow, and you might need to do a refactor to get back to a CTE-based query. If you just want to check a few (maybe changing) records because you expect problems there, a final SELECT with an appropriate WHERE is probably faster.
Edit: also, I don’t like to leave a mess behind, and CTEs don’t require cleaning up afterwards.
With CTEs, the optimizer won't execute the queries separately but will create one optimized version. If you materialize them, execution times can be many orders of magnitude higher in cases where you only end up using small parts of the results.
CTEs (common table expressions) are wonderful; they make SQL grokkable and maintainable. But, before some dev goes crazy refactoring all of the SQL in their codebase into maintainable CTEs: always benchmark. CTEs can cause your query planner to optimize incorrectly. In some cases, CTEs can force the query optimizer to choose a plan it otherwise wouldn't choose and be more performant - perhaps up to a certain point. (That point being the size of the tables in question, indexes, etc.)
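For instance, in Postgres a quick way to benchmark is to compare EXPLAIN ANALYZE output for the CTE form and the hand-inlined form directly (a sketch; the orders table is illustrative):

EXPLAIN ANALYZE
WITH recent AS (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT customer_id, count(*) FROM recent GROUP BY customer_id;

-- versus the inlined form, to see whether the planner produced the same plan
EXPLAIN ANALYZE
SELECT customer_id, count(*)
FROM orders
WHERE created_at > now() - interval '7 days'
GROUP BY customer_id;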
I had some code from a vendor that was hanging in SQL Server. I looked at the code, and they had composed everything together using smaller CTE queries, and it took forever to run. They had taken various complex criteria and executed one query for each, then combined the results -- I ended up spending a day refactoring the whole thing into a single SELECT with all the criteria, and it ran instantly.
This was for a nightly data importer and it was filled with queries like that -- I only fixed the one giving us issues -- but I bet I could have reduced the total runtime by 90%.
Indeed, SQL is tricky like that: there are plenty of performance consequences in a language that is meant to be declarative and optimized by a query planner.
Unless you have realistic data in the test database the performance behavior of a query can be unpredictable.
In SQL you're describing the result that you want -- you aren't (and shouldn't be) saying how to achieve that result. The engine goes off and (hopefully) finds the optimal way to create the result you are describing. When you break queries down too far, the engine has more trouble finding the optimal path.
SQL Server can "see through" views and CTEs optimize them combined with a query but depending on the calculations it might have to generate and process intermediate data that it doesn't need. In my case above, the engine was reprocessing the same data set over and over, generating duplicate rows, and combining everything removing the duplicates.
My query was significantly simpler but also, in a way, harder to follow. It wasn't as clear what the intent was. Where they had nice named CTE subqueries, I had to use comments.
I agree but all systems have limitations and SQL engines get better at the job every year.
The result of the query is, say, declaration X. In this case, imagine you can construct a query to formulate X without a CTE. The database goes off and computes X from the data.
But say you start adding CTEs for readability, or because you're building up the query piece by piece and testing it in stages. You might be adding additional declarations A, B, and C and combining them to ultimately get X. In theory, the database engine should be able to see that you're still trying to get X and perform the query the same way as if it were a single query without a CTE. And, in some cases, it can actually do that.
But if A, B, and C are complicated enough then it's just unable to determine that it all equals X in the end. This is a fairly understandable limitation of the engine -- SQL can get pretty complicated. It's also making the assumption that all these CTEs are actually needed to compute the result. It took me a whole day to convert a CTE-based query into a non-CTE-based query and get all the same results in the same way.
Yes, the more complicated it is, the more likely the optimizer is to get confused... but also the greater the likelihood of bugs and the greater the benefit of something that makes your query easier to read.
I have strongly encouraged reckless use of CTEs throughout our product. We use in-memory instances of SQLite to evaluate any queries which would leverage CTEs. These datasets are usually small enough to reside within L1 (and certainly within L2).
“By the time it doesn’t fit in [L2/RAM] you will have a different order of magnitude of engineering team” is one of my favorite adages. Arguably a bit less true with the venture model of growth-at-all-costs showing some cracks, but still very much a good mental model!
In my experience a huge portion of the userbase who loves CTEs are analysts or devs running queries on decently large datasets, and they mostly like them because of readability and don't understand performance possibilities.
I appreciate that tools like dbt allow materialization options as both CTEs and views/tables/etc because being able to pivot between them is super nice.
Note that there are basically no special performance considerations when using CTEs in recent versions of postgres… unless your CTE is recursive or does some weird side-effect (which is unlikely).
This is incorrect. If you use a CTE more than once in a query, Postgres will materialize that data into a temp table that has no indexes. If that CTE gets joined a lot, it will incur many O(N) searches through unindexed data.
I've seen cases where a few GBs of product data get pulled into a CTE this way. Queries take 10s to complete just from processing GBs of unindexed data.
Wow, this is amazing. Just the other day I was staring at a query with multiple CTEs that took 20+ seconds to run. Ended up adding an additional WHERE clause on an indexed column to the slowest CTE, which took the execution time down to 100s of ms.
Just tried the original query, but with explicitly not materialized CTEs and the runtime was basically as fast as the manually pre-filtered solution.
No, but this would make an incredible Postgres extension.
(I’d also love one that lets Postgres use a covering index for primary storage. This can reduce data size and load less data off disk due to locality. MySQL has this by default and SQLite has it via WITHOUT ROWID tables.)
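For reference, the SQLite form mentioned there looks like this (a minimal sketch; the table name is illustrative):

-- The table is clustered on its primary key instead of a hidden rowid,
-- so lookups by code read row data straight out of the primary key b-tree
CREATE TABLE country_codes (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
) WITHOUT ROWID;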
Strange take you have there in assuming they don't understand performance possibilities. Many of the problems that analysts and such work on aren't performance-bound, so focusing on performance would be a net waste of time. "Don't optimize early" and all that. For example, I had to share SQL recently where the absolute most important thing, beyond all else, was to make it readable. CTEs fit the role.
Postgres has materialization options for CTEs. You can ask it to rewrite the CTE as a join via "WITH foobar AS NOT MATERIALIZED (...)". Alternatively you can force materialization by writing "WITH foobar AS MATERIALIZED (...)"
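For example (Postgres 12+; the orders table is hypothetical):

-- Let the planner inline the CTE into the outer query
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;

-- Or force the CTE to be computed once and kept as its own result set
WITH recent_orders AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent_orders WHERE customer_id = 42;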
Based on other comments, this has been done in Postgres, or at least to some extent. So that's the other thing... pre-emptively writing code in a non-intuitive way for performance reasons often leads to practices sticking around long after the rationale for them stops being true.
I wrote some very elegant, readable SQL to perform a complex query. It was dog slow.
I handed it to a DBA, they ripped out the CTEs and replaced them all with temp tables. The query was an unreadable mess at the end, but boy was it orders of magnitude faster.
Did they do something else other than just popping CTE queries into temp tables? In my experience, _just_ doing that doesn't affect readability much at all.
This is a great idea if you disable materialization (which the author of this post does not mention).
Yes, materialization is fine for the small hardcoded values in the post, but for most other lookup tables it's definitely not fine.
For example, you read this article and think: great, I'll make a lookup CTE that maps various ids so they can be linked across tables.
with ids as (
    select u.user_id, u.token, s.customer_id
    from users u
    join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
    select i.*, ids.user_id
    from impressions i
    join ids using (token)
),
payments as (
    select p.*, ids.user_id
    from payments p
    join ids using (customer_id)
)
select i.date,
       sum(i.views),
       sum(i.clicks),
       sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by i.date;
You think: great, this gets some stats for one user since it's filtered at the end, and with indexes on stripe_customer(user_id), impressions(token), and payments(customer_id), this will be really fast and efficient!
Nope... since ids is referenced more than once in this query, Postgres materializes the ids CTE onto disk with no indexes. So not only does it take up lots of extra space on disk to store all the users in your system, but to build user_impressions and payments you have to do an O(N) search across the ids dataset. There are no indexes on a materialized CTE.
It would be a lot faster to join to users or stripe_customer or both in a loop with indexes than to do an O(N) search through all the users in your system.
This is particularly dangerous because if you remove the payments part, ids only has one reference, so Postgres doesn't materialize the table, and everything is fast.
This can be fixed by writing WITH foobar AS NOT MATERIALIZED in the CTE syntax. IMO it should be a syntax error to not specify AS MATERIALIZED or AS NOT MATERIALIZED. The default has too many potential performance problems, and folks should have to think about whether they want materialization or not.
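Applied to the query above, the only change is on the ids line (Postgres 12+; same hypothetical tables):

with ids as not materialized (
    select u.user_id, u.token, s.customer_id
    from users u
    join stripe_customer s on s.user_id = u.user_id
),
user_impressions as (
    select i.*, ids.user_id from impressions i join ids using (token)
),
payments as (
    select p.*, ids.user_id from payments p join ids using (customer_id)
)
select i.date, sum(i.views), sum(i.clicks), sum(p.amount)
from payments p
join user_impressions i using (user_id)
where user_id = $1
group by i.date;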
side note: I asked chatgpt to rewrite this query to use subqueries and it did a reasonable job.
next step: include a cte-to-subquery translation step as part of your build pipeline. never needed tools like that, but i guess they must exist since it's been such a common issue. using gpt for this is like nuking a mosquito from orbit.
Seems like an interesting idea, but could use a better example, at least for those who aren't yet intermediate level in SQL.
In what world is this
WITH countries (code, name) AS (
    SELECT * FROM (VALUES
        ('us', 'United States'), ('fr', 'France'), ('in', 'India')
    ) AS codes
)
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
easier to read than this
SELECT code,
    CASE code
        WHEN 'us' THEN 'United States'
        WHEN 'fr' THEN 'France'
        WHEN 'in' THEN 'India'
    END AS country
FROM data;
Yea this is a common problem with contrived code samples like these. In my experience the author is right in that real world examples of using CASE can often get out of hand - especially if you need it in more than one place. But the contrived, simple example is clearly easier than the CTE.
It is perhaps not surprising that we end up with this sort of stuff in production code, because the original author only needed one simple CASE statement and then it organically grew from there.
Occasionally, SQL surprises with bits of composability, such as the fact that VALUES lists of literal tuples can be used both with INSERT and in the FROM clause of a SELECT.
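For example, the same tuple list works in both positions (assuming a countries(code, name) table exists for the INSERT half):

-- Literal tuples feeding an INSERT
INSERT INTO countries (code, name)
VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India');

-- The same tuples acting as a derived table in FROM, with column names given in the alias
SELECT c.code, c.name
FROM (VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India')) AS c (code, name);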
Is there any syntactic reason the SELECT needs to be required? If you use a VALUES table literal as a subselect, you have to give it column names with an AS clause (see https://www.postgresql.org/docs/current/sql-values.html). I can imagine a simpler syntax where you'd do WITH VALUES ... AS table_name (column_1_name, ...).
Is there any reason to alias the lookup table as `codes`?
sqlite doesn't have linear regression functions, and doing the math manually is a bit awkward because "b" relies on "m". Instead of duplicating the math to calculate "b", here's how to do it with CTEs:
CREATE TABLE vals (x, y);
INSERT INTO vals VALUES (1, 1), (2, 0.5), (3, 0.4), (4, 0.1), (5, 0);
WITH
m(v) AS (
SELECT ((COUNT(*) * (SUM(x * y))) - (SUM(x) * SUM(y))) / ((COUNT(*) * SUM(POW(x, 2))) - (POW(SUM(x), 2)))
FROM vals
),
b(v) AS (
SELECT (SUM(y) - (m.v * SUM(x))) / COUNT(*)
FROM vals JOIN m
)
SELECT
x AS real_x,
y AS real_y,
m.v AS m,
b.v AS b,
x * m.v + b.v AS interp_y
FROM vals JOIN m JOIN b;
real_x real_y m b interp_y
------ ------ ----- ---- -------------------
1 1 -0.24 1.12 0.88
2 0.5 -0.24 1.12 0.64
3 0.4 -0.24 1.12 0.4
4 0.1 -0.24 1.12 0.16
5 0 -0.24 1.12 -0.0800000000000001
If all you want is "m" and "b" the final query can just be "FROM m JOIN b" and the result will be 1 row.
I've tried to get myself writing a few times but never succeeded, so I don't have a blog. But - I am actually doing parameters like so in the view where I copied the above from:
CREATE TABLE view_confs (key, value);
INSERT INTO view_confs VALUES ('start', 2), ('end', 4);
Now prefix it with another one:
vals_range(x, y) AS (
SELECT x, y
FROM vals
WHERE x >= (SELECT value FROM view_confs WHERE key = 'start')
AND x < (SELECT value FROM view_confs WHERE key = 'end')
),
And swap the rest of the original query to use "vals_range" instead of "vals". Just gotta remember to update view_confs with the new range whenever you want to change the view. I think it should work inside a transaction to avoid multiple threads interfering with each other, but also to be clear: This is a workaround for sqlite not having stored procedures, and not wanting to implement it in code (so I can JOIN to the view in other queries). Better to use stored procedures in other databases than this workaround.
Removing CTEs from your codebase by replacing them with a temp table, then populating it with separate queries that use as few joins as possible, will give you lower locking and performance that's hundreds of times better nearly every time. No problem with readability.
I see this pattern all the time in mssql code, but hardly ever elsewhere. I've always wondered if there is a reason for that. Are you coming from that background? I'm curious if anyone knows why it is so favored there?
I'm not sure I agree re performance btw. I've fixed a lot of slow mssql queries by changing them in the other direction (EDIT: or just adding a join). I do see how temp tables might shorten the time you hold locks---but then you might be subverting your transaction isolation level.
I have also seen the performance boost by using separate temp tables rather than CTEs (in my case in Redshift).
My hypothesis is that while memory use may be the same either way, you don't have the same transactional/locking requirements/overhead/consistency with multiple statements as you do with a single big set of CTEs. And for analytics you rarely need the transactional isolation required by a single big CTE query.
That's because Redshift doesn't support MATERIALIZED CTEs. In mainline Postgres, a MATERIALIZED CTE acts more like a temporary table with the performance characteristics you're looking for.
Are you putting indexes on your temporary tables? Are you including the time it takes to create them?
I can't speak for others, but most of my CTEs tend to be small enough where an index wouldn't help much. Of course toggling MATERIALIZED is an easy way to see whether the equivalent of a temp table would work best or if the planner can successfully include the body of CTE in optimization. A lot harder syntactically to swap out a subquery for a temp table, which is what the MATERIALIZED keyword essentially does under the hood (at least in Postgres).
Hell, on small tables Postgres often won’t do an index scan even when one exists, because it’s faster to do a linear scan than an index read + fetches - or at least the planner thinks so.
It's not necessarily about locking, it's a way to control what order joins get done in, which, at least in MS SQL, is up to the query engine to decide. By writing to a temp table you can override the query engine's plan, often with huge performance gains.
Less well known is that you can get the same effect using a TOP N clause where N is a number larger than the rows you might get back. This presumably avoids the additional I/O penalty of the temp table creation, but I've never validated that. This trick is likely portable across db engines, too.
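A sketch of what that looks like in T-SQL (hypothetical tables; per the comment above, worth checking the actual plan to confirm the effect):

WITH filtered_orders AS (
    -- TOP with a bound far above any realistic row count, used only to nudge
    -- the optimizer into evaluating this block before the outer join
    SELECT TOP (1000000000) o.order_id, o.customer_id, o.amount
    FROM orders o
    WHERE o.status = 'shipped'
)
SELECT c.region, SUM(f.amount) AS total
FROM filtered_orders f
JOIN customers c ON c.customer_id = f.customer_id
GROUP BY c.region;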
It depends on the use of the database. Where there are intense transactional writes (e.g. a case management system, etc.) the CTE approach can easily get locked up, especially if the write queries do a lot of their own lookups. I have come from an MSSQL background, so maybe that is an artifact either of the DB or the way it is used in my own history :) Good question though, and I'd like to hear from someone who has experience in other DBs.
WRT subverting the transaction isolation level, using temp tables is a tacit decision to do so and best for non-vital read queries in my own experience. That is, I used it for lookups of tables that did not _need_ to be completely up to date, or when I knew the data _would_ be up to date.
I find that anyone who praises chaining CTEs has not dealt with how terrible they can become; each sub table being able to take on any dependency in the parent tables is not a feature of going fast.
I've used CTEs fairly extensively so I'm generally familiar with their pitfalls, but I don't quite grok your point here. Can you go into a little more detail about what you mean by "take on any dependency in the parent tables"?
In cases like in the example given, I would also consider using a generated column (based on the CASE expression, in the example). That way the CTE doesn’t need to be repeated on each affected query — any query on the table can directly reference the derived value. Of course, this approach is only applicable for row-based values of a single table, and requires you to be able to modify the schema to add the generated column.
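In Postgres 12+, for example, that could look like this (using the data/code/country names from the article's example; generated columns there must be STORED):

-- Derived column computed from code; any query on data can reference it directly
ALTER TABLE data
    ADD COLUMN country text GENERATED ALWAYS AS (
        CASE code
            WHEN 'us' THEN 'United States'
            WHEN 'fr' THEN 'France'
            WHEN 'in' THEN 'India'
        END
    ) STORED;

SELECT code, country FROM data;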
This also works in MySQL 8.0.19 and above but the syntax of the CTE example needs to be tweaked to use the ROW() constructor:
WITH countries (code, name) AS (
SELECT * FROM (VALUES
ROW('us', 'United States'),
ROW('fr', 'France'),
ROW('in', 'India')
) AS codes
)
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
It creates a lookup table. When you are using the CASE operator, in simple cases it will use the `transform` function under the hood.
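For example, the article's lookup could also be written directly with transform (a sketch; the last argument is the default for unmatched codes):

SELECT
    code,
    transform(code,
              ['us', 'fr', 'in'],
              ['United States', 'France', 'India'],
              'Unknown') AS name
FROM data;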
There are more advanced capabilities:
- The `Join` table engine. It is a pre-warmed state for joining, a hash table kept in memory.
- Dictionaries. Pre-warmed, automatically updated lookup data structures from various sources. For example, you can connect a dictionary of company names from your operational database while keeping only the ids in ClickHouse.
And, just in case, the same CTE works as in SQLite:
WITH countries AS (
SELECT c1 AS code, c2 AS name FROM VALUES(
('us', 'United States'), ('fr', 'France'), ('in', 'India')))
SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
Example:
milovidov-desktop :) WITH countries AS (
SELECT c1 AS code, c2 AS name FROM VALUES(
('us', 'United States'), ('fr', 'France'), ('in', 'India')))
SELECT data.code, name FROM (SELECT 'us' AS code) AS data LEFT JOIN countries ON countries.code = data.code;
┌─code─┬─name──────────┐
│ us │ United States │
└──────┴───────────────┘
Common table expression. I don't know exactly where the "common" part comes from, but I'm guessing it's because you can use its alias multiple times in subsequent CTEs and the main query.
Various comments above expand upon it, but pg12 only changed CTEs which are referenced once to default to NOT MATERIALIZED. Multi-referenced CTEs remain materialized by default.
Nope, there’s no performance downside to using CTEs in recent postgres versions, unless the CTE is recursive or has side effects (which would be weird).
> predicate pushdown is a problem on both MSSQL and Postgresql
Not sure you’re completely on target here regarding CTE performance. I don’t have deep insight into Postgresql (but "create temporary view", yeah!). MSSQL does a pretty good job in SQL Server 2019. If the predicate is sargable in some way, pushdown is reliable. If performance is a concern, examining the residual predicates can lead to significant insights, e.g. applying index filtering, which solves obvious problems. Recursion is another story.
CTEs are more likely to be used in data analyst queries because they are an abstraction for composition. It’s not a great abstraction, but it’s better than nothing, which is mostly what you get with SQL.
Apologies I'm late to reply to this, but yes. The way each compiler/optimizer handles CTEs can be dramatically different from RDBMS to RDBMS.
For this specific use case though, my comment was made because CASE statements tend to just be bad in comparison to CTEs. Probably dependent on the engine though; I don't know them all by heart.
Isn't it more readable to make the code snippet with the case when statements into a CTE? I feel like I am misunderstanding the proposed benefits here.
For the sake of completeness, you can accomplish the same using UNION instead of VALUES and subquery instead of CTE.
WITH CTE AS (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
)
SELECT * FROM CTE
… likewise you can also do that in a subquery that you can immediately join:
SELECT * FROM (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
) SUB
INNER JOIN AnotherTable AT
    ON SUB.code = AT.code
In any non-trivial code, SELECT * is a code smell anyway. Thus parsers don’t spend cycles looking for forms that are unlikely to be useful in real code.
CTEs still give me anxiety due to the possibility that they won't be optimized. I know many database engines today optimize for this type of thing, and I'm thankful for it, but there was a time 10+ years ago when CTE expressions could take down databases. I also blame dbt for convincing people that CTEs are good. CTEs are only good if used correctly.
CTEs are just a nicer, more readable way of writing a subquery. Also, CTEs can be recursive. In Postgres, CTEs can be useful when used with the RETURNING clause. If you're using the same CTE in multiple different queries, then consider creating a view instead. I wouldn’t use a CTE for lookup values, create a lookup table instead.
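For example, a data-modifying CTE with RETURNING can move rows between tables in one statement (a sketch with hypothetical orders/orders_archive tables that share columns):

-- Archive old rows in a single statement: the DELETE's RETURNING output feeds the INSERT
WITH archived AS (
    DELETE FROM orders
    WHERE created_at < now() - interval '1 year'
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM archived;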
Worth noting that "Row values were added to SQLite version 3.15.0 (2016-10-14)" - my experience is that older versions of SQLite are a lot more common than I would hope they would be. https://www.sqlite.org/rowvalue.html
It's better when you need a lookup with multiple columns that won't be needed in any other query/ETL job. It might be steps in a funnel or values in a bespoke customer discount query.
Some DBs do allow struct/array fields but joining against a constant CTE is portable and has low overhead.
I like CTEs, but for complicated queries I find it a lot easier to make them run fast by splitting them into individual queries and creating temporary tables with appropriate indexes.
Well, if you have multiple queries, you'd still have to write the CTE multiple times. You could also use views for the same thing, and those are much more reusable.
Yeah. In many cases I've had to use CTEs like this in BI tools. When you're experimenting with datasets for dashboards it's much faster to work with CTEs than to try to make production DB changes.
I agree, I'm often using CTEs to organize sub-result-sets because asking for a production database change / view change requires a ticket, and it's just way faster to iterate in a few CTEs in a single query.
If, eventually, a few CTE distillations become common enough, then yeah, that's grounds for a request to basically shove that CTE into a view so other people can use it.
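At that point the promotion is pretty mechanical (a sketch with hypothetical names):

-- The CTE body, promoted to a view other queries and dashboards can share
CREATE VIEW active_customers AS
SELECT customer_id, email
FROM customers
WHERE last_order_at > now() - interval '90 days';

-- Callers then join or filter it like any table
SELECT * FROM active_customers WHERE email LIKE '%@example.com';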