CTEs as lookup tables (misfra.me)
177 points by preetamjinka on Jan 11, 2023 | 112 comments



CTEs are low-key one of the best features of SQL. Great for debugging big queries, such as:

  with source as (
    select * from wherever
  ),
  transformed as (
    ...
  ),
  joined as (
    ...
  ),
  final as (
    ...
  )
  select * from final
You can switch `final` to `transformed` to see what the query is doing internally. Almost like having good control flow. Almost.


Almost seems like a procedural syntax like

source = ...

transformed = ...

joined = ...

final = ...


It doesn't seem procedural to me, as there is no rebinding. Such a sequence of assignments would look just as at home in a functional language like Lisp, ML, or Haskell as in Python. In procedural languages, idiomatically, you have mutation, in which variables are re-bound to new values, and side effects, through which the external environment and the program can interact with each other in ways that are unconstrained.


I prefer actually materializing the tables, so I can check what the transformed and joined tables look like.

You generally can't just swap `transformed` in for `final`, because `final` depends on the output of `transformed`?


The trick is to change the name of the CTE in your final select. This will allow you to inspect that particular step.

This works as long as your CTEs are correct SQL and only the logic is wrong or suspect.
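
A minimal sketch of the trick, using the CTE names from the comment above:

  -- instead of:
  select * from final
  -- point the final select at the step you want to inspect:
  select * from transformed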


I guess the point is that you could stick the results of each step that is currently a CTE into a #table via separate queries, so you're only doing the work once as you step through the stages?

From a debug point of view that feels more convenient to _me_.
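
E.g. a rough T-SQL sketch (table and column names are hypothetical):

  -- each step lands in a temp table you can inspect before moving on
  select * into #source from wherever;
  select * into #transformed from #source where amount > 0;
  select * into #joined from #transformed t join other o on o.id = t.id;
  -- debug: peek at any intermediate stage
  select top 100 * from #transformed;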


I guess it depends; materialising all the steps might be slow, and you might need to do a refactor to get a CTE-based query. If you just want to check a few (maybe changing) records because you expect problems there, a final select with an appropriate where is probably faster.

Edit: also, I don't like to leave a mess behind, and CTEs don't require cleaning up afterwards.


With CTEs, the optimizer won't execute the queries separately but will create one optimized version. If you materialize them, execution times can be orders of magnitude higher in cases where you only end up using small parts of the queries.


CTEs (common table expressions) are wonderful; they make SQL grokkable and maintainable. But, before some dev goes crazy refactoring all of the SQL in their codebase into maintainable CTEs: always benchmark. CTEs can cause your query planner to optimize incorrectly. In some cases, CTEs can force the query optimizer to choose a plan it otherwise would not choose and be more performant - perhaps up to a certain point. (That point being the size of the tables in question, indexes, etc.)


CTEs used to be "optimization fences" in PostgreSQL, but that changed with v12. https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...


This blog has a VERY high signal to noise ratio for anyone interested in digging into the PostgreSQL internals. Lots of great articles.


They still are sometimes. Whenever you have a CTE which is referenced more than once, to be specific.


Similarly, joining a CTE to itself can also be problematic. In that case materializing to a temp table can be faster.


I had some code from a vendor that was hanging in SQL Server. I looked at the code and they had composed everything together using smaller CTE queries, and it took forever to run. They had taken various complex criteria, executed one query for each, and then combined the results -- I ended up spending a day refactoring the whole thing to a single SELECT with all the criteria, and it ran instantly.

This was for a nightly data importer and it was filled with queries like that -- I only fixed the one giving us issues -- but I bet I could have reduced the total runtime by 90%.


Indeed, SQL is tricky like that because there are plenty of performance consequences for a language that is meant to be declarative and optimized by a query planner.

Unless you have realistic data in the test database the performance behavior of a query can be unpredictable.

SQL is a very leaky abstraction.


I'm also curious how turning some of those into views would compare, but this is also probably dependent on which database.


I'd say it would be almost as bad.

In SQL you're describing the result that you want -- you aren't (and shouldn't be) saying how to achieve that result. The engine goes off and (hopefully) finds the optimal way to create the result you are describing. When you break queries down too far, the engine has more trouble finding the optimal path.

SQL Server can "see through" views and CTEs and optimize them in combination with the outer query, but depending on the calculations it might have to generate and process intermediate data that it doesn't need. In my case above, the engine was reprocessing the same data set over and over, generating duplicate rows, and then combining everything and removing the duplicates.

My query was significantly simpler but also, in a way, harder to follow. It wasn't as clear what the intent was. Where they had nice named CTE subqueries, I had to use comments.


If writing CTEs is breaking the optimization I’d argue that’s also an example of not really meeting the “declarative” aspect of the design fully.


I agree but all systems have limitations and SQL engines get better at the job every year.

The result of the query is, say, declaration X. In this case, imagine you can construct a query that formulates X without a CTE. The database goes off and computes X from the data.

But if you start adding CTEs, say for readability, or because you're building up the query piece by piece and testing it in stages, you might be adding additional declarations A, B, and C and combining them to ultimately get X. In theory, the database engine should be able to see that you're still trying to get X and perform the query the same way as if it were a single query without a CTE. And, in some cases, it can actually do that.

But if A, B, and C are complicated enough, then it's just unable to determine that it all equals X in the end. This is a fairly understandable limitation of the engine -- SQL can get pretty complicated. It's also making the assumption that all these CTEs are actually needed to compute the result. It took me a whole day to convert a CTE-based query into a non-CTE-based query and get all the same results in the same way.


Yes, the more complicated it is, the more likely the optimizer is to get confused... but also the greater the likelihood of bugs and the greater the benefit of something that makes your query easier to read.


I have strongly encouraged reckless use of CTEs throughout our product. We use in-memory instances of SQLite to evaluate any queries which would leverage CTEs. These datasets are usually small enough to reside within L1 (and certainly within L2).


“By the time it doesn’t fit in [L2/RAM] you will have a different order of magnitude of engineering team” is one of my favorite adages. Arguably a bit less true with the venture model of growth-at-all-costs showing some cracks, but still very much a good mental model!


In my experience a huge portion of the userbase who loves CTEs are analysts or devs running queries on decently large datasets, and they mostly like them because of readability and don't understand performance possibilities.

I appreciate that tools like dbt allow materialization options as both CTEs and views/tables/etc because being able to pivot between them is super nice.


Note that there are basically no special performance considerations when using CTEs in recent versions of postgres… unless your CTE is recursive or does some weird side-effect (which is unlikely).


This is incorrect. If you use a CTE more than once in a query, Postgres will materialize that data into a temp table that has no indexes. If that CTE gets joined a lot, it will incur many O(N) searches through unindexed data.

I've seen cases where a few GBs of product data get pulled into a CTE this way. Queries take 10s to complete from processing GBs of unindexed data.


Wow, this is amazing. Just the other day I was staring at a query with multiple CTEs that took 20+ seconds to run. Ended up adding an additional WHERE clause on an indexed column to the slowest CTE, which took the execution time down to 100s of ms.

Just tried the original query, but with explicitly not materialized CTEs and the runtime was basically as fast as the manually pre-filtered solution.


Not materializing lets Postgres push extra where clauses and join conditions into the CTE so this makes sense!

Now you know why I spam all the “CTEs are amazing” threads about materialization (there have been 5 or so in the last few months)
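
A minimal sketch of what that pushdown buys you (hypothetical table and column names):

  with recent as not materialized (
    select * from events
  )
  select * from recent
  where created_at > now() - interval '1 day';
  -- NOT MATERIALIZED lets the planner rewrite this as roughly:
  --   select * from events
  --   where created_at > now() - interval '1 day';
  -- so an index on events(created_at) can be used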


Is there a way to force the temp table to have indices within the CTE definition?


No, but this would make an incredible Postgres extension.

(I’d also love one that lets Postgres use a covering index for primary storage. This can reduce data size and load less data off disk due to locality. MySQL has this by default and SQLite has it via WITHOUT ROWID tables.)
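
In the meantime, the usual workaround is to materialize by hand. A sketch, borrowing the names from the ids example elsewhere in the thread:

  -- replace the multiply-referenced CTE with a temp table built once...
  create temporary table ids_tmp as
    select u.user_id, u.token, s.customer_id
    from users u
    join stripe_customer s on s.user_id = u.user_id;
  -- ...then add exactly the indexes the later joins need
  create index on ids_tmp (token);
  create index on ids_tmp (customer_id);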


That's a strange take, assuming they don't understand performance possibilities. Many of the problems that analysts and the like work on aren't performance-bound, because focusing on performance would be a net waste of time. "Don't optimize early" and all that. For example, I had to share SQL recently where the absolute most important thing, beyond all else, was to make it readable. CTEs fit the role.


Postgres has materialization options for CTEs. You can ask it to rewrite the CTE as a join via "WITH foobar AS NOT MATERIALIZED (...)". Alternatively you can force materialization by writing "WITH foobar AS MATERIALIZED (...)"


Readability over optimization is often a reasonable compromise, don't you think?


So could an optimizer be built that lets you have your cake and eat it too? CTE syntax but without the performance compromise?


Based on other comments, this has been done for Postgres, or at least to some extent. So that's the other thing... pre-emptively writing code in a non-intuitive way for performance reasons often leads to practices sticking around long past the point where the rationale for them is even true.


In my experience no, not in a production database that's doing real work.


100% true.

I wrote some very elegant, readable SQL to perform a complex query. It was dog slow.

I handed it to a DBA, they ripped out the CTEs and replaced them all with temp tables. The query was an unreadable mess at the end, but boy was it orders of magnitude faster.


Did they do something else other than just popping CTE queries into temp tables? In my experience, _just_ doing that doesn't affect readability much at all.


Not that I recall, but I found it much less readable personally.


Behold, the declarative 4th-generation language.


>always benchmark

…on real or at least representative data.


This is a great idea if you disable materialization (which the author of this post does not mention).

Yes, materialization is fine for small hardcoded values in the post, but for most other lookup tables, it's definitely not fine.

For example, you read this article and think great, I'll make a lookup table to map various ids to be linked across tables.

    with ids as (
        select u.user_id, u.token, s.customer_id
        from users u
        join stripe_customer s on s.user_id = u.user_id
    ),
    user_impressions as (
        select i.*, ids.user_id
        from impressions i join ids using (token)
    ),
    payments as (
        select p.*, ids.user_id from payments p
        join ids using (customer_id)
    )
    select sum(i.views),
           sum(i.clicks),
           sum(p.amount)
    from payments p
    join user_impressions i using (user_id)
    where user_id = $1
    group by date;
You think, great, this gets some stats for one user since it's filtered at the end, and with indexes on stripe_customer(user_id), impressions(token), and payments(customer_id), this will be really fast and efficient!

Nope.. since ids is referenced more than once in this query, in Postgres this causes the ids CTE to be materialized onto disk with no indexes. So not only does it take up lots of extra space on disk to store all the users in your system, but to join user_impressions and payments you have to do an O(N) search across the ids dataset. No indexes on a materialized table.

It would be a lot faster to join to users or stripe_customer or both in a loop with indexes than to O(N) search through all the users in your system.

This is particularly dangerous because if you remove the payments part, ids only has one reference, so Postgres doesn't materialize the table, and everything is fast.

This can be fixed by adding WITH foobar AS NOT MATERIALIZED to the CTE syntax. IMO it should be a syntax error not to specify AS MATERIALIZED or AS NOT MATERIALIZED. The default has too many potential performance problems, and folks should have to think about whether they want materialization or not.
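
To make that concrete, the fix for the query above is one clause on the first CTE (the rest is unchanged):

  with ids as not materialized (
      select u.user_id, u.token, s.customer_id
      from users u
      join stripe_customer s on s.user_id = u.user_id
  ),
  ...
Postgres can then inline ids into each reference and use the indexes on users and stripe_customer instead of scanning a materialized copy.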


`NOT MATERIALIZED` is now the default in Postgres, and has been since PG12 I believe


Only when the CTE is referenced once. If it's referenced more than once, it is automatically materialized.


This is a pretty huge caveat, one that I'm very thankful to know of. Thanks for the heads up!


Side note: I asked ChatGPT to rewrite this query to use subqueries and it did a reasonable job.

Next step: include a CTE-to-subquery translation step as part of your build pipeline. I've never needed tools like that, but I guess they must exist, since it's been such a common issue. Using GPT for this is like nuking a mosquito from orbit.


AS NOT MATERIALIZED is equivalent to using subqueries.


Seems like an interesting idea, but could use a better example, at least for those who aren't yet intermediate level in SQL.

In what world is this

  WITH countries (code, name) AS (
    SELECT * FROM (VALUES
      ('us', 'United States'), ('fr', 'France'), ('in', 'India')
    ) AS codes
  )
  SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;

easier to read than this

  SELECT code,
    CASE code
      WHEN 'us' THEN 'United States'
      WHEN 'fr' THEN 'France'
      WHEN 'in' THEN 'India'
    END AS country
  FROM data;


Yea this is a common problem with contrived code samples like these. In my experience the author is right in that real world examples of using CASE can often get out of hand - especially if you need it in more than one place. But the contrived, simple example is clearly easier than the CTE.

It is perhaps not surprising that we end up with this sort of stuff in production code, because the original author only needed one simple CASE statement and then it organically grew from there.


When you want more than one column


To me the advantage comes when you use some sort of query builder. You can pass your list in code rather than having to write the SQL.


To me the first example makes me assume that there isn’t yet a `country` table to select from. Hence the `from values()` clause.

If you were to run the first query on a fresh db, it’d return data. Running the second would fail.


Occasionally, SQL surprises with bits of composability, such as the fact that using VALUES to specify literal tuples can be used with both INSERT and the FROM clause of a SELECT.

Is there any reason syntactically the SELECT needs to be required? If you use a VALUES table literal as a subselect, you have to give it column names with an AS clause (see https://www.postgresql.org/docs/current/sql-values.html). I can imagine a simpler syntax where you'd do WITH VALUES ... AS table_name (column_1_name, ...).

Is there any reason to alias the lookup table as `codes`?
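
For what it's worth, Postgres and SQLite already accept something close to that: the CTE body can be a bare VALUES list, which drops both the SELECT wrapper and the alias. A sketch:

  WITH countries (code, name) AS (
    VALUES ('us', 'United States'), ('fr', 'France'), ('in', 'India')
  )
  SELECT * FROM countries;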

There are apparently lots of other clever uses of the WITH clause, such as https://www.postgresql.org/docs/current/queries-with.html#QU...


Yes, it's what I'm doing, example here: https://gist.github.com/revskill10/57ecd8efb72f361b93e6d9d9f...

Basically, I could put the with: values after the join.


What is that YAML query language abomination in the other snippets?


sqlite doesn't have linear regression functions, and doing the math manually is a bit awkward because "b" relies on "m". Instead of duplicating the math to calculate "b", here's how to do it with CTEs:

  CREATE TABLE vals (x, y);
  INSERT INTO vals VALUES (1, 1), (2, 0.5), (3, 0.4), (4, 0.1), (5, 0);
  
  WITH
     m(v) AS (
        SELECT ((COUNT(*) * (SUM(x * y))) - (SUM(x) * SUM(y))) / ((COUNT(*) * SUM(POW(x, 2))) - (POW(SUM(x), 2)))
        FROM vals
     ),
     b(v) AS (
        SELECT (SUM(y) - (m.v * SUM(x))) / COUNT(*)
        FROM vals JOIN m
     )
  SELECT
     x AS real_x,
     y AS real_y,
     m.v AS m,
     b.v AS b,
     x * m.v + b.v AS interp_y
  FROM vals JOIN m JOIN b;

  real_x  real_y  m      b     interp_y           
  ------  ------  -----  ----  -------------------
  1       1       -0.24  1.12  0.88               
  2       0.5     -0.24  1.12  0.64               
  3       0.4     -0.24  1.12  0.4                
  4       0.1     -0.24  1.12  0.16               
  5       0       -0.24  1.12  -0.0800000000000001
If all you want is "m" and "b" the final query can just be "FROM m JOIN b" and the result will be 1 row.


I'd love to see this in blog post form!

This seems like a case where it would be nice to be able to pass in a parameter `vals (x, y)`.


I've tried to get myself writing a few times but never succeeded, so I don't have a blog. But - I am actually doing parameters like so in the view where I copied the above from:

   CREATE TABLE view_confs (key, value);
   INSERT INTO view_confs VALUES ('start', 2), ('end', 4);
Now prefix it with another one:

   vals_range(x, y) AS (
      SELECT x, y
      FROM vals
      WHERE x >= (SELECT value FROM view_confs WHERE key = 'start')
        AND x <  (SELECT value FROM view_confs WHERE key = 'end')
   ),
And swap the rest of the original query to use "vals_range" instead of "vals". Just gotta remember to update view_confs with the new range whenever you want to change the view. I think it should work inside a transaction to avoid multiple threads interfering with each other. But to be clear: this is a workaround for sqlite not having stored procedures, and for not wanting to implement it in application code (so I can JOIN to the view in other queries). Better to use stored procedures in other databases than this workaround.


Removing CTEs from your codebase by replacing them with the creation of a temp table, then populating it with separate queries using as few joins as possible, will give you low-locking performance hundreds of times better nearly every time. No problem with readability.


I see this pattern all the time in mssql code, but hardly ever elsewhere. I've always wondered if there is a reason for that. Are you coming from that background? I'm curious if anyone knows why it is so favored there?

I'm not sure I agree re performance btw. I've fixed a lot of slow mssql queries by changing them in the other direction (EDIT: or just adding a join). I do see how temp tables might shorten the time you hold locks---but then you might be subverting your transaction isolation level.


I have also seen the performance boost by using separate temp tables rather than CTEs (in my case in Redshift).

My hypothesis is that while memory use may be the same either way, you don't have the same transactional/locking requirements/overhead/consistency with multiple statements as you do with a single big set of CTEs. And for analytics you rarely need the transactional isolation required by a single big CTE.


That's because Redshift doesn't support MATERIALIZED CTEs. In mainline Postgres, a MATERIALIZED CTE acts more like a temporary table with the performance characteristics you're looking for.


Except those won’t have any indexes on them, so any joins will be horrifically slow?


Are you putting indexes on your temporary tables? Are you including the time it takes to create them?

I can't speak for others, but most of my CTEs tend to be small enough where an index wouldn't help much. Of course toggling MATERIALIZED is an easy way to see whether the equivalent of a temp table would work best or if the planner can successfully include the body of CTE in optimization. A lot harder syntactically to swap out a subquery for a temp table, which is what the MATERIALIZED keyword essentially does under the hood (at least in Postgres).


Hell, on small tables Postgres often won't do an index scan even when one exists, because it's faster to do a linear scan and not an index read + fetches - or at least the planner thinks so.


It's not necessarily about locking, it's a way to control what order joins get done in, which, at least in MS SQL, is up to the query engine to decide. By writing to a temp table you can override the query engine's plan, often with huge performance gains.

Less known is that you can get the same effect using a TOP N statement where N is a number larger than the rows you might get back. This presumably avoids the additional i/o penalty of the temp table creation, but I've never validated that. This trick likely is portable across db engines, too.
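
A sketch of the trick as described, in hypothetical T-SQL (table names are made up, and I haven't validated the i/o claim either):

  select o.*
  from (
      -- TOP with a bound larger than any possible result nudges the
      -- engine into evaluating this derived table as a unit first
      select top (2147483647) customer_id, sum(total) as lifetime_total
      from orders
      group by customer_id
  ) t
  join order_events o on o.customer_id = t.customer_id;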

See https://bertwagner.com/posts/does-the-join-order-of-my-table... - Adam Machanic has some good videos diving deeper into this.


It depends on the use of the database. Where there is intense transactional writes (e.g. case management system, etc.) the CTE approach can easily get locked up, especially if the write queries do a lot of their own lookups. I have come from an MSSQL background, so maybe that is an artifact either of the DB or the way it is used in my own history :) Good question though, and I'd like to hear from someone who has experience in other DBs.

WRT subverting the transaction isolation level, using temp tables is a tacit decision to do so and best for non-vital read queries in my own experience. That is, I used it for lookups of tables that did not _need_ to be completely up to date, or when I knew the data _would_ be up to date.


I find that anyone who praises chaining CTEs has not dealt with how terrible they can become; each sub-table being able to take a dependency on any of the parent tables is not a feature that helps you go fast.


I've used CTEs fairly extensively so I'm generally familiar with their pitfalls, but I don't quite grok your point here. Can you go into a little more detail about what you mean by "take on any dependency in the parent tables"?


But careful! In the concurrent environment, separate queries may need a higher transaction isolation level to mean the same thing as one query.


In cases like in the example given, I would also consider using a generated column (based on the CASE expression, in the example). That way the CTE doesn’t need to be repeated on each affected query — any query on the table can directly reference the derived value. Of course, this approach is only applicable for row-based values of a single table, and requires you to be able to modify the schema to add the generated column.
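
A sketch in Postgres syntax, against the article's hypothetical data table (MySQL and SQLite have similar GENERATED ALWAYS AS support):

  ALTER TABLE data
    ADD COLUMN country text GENERATED ALWAYS AS (
      CASE code
        WHEN 'us' THEN 'United States'
        WHEN 'fr' THEN 'France'
        WHEN 'in' THEN 'India'
      END
    ) STORED;
  -- any query can now reference data.country directly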


This also works in MySQL 8.0.19 and above but the syntax of the CTE example needs to be tweaked to use the ROW() constructor:

    WITH countries (code, name) AS (
        SELECT * FROM (VALUES 
            ROW('us', 'United States'),
            ROW('fr', 'France'),
            ROW('in', 'India')
        ) AS codes
    )
    SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;


ClickHouse has the `transform` function for this purpose:

    SELECT transform(id, [1, 2, 11], ['ClickHouse', 'Postgres', 'MongoDB'], '')
It creates a lookup table. When you are using the CASE operator, in simple cases it will use the `transform` function under the hood.

There are more advanced capabilities:

- The `Join` table engine. It is a pre-warmed state for joining, a hash table kept in memory.

- Dictionaries. Pre-warmed, automatically updated lookup data structures from various sources. For example, you can connect a dictionary of company names from your operational database while keeping only the ids in ClickHouse.

And, just in case, the same CTE works as in SQLite:

    WITH countries AS (
      SELECT c1 AS code, c2 AS name FROM VALUES(
        ('us', 'United States'), ('fr', 'France'), ('in', 'India')))
    SELECT data.code, name FROM data LEFT JOIN countries ON countries.code = data.code;
Example:

    milovidov-desktop :) WITH countries AS (
                        SELECT c1 AS code, c2 AS name FROM VALUES(
                            ('us', 'United States'), ('fr', 'France'), ('in', 'India')))
                        SELECT data.code, name FROM (SELECT 'us' AS code) AS data LEFT JOIN countries ON countries.code = data.code;

    ┌─code─┬─name──────────┐
    │ us   │ United States │
    └──────┴───────────────┘
Disclaimer: I'm working on ClickHouse.


What does CTE stand for in this context?


Common table expression. I don't know exactly where the "common" part comes from, but I'm guessing it's because you can use its alias multiple times in the subsequent CTEs and the main query.



common table expression. It's basically like a view that can be defined at the beginning of a query, and can be referenced within that query.


Common Table Expression (the WITH clause).


Common Table Expression


Common Table Expression.


Fun fact, this can also be more performant, depending on the engine.


CTEs can also perform very poorly and often in surprising ways. For example, predicate pushdown is a problem on both MSSQL and Postgresql.


My understanding was that Postgres fixed this back in version 12. Are there still limitations here?


Various comments above expand upon it, but pg12 only changed CTEs which are referenced once to default to NOT MATERIALIZED. Multi-referenced CTEs remain materialized by default.

Also not-mat CTEs can perform a lot worse: https://stackoverflow.com/questions/64016236/postgres-12-mat...

But so can MAT CTEs: https://dba.stackexchange.com/questions/257014/are-there-sid...

So the limitations are that it’s very much ymmv.


Nope, there’s no performance downside to using CTEs in recent postgres versions, unless the CTE is recursive or has side effects (which would be weird).


> predicate pushdown is a problem on both MSSQL and Postgresql

Not sure you’re completely on target here regarding CTE performance. I don’t have deep insight into Postgresql (but “create temporary view", yeah!). MSSQL does a pretty good job in Sql2019. If the predicate is sargable in some way, pushdown is reliable. If performance is a concern, examining the residuals can lead to significant insights, e.g. applying index filtering, which solves obvious problems. Recursion is another story.

CTE is more likely used by data analyst queries because it is an abstraction of composition. It’s not a great abstraction but it’s better than nothing, which is mostly what you get with SQL.


Apologies I'm late to reply to this, but yes. The way each compiler/optimizer handles CTEs can be dramatically different from RDBMS to RDBMS.

For this specific use case though, my comment was made because CASE statements tend to just be bad in comparison to CTEs. Probably dependent on engine though, I don't know them all by heart.


When is it more performant?


Isn't it more readable to make the code snippet with the case when statements into a CTE? I feel like I am misunderstanding the proposed benefits here.


For the sake of completeness, you can accomplish the same using UNION instead of VALUES, and a subquery instead of a CTE.

  WITH cte AS (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
  )
  SELECT * FROM cte;

… likewise you can also do that in a subquery that you can immediately join:

  SELECT *
  FROM (
    SELECT 'NY' AS code, 'New York' AS state
    UNION
    SELECT 'NJ', 'New Jersey'
  ) sub
  INNER JOIN AnotherTable at ON sub.code = at.code;


Those UNIONs should be UNION ALL, otherwise the results are deduplicated. Thus your code is worse. Also, the VALUES expression is nicer when done in longer form:

  WITH my_cte AS (
    SELECT * FROM VALUES
       (1, 'column 2 value', 3.0),
       (2, 'column 2 value', 3.0),
       (3, 'column 2 value', 3.0),
       (4, 'column 2 value', 3.0)
  )

You can often alias the VALUES values like:

  WITH my_cte AS (
    SELECT * FROM VALUES
       (1, 'column 2 value', 3.0),
       (2, 'column 2 value', 3.0),
       (3, 'column 2 value', 3.0),
       (4, 'column 2 value', 3.0)
       AS t(col1_name, col2_name, col3_name)
  )

and some DBs allow you to alias via the CTE name:

  WITH my_cte(col1_name, col2_name, col3_name) AS (
    SELECT * FROM VALUES
       (1, 'column 2 value', 3.0),
       (2, 'column 2 value', 3.0),
       (3, 'column 2 value', 3.0),
       (4, 'column 2 value', 3.0)
  )


In the last example, it seems like it would be nice for the DB to let you omit the `SELECT * FROM` part.


In any non-trivial code, select * is a code smell anyway, so parsers don't spend cycles looking for forms that are unlikely to be useful in real code.


CTEs still give me anxiety because of the possibility that they won't be optimized. I know many database engines today optimize for this type of thing, and I'm thankful for it, but there was a time 10+ years ago when CTE expressions could take down databases. I also blame dbt for convincing people that CTEs are good. CTEs are only good if used correctly.


Well, even if you are unmoved by the readability thing, I don't think there's a non-CTE answer to recursively finding rows.
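
Right -- e.g. walking a hierarchy pretty much requires a recursive CTE. A sketch with a hypothetical employees table:

  with recursive subordinates as (
    select id, manager_id, name
    from employees
    where id = 42                -- the root of the subtree
    union all
    select e.id, e.manager_id, e.name
    from employees e
    join subordinates s on e.manager_id = s.id
  )
  select * from subordinates;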


CTEs are just a nicer, more readable way of writing a subquery. Also, CTEs can be recursive. In Postgres, CTEs can be useful when used with the RETURNING clause. If you're using the same CTE in multiple different queries, then consider creating a view instead. I wouldn’t use a CTE for lookup values, create a lookup table instead.


Not only that, but in Postgres you can do inserts and updates in the CTEs and chain them together, e.g.

  WITH cte1 AS (
    SELECT ...
  ), cte2 AS (
    INSERT ... RETURNING insertid
  )
  INSERT INTO table2 (fk, otherdata)
  SELECT insertid, otherdata FROM cte2;

The CTEs are executed as part of a single statement, in one transaction -- no need to use multiple calls to insert relational data.


This is a neat trick.

Worth noting that "Row values were added to SQLite version 3.15.0 (2016-10-14)" - my experience is that older versions of SQLite are a lot more common than I would hope they would be. https://www.sqlite.org/rowvalue.html


Why is this better than just joining an actual table that expands the codes?


In this case it isn't.

It's better when you need a lookup to multiple columns but will not be needed in any other query/ETL job. Might be steps in a funnel or values in a bespoke customer discount query.

Some DBs do allow struct/array fields but joining against a constant CTE is portable and has low overhead.


There often isn’t an “actual” table.


I like CTEs, but for complicated queries I find it lot easier to make them run fast by splitting into individual queries and creating temporary tables with appropriate indexes.


Well, if you have multiple queries, you'd still have to write the CTE multiple times. You could also use views for the same thing, and those are much more reusable.


In that case, you might as well use tables. But a lot of people don't have write access for either tables or views.


Yeah. In many cases I've had to use CTEs like this in BI tools. When you're experimenting with datasets for dashboards it's much faster to work with CTEs than to try to make production DB changes.


I agree, I'm often using CTEs to organize sub-result-sets because asking for a production database change / view change requires a ticket, and it's just way faster to iterate in a few CTEs in a single query.

If, eventually, a few CTE distillations become common enough, then yeah, that's grounds for a request to basically shove that CTE into a view so other people can use it.


CTEs are great, but they often cause issues for optimizers.


That's a really neat trick. Gosh, CTEs are great.


I've been doing this for ages and am a big evangelist of CTEs. Don't just copy and paste the same hairy sub-query over and over!


CTEs are great for breaking down a large query into discrete steps. They're basically SQL's equivalent of a pipe function.


I loathe using sub queries.

I use ctes for everything and do my joins at the end.

This seems far more readable.


I do this with lateral joins instead.


Sucks for localization though.



