The surprising impact of medium-size texts on PostgreSQL performance (2020) (hakibenita.com)
105 points by skadamat on Aug 30, 2023 | 31 comments



I remember that when I was working with Oracle 20 years ago, the DBA liked to make sidecar tables to hold the big columns so they were out of the way of most queries. It was awkward, but I guess it was worth it. One time, after starting to use Postgres, I almost did the same thing, but then I realized the database was already doing it for me. TOAST really is the greatest thing since sliced bread!

If you made it to the end of the article, you saw that the author recommends considering making these sidecar tables by hand if you have medium-size texts. I think I would try very hard to tune TOAST before going that far. Also note that changing the table storage parameters doesn't automatically rewrite the existing data; you need to re-store the data to make that happen. I believe `UPDATE t SET foo = foo` would do it, but I haven't tested.
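For reference, a rough sketch of the knobs involved (the table and column names here are made up, and the rewrite trick is untested, as noted above):

    -- Lower the per-tuple threshold at which Postgres starts TOASTing values
    -- (the default is roughly 2 kB; the minimum is 128 bytes):
    ALTER TABLE t SET (toast_tuple_target = 256);

    -- Or change a single column's strategy (EXTERNAL = out-of-line, uncompressed):
    ALTER TABLE t ALTER COLUMN foo SET STORAGE EXTERNAL;

    -- Neither command rewrites existing rows, hence the need to re-store the data:
    UPDATE t SET foo = foo;  -- untested, per the caveat above
    VACUUM t;                -- clean up the dead tuples the UPDATE leaves behind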

The 8kB page size also matters when you use Postgres as a vector database, e.g. with pgvector. Jonathan Katz gave a great lightning talk about this a few months ago at PGCon. Vectors these days can have a lot of dimensions, so they are big. But since the whole point is to query against the vector, you don't really want it TOASTed. I believe he has been working on some tricks to solve that problem.
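One workaround that often gets mentioned (not necessarily what he's working on) is to force the vector column to stay inline and uncompressed, which only works while the whole row still fits in an 8kB page; a 1536-dimension float4 vector is about 6 kB, for example. Hypothetical table and column names:

    ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;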


Is anyone here aware of a low-level approach (low-level like TOAST is) to "string interning" in Postgres tables? It's not quite like the PG enum type, as there is an unbounded set of values, potentially millions of them. But each value may also have potentially millions of duplicates (usually with a power-law distribution), and the text is large enough to be well worth deduplicating.

We currently do string-interning at the application level, by:

1. creating an explicit table of strings to join with, with an auto-increment pkey and a unique index on the string;

2. "registering" newly-discovered strings in the data, by inserting into this strings table with ON CONFLICT DO NOTHING;

3. then querying back for the string IDs, loading these into a local in-memory cache, and using the cache to map strings to their string IDs;

4. then actually inserting the data, where all strings have been mapped to string IDs.

It really feels like something PG should be doing itself (like TOAST!). But so far I haven't seen any extension that attempts to solve this problem.
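For the curious, the flow above looks roughly like this in SQL (a sketch; all of the names are made up):

    -- Step 1: the explicit strings table.
    CREATE TABLE strings (
        id  bigserial PRIMARY KEY,
        val text NOT NULL UNIQUE   -- the unique index doubles as the lookup index
    );

    -- Step 2: register any newly-discovered strings in the batch.
    INSERT INTO strings (val)
    SELECT v FROM unnest(ARRAY['foo', 'bar', 'baz']) AS t(v)
    ON CONFLICT (val) DO NOTHING;

    -- Step 3: query back the IDs to populate the in-memory cache.
    SELECT id, val FROM strings WHERE val = ANY (ARRAY['foo', 'bar', 'baz']);

    -- Step 4: insert the actual data, with string IDs in place of the strings,
    -- e.g. INSERT INTO documents (..., string_id) VALUES (..., 42);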


You could write a pl/pgsql function that does that mapping for you. For example, `INSERT INTO some_table VALUES (intern(str), …)`, where intern() inserts if necessary and returns the intern_id from your intern_table. For SELECT statements and joins you could use an interned(str) variant that only does the lookup, without inserting.
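Something along these lines, perhaps (a sketch; intern_table, intern_id and str are placeholder names, and a unique index on str is assumed):

    CREATE OR REPLACE FUNCTION intern(p_str text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        v_id bigint;
    BEGIN
        SELECT intern_id INTO v_id FROM intern_table WHERE str = p_str;
        IF v_id IS NULL THEN
            INSERT INTO intern_table (str) VALUES (p_str)
            ON CONFLICT (str) DO NOTHING
            RETURNING intern_id INTO v_id;
            IF v_id IS NULL THEN
                -- lost a race with a concurrent insert: fetch the winner's id
                SELECT intern_id INTO v_id FROM intern_table WHERE str = p_str;
            END IF;
        END IF;
        RETURN v_id;
    END;
    $$;

    -- Lookup-only variant for SELECTs and joins:
    CREATE OR REPLACE FUNCTION interned(p_str text) RETURNS bigint
    LANGUAGE sql STABLE AS $$
        SELECT intern_id FROM intern_table WHERE str = p_str;
    $$;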


Benefits are...? For text searching? I was getting shades of Lucene reading what you were describing, but I may be way off about your intent.


Our "text" isn't actually text, but rather 32-byte bytea values (SHA256 hashes).

(This is a very close analogy, not what we're actually doing.) Imagine a data warehouse for relationally querying IPLD (https://ipld.io/) hypermedia-document-shaped data, where every piece of data has embedded "foreign keys" to other data in the form of content hashes (really IPFS CIDs, but you can strip those down to just a hash for internal use). An average modelled-document-type table will have 3-12 of these foreign keys as top-level columns, and these columns will have various indices on them to allow the various tables of this data to be efficiently joined through these keys.

And there will be billions of records in each of these modelled-document-type tables.

We still need the content hashes — we can't just swap them out for BIGSERIAL values at ingest time and drop the original content-hash values — because people query the data by content-hash, and also expect content-hash references to other documents to appear in query results.

Throwing all the content hashes into a big "interned strings" table, and then replacing them with content-hash-IDs in the regular tables, saves tons of space (and thereby makes much more of the regular tables fit into memory at once.) And as we can join content-hash-ID foreign-keys to content-hash-ID primary-keys, we only ever need to involve the "interned strings" table 1. as an initial lookup to find the first document to work with, and 2. to translate any content-hash-IDs in the result back to content hashes.

(If you think about it carefully, while "string interning" might be the strategy used, the problem is more like "keeping a bijective mapping between large expensive globally-unique public identifiers, and cheap small per-shard-unique private identifiers, where translation occurs between global and local identifiers at the system API boundary.")
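Concretely, the shape is something like this (hypothetical names, heavily trimmed):

    -- the "interned strings" table: one row per distinct content hash
    CREATE TABLE hashes (
        id   bigserial PRIMARY KEY,
        hash bytea NOT NULL UNIQUE       -- the 32-byte SHA-256 values
    );

    -- a modelled-document-type table: its "foreign keys" are cheap bigints
    CREATE TABLE documents (
        self_hash_id   bigint NOT NULL REFERENCES hashes (id),
        parent_hash_id bigint REFERENCES hashes (id),
        author_hash_id bigint REFERENCES hashes (id)
        -- ...plus the modelled fields, and indexes on the *_hash_id columns
    );

    -- translation happens only at the boundary, e.g. resolving an incoming hash:
    -- SELECT d.* FROM documents d JOIN hashes h ON h.id = d.self_hash_id WHERE h.hash = $1;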


Well, hashes aren't strings; they're binary blobs often represented as hex strings. Storing them as bytea may give better performance than dropping them all into a humongous table, even though it wastes slightly more disk space (if values indeed repeat that often).


I'm not sure you read what I wrote correctly. The technique is called "string interning" regardless of what exactly you're interning. In our case, we have a table assigning IDs to 32-byte bytea values.

(Also, to be pedantic, a Postgres BYTEA is a string; it's just what a programming language would call a "raw string" — i.e. a string of bytes. Postgres TEXT, meanwhile, is a string of characters, required to be valid in a given character encoding. The PG TEXT and BYTEA types are the same ADT with the same set of applicable operations. Of course, this puts the lie to the name "BYTEA" [i.e. "byte array"] — it's definitely not an array ADT of any kind. If it were, then you'd be able to use PG's array operations on it!)


Perhaps space-efficient "tagged" data.


No. Nor have I seen any other DBMS offer such a feature.

Your application-based approach is the way you have to do it.


Columnar stores should do this at a minimum (within one block though, so, say, within a 100k-row block).


I was pretty confused by the index performance when getting ~50% of the table.

I think what's happening is that since the tuples in the medium table are larger, fewer are stored on each page. Just looking at the sizes of the tables:

medium: 977 MB

large: 25 MB

we can assume that the medium query had to access 39.08 (977 / 25) times as many pages.

medium time: 3198.539 ms

large time: 85.74 ms

3198.539 / 85.74 ≈ 37, which seems to track.

What was confusing me is that I thought the I/O would take longer for the large text objects because there's more data. I think the reason this isn't the case is that TOAST is probably optimised to make fewer I/O calls, each of them much larger. It's not necessarily the size of the data, but rather how many syscalls you have to make to request it.

I'm pretty unsure about this, but it seems to make sense now.
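(For anyone wanting to reproduce the size comparison: pg_relation_size counts only the main fork, i.e. it excludes the TOAST table, while pg_table_size includes it. The table name below is a guess at the article's; adjust as needed.)

    SELECT pg_size_pretty(pg_relation_size('toast_test_medium')) AS main_table,
           pg_size_pretty(pg_table_size('toast_test_medium'))    AS including_toast;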



Thanks! Macroexpanded:

The surprising impact of medium-size texts on PostgreSQL performance - https://news.ycombinator.com/item?id=24836979 - Oct 2020 (16 comments)


Somewhat related is the use and abuse of jsonb columns in Postgres. These also get stored in TOAST tables and can cause a lot of performance issues, as opposed to storing the same data in actual tables and columns (not JSON).


If you store large jsonb values in the database, storing them out of line may be a good thing - for example, you can query metadata much faster.

Also, it's worth mentioning that jsonb is only beneficial when you plan on indexing/manipulating the data in the database. Otherwise, the json type may be a better choice (smaller size, stored as text, parsed on demand).
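A minimal illustration of the tradeoff (hypothetical names): jsonb supports indexing and in-database operators, while json is stored as the original text and only parsed when read.

    -- queryable: binary representation, plus a GIN index for fast @> containment lookups
    CREATE TABLE events_qry (payload jsonb);
    CREATE INDEX ON events_qry USING gin (payload jsonb_path_ops);

    -- pass-through: stored as text, no normalization on write, parsed on demand
    CREATE TABLE events_raw (payload json);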


Interesting, I wasn't aware of this system for side-carring content. I will definitely check it out. It makes storing long logs and content in the DB an even better choice (unless you need a CDN, of course).


This comment does not address the content of the article. I would be interested to know what tool is used to generate the svg table figures.


The tables look like they were created with https://excalidraw.com/ with the "sloppiness" option set to high and pointy corners activated.

EDIT: Now confirmed by author https://news.ycombinator.com/item?id=37323895


I used excalidraw.com


It looks similar to the style you get from draw.io if you choose "Sketch" as the style for an object (add box -> "Style" from right sidebar -> Sketch).


Looks like they are hand drawn


tl;dr: PostgreSQL stores small values directly in the table, but has to store large values outside of the table, with the table merely containing a reference to the external storage. In either case, the table itself stays small, and performance is fast. However, if you store data that is just under the size threshold for being moved outside of the table, the table gets large, and performance degrades. Thus, the tl;dr of the tl;dr: small and large TEXT values are faster than medium-sized TEXT values.


'tis why some newer DB storage systems support variable-sized pages.

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf


...for some definition of faster


Objectively slower than large/small strings. I don't think it requires that much nuance to understand the point.


*IF* your workload ignores the value of the string. That's the nuance.

From the article:

> We executed a query against all three tables to search for a string within the text value...This time, the results are more consistent with what we would expect. [Large is slower than medium.]

---

I.e., this isn't a "bug"; this is a performance tradeoff. Transitioning values from inline to a separate table has performance positives and negatives. This article highlights a performance positive from TOAST once the toast_tuple_target boundary is crossed.


This is a somewhat-artificial case, though. In most cases I've seen in practice, even when you're using the value of the string, you're also filtering by some non-text columns, where the compactness of the TOASTed table allows you to do the filtering step on the left side of the implicit TOAST-join much faster (and keep more of the left table hot in memory); so that even despite needing to implicitly join in the TOAST records, you're only joining them in for the filtered rowset — usually a win overall.


In most cases I've seen in practice, you're querying against an index.

This drastically reduces the impact of unused fields compared to a full table scan.


Only faster for queries that don’t care about the value of the text, and do a full table scan instead of reading from the index


Or when you use the index, but still have to do a lot of IO.


Microsoft SQL Server does this as well with [n]varchar(max): any value that exceeds a data page's 8k size goes off to a special, separate data page.

tl;dr: CPU caches are small and text is large. Using numeric fields will always be faster: "hackyhacky" is 10 bytes, equal to 1.25 Int64s.



