
Is anyone here aware of a low-level approach (like TOAST is low-level) for "string interning" in Postgres tables? Not quite like the PG enum type, as there is an unbounded set of values, potentially millions of them. But each value may also have potentially millions of duplicates (usually with a power-law distribution), and the text values are large enough to be well worth deduplicating.

We currently do string-interning at the application level, by:

1. creating an explicit table of strings to join with, with an auto-increment pkey and a unique index on the string;

2. "registering" newly-discovered strings in the data, by inserting into this strings table with ON CONFLICT DO NOTHING;

3. then querying back for the string IDs, loading these into a local in-memory cache, and using the cache to map strings to their string IDs;

4. then actually inserting the data, where all strings have been mapped to string IDs. (A rough SQL sketch of these steps follows.)
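
Roughly, in SQL (the staging table and column names here are made up for illustration):

    -- step 1: the explicit table of strings
    CREATE TABLE strings (
        id    bigserial PRIMARY KEY,
        value text NOT NULL UNIQUE
    );

    -- step 2: register newly-discovered strings
    INSERT INTO strings (value)
    SELECT DISTINCT s FROM staging
    ON CONFLICT (value) DO NOTHING;

    -- step 3: query back the IDs to populate the in-memory cache
    SELECT id, value FROM strings
    WHERE value IN (SELECT DISTINCT s FROM staging);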

It really feels like something PG should be doing itself (like TOAST!). But so far I haven't seen any extension that attempts to solve this problem.




You could write a pl/pgsql function that does that mapping for you. For example: insert into some_table values (intern(str), …), where intern can insert and return the intern_id from your intern_table. For select statements and joins, you could use interned(str), which only does the select without the insert.
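
A minimal sketch of that pair of functions, assuming intern_table has a value column alongside intern_id:

    -- intern(): insert-if-missing, then return the ID
    CREATE FUNCTION intern(s text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        result bigint;
    BEGIN
        INSERT INTO intern_table (value) VALUES (s)
        ON CONFLICT (value) DO NOTHING
        RETURNING intern_id INTO result;
        IF result IS NULL THEN
            -- conflict: already interned, so just look it up
            SELECT intern_id INTO result
            FROM intern_table WHERE value = s;
        END IF;
        RETURN result;
    END;
    $$;

    -- interned(): lookup only, for selects and joins
    CREATE FUNCTION interned(s text) RETURNS bigint
    LANGUAGE sql STABLE AS $$
        SELECT intern_id FROM intern_table WHERE value = s;
    $$;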


Benefits are...? For text searching? I was getting shades of Lucene reading what you were describing, but I may be way off from your intent.


Our "text" isn't actually text, but rather 32-byte bytea values (SHA256 hashes).

(This is just a very close analogy, not what we're actually doing:) imagine a data warehouse for relationally querying IPLD-hypermedia-document-shaped data (https://ipld.io/), where every piece of data has embedded "foreign keys" to other data, in the form of content hashes (really IPFS CIDs, but you can strip those down to just a hash for internal use.) An average modelled-document-type table will have 3-12 of these foreign keys as toplevel columns; and these columns will have various indices on them, to allow the various tables of this data to be efficiently joined through these keys.

And there will be billions of records in each of these modelled-document-type tables.

We still need the content hashes — we can't just swap them out for BIGSERIAL values at ingest time and drop the original content-hash values — because people query the data by content-hash, and also expect content-hash references to other documents to appear in query results.

Throwing all the content hashes into a big "interned strings" table, and then replacing them with content-hash-IDs in the regular tables, saves tons of space (and thereby makes much more of the regular tables fit into memory at once.) And as we can join content-hash-ID foreign-keys to content-hash-ID primary-keys, we only ever need to involve the "interned strings" table 1. as an initial lookup to find the first document to work with, and 2. to translate any content-hash-IDs in the result back to content hashes.
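
As a sketch (all table and column names here are hypothetical):

    -- the big "interned strings" table of content hashes
    CREATE TABLE content_hashes (
        id   bigserial PRIMARY KEY,
        hash bytea NOT NULL UNIQUE  -- 32-byte SHA256
    );

    -- a modelled-document-type table; hash columns replaced by IDs
    CREATE TABLE documents (
        self_hash_id   bigint PRIMARY KEY REFERENCES content_hashes (id),
        parent_hash_id bigint REFERENCES content_hashes (id)
        -- ... more hash-ID foreign keys, plus payload columns
    );

    -- look up a document by public content hash ($1), translating
    -- the IDs in the result back to hashes at the API boundary
    SELECT h_self.hash AS self_hash, h_parent.hash AS parent_hash
    FROM documents d
    JOIN content_hashes h_self ON h_self.id = d.self_hash_id
    LEFT JOIN content_hashes h_parent ON h_parent.id = d.parent_hash_id
    WHERE h_self.hash = $1;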

(If you think about it carefully, while "string interning" might be the strategy used, the problem is more like "keeping a bijective mapping between large expensive globally-unique public identifiers, and cheap small per-shard-unique private identifiers, where translation occurs between global and local identifiers at the system API boundary.")


Well, hashes aren't strings; they're binary blobs often represented as hex strings. Storing them as bytea may give better performance than dropping them all into a humongous table, even though it wastes slightly more disk space (if the values indeed repeat that often).


I'm not sure you read what I wrote correctly. The technique is called "string interning" regardless of what exactly you're interning. In our case, we have a table assigning IDs to 32-byte bytea values.

(Also, to be pedantic, a Postgres BYTEA is a string; it's just what a programming language would call a "raw string" — i.e. a string of bytes. Postgres TEXT, meanwhile, is a string of characters, required to be valid in a given character encoding. The PG TEXT and BYTEA types are the same ADT with the same set of applicable operations. Of course, this puts the lie to the name "BYTEA" [i.e. "byte array"] — it's definitely not an array ADT of any kind. If it were, then you'd be able to use PG's array operations on it!)
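
For instance, string-style operations work on a bytea, while array operations don't (a quick demonstration; the exact error text may vary by PG version):

    SELECT substring('\x0102030405'::bytea FROM 2 FOR 3);  -- \x020304
    SELECT length('\x0102030405'::bytea);                  -- 5

    SELECT array_length('\x0102030405'::bytea, 1);
    -- ERROR: function array_length(bytea, integer) does not exist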


Perhaps space-efficient "tagged" data.


No. Nor have I seen any other DBMS offer such a feature.

Your application-based approach is the way you have to do it.


Columnar stores should do this at a minimum (though only within a single block, so, say, within a 100k-row block).



