
Is anyone here aware of a low-level approach (like TOAST is low-level) for "string interning" in Postgres tables? Not quite like the PG enum type, as there is an unbounded set of values, potentially millions of them. But each value may also have potentially millions of duplicates (usually with a power-law distribution), and the text values are large enough to be well worth deduplicating.

We currently do string-interning at the application level, by:

1. creating an explicit table of strings to join with, with an auto-increment pkey and a unique index on the string;

2. "registering" newly-discovered strings in the data, by inserting into this strings table with ON CONFLICT DO NOTHING;

3. then querying back for the string IDs, loading these into a local in-memory cache, and using the cache to map strings to their string IDs;

4. then actually inserting the data, where all strings have been mapped to string IDs. (A rough SQL sketch of these steps follows.)
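
Roughly, in SQL (the staging table and column names here are made up for illustration):

    -- step 1: the explicit table of strings
    CREATE TABLE strings (
        id    bigserial PRIMARY KEY,
        value text NOT NULL UNIQUE
    );

    -- step 2: register newly-discovered strings
    INSERT INTO strings (value)
    SELECT DISTINCT s FROM staging
    ON CONFLICT (value) DO NOTHING;

    -- step 3: query back the IDs to populate the in-memory cache
    SELECT id, value FROM strings
    WHERE value IN (SELECT DISTINCT s FROM staging);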

It really feels like something PG should be doing itself (like TOAST!). But so far I haven't seen any extension that attempts to solve this problem.




You could write a pl/pgsql function that does that mapping for you. For example: insert into some_table values (intern(str), …), where intern can insert and return the intern_id from your intern_table. For select statements and joins, you could use interned(str), which only does the select without the insert.
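
A minimal sketch of that pair of functions, assuming intern_table has a value column alongside intern_id:

    -- intern(): insert-if-missing, then return the ID
    CREATE FUNCTION intern(s text) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        result bigint;
    BEGIN
        INSERT INTO intern_table (value) VALUES (s)
        ON CONFLICT (value) DO NOTHING
        RETURNING intern_id INTO result;
        IF result IS NULL THEN
            -- conflict: already interned, so just look it up
            SELECT intern_id INTO result
            FROM intern_table WHERE value = s;
        END IF;
        RETURN result;
    END;
    $$;

    -- interned(): lookup only, for selects and joins
    CREATE FUNCTION interned(s text) RETURNS bigint
    LANGUAGE sql STABLE AS $$
        SELECT intern_id FROM intern_table WHERE value = s;
    $$;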


Benefits are...? For text searching? I was getting shades of Lucene reading what you were describing, but I may be way off from your intent.


Our "text" isn't actually text, but rather 32-byte bytea values (SHA256 hashes).

(This is just a very close analogy, not what we're actually doing:) imagine a data warehouse for relationally querying IPLD-hypermedia-document-shaped data (https://ipld.io/), where every piece of data has embedded "foreign keys" to other data, in the form of content hashes (really IPFS CIDs, but you can strip those down to just a hash for internal use.) An average modelled-document-type table will have 3-12 of these foreign keys as toplevel columns; and these columns will have various indices on them, to allow the various tables of this data to be efficiently joined through these keys.

And there will be billions of records in each of these modelled-document-type tables.

We still need the content hashes — we can't just swap them out for BIGSERIAL values at ingest time and drop the original content-hash values — because people query the data by content-hash, and also expect content-hash references to other documents to appear in query results.

Throwing all the content hashes into a big "interned strings" table, and then replacing them with content-hash-IDs in the regular tables, saves tons of space (and thereby makes much more of the regular tables fit into memory at once.) And as we can join content-hash-ID foreign-keys to content-hash-ID primary-keys, we only ever need to involve the "interned strings" table 1. as an initial lookup to find the first document to work with, and 2. to translate any content-hash-IDs in the result back to content hashes.
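
As a sketch (all table and column names here are hypothetical):

    -- the big "interned strings" table of content hashes
    CREATE TABLE content_hashes (
        id   bigserial PRIMARY KEY,
        hash bytea NOT NULL UNIQUE  -- 32-byte SHA256
    );

    -- a modelled-document-type table; hash columns replaced by IDs
    CREATE TABLE documents (
        self_hash_id   bigint PRIMARY KEY REFERENCES content_hashes (id),
        parent_hash_id bigint REFERENCES content_hashes (id)
        -- ... more hash-ID foreign keys, plus payload columns
    );

    -- look up a document by public content hash ($1), translating
    -- the IDs in the result back to hashes at the API boundary
    SELECT h_self.hash AS self_hash, h_parent.hash AS parent_hash
    FROM documents d
    JOIN content_hashes h_self ON h_self.id = d.self_hash_id
    LEFT JOIN content_hashes h_parent ON h_parent.id = d.parent_hash_id
    WHERE h_self.hash = $1;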

(If you think about it carefully, while "string interning" might be the strategy used, the problem is more like "keeping a bijective mapping between large expensive globally-unique public identifiers, and cheap small per-shard-unique private identifiers, where translation occurs between global and local identifiers at the system API boundary.")


Well, hashes aren't strings; they're binary blobs often represented as hex strings. Storing them as bytea may give better performance than dropping them all into a humongous table, even though it wastes slightly more disk space (if the values indeed repeat that often).


I'm not sure you read what I wrote correctly. The technique is called "string interning" regardless of what exactly you're interning. In our case, we have a table assigning IDs to 32-byte bytea values.

(Also, to be pedantic, a Postgres BYTEA is a string; it's just what a programming language would call a "raw string" — i.e. a string of bytes. Postgres TEXT, meanwhile, is a string of characters, required to be valid in a given character encoding. The PG TEXT and BYTEA types are the same ADT with the same set of applicable operations. Of course, this puts the lie to the name "BYTEA" [i.e. "byte array"] — it's definitely not an array ADT of any kind. If it were, then you'd be able to use PG's array operations on it!)
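
For instance, string-style operations work on a bytea, while array operations don't (a quick demonstration; the exact error text may vary by PG version):

    SELECT substring('\x0102030405'::bytea FROM 2 FOR 3);  -- \x020304
    SELECT length('\x0102030405'::bytea);                  -- 5

    SELECT array_length('\x0102030405'::bytea, 1);
    -- ERROR: function array_length(bytea, integer) does not exist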


Perhaps space-efficient "tagged" data.


No. Nor have I seen any other DBMS offer such a feature.

Your application-based approach is the way you have to do it.


Columnar stores should do this at a minimum (though only within a single block, so, say, within a 100k-row block).



