Hacker News new | past | comments | ask | show | jobs | submit login

Our "text" isn't actually text, but rather 32-byte bytea values (SHA256 hashes).

(This is just a very close analogy, not what we're actually doing:) imagine a data warehouse for relationally querying IPLD (https://ipld.io/) -hypermedia-document-shaped data, where every piece of data has embedded "foreign keys" to other data, that come in the form of content hashes (really IPFS CIDs, but you can strip that down to just a hash for internal use.) An average modelled-document-type table will have 3-12 of these foreign keys as toplevel columns; and these columns will have various indices on them, to allow the various tables of this data to be efficiently joined through these keys.

And there will be billions of records in each of these modelled-document-type tables.

We still need the content hashes — we can't just swap them out for BIGSERIAL values at ingest time and drop the original content-hash values — because people query the data by content-hash, and also expect content-hash references to other documents to appear in query results.

Throwing all the content hashes into a big "interned strings" table, and then replacing them with content-hash-IDs in the regular tables, saves tons of space (and thereby makes much more of the regular tables fit into memory at once.) And as we can join content-hash-ID foreign-keys to content-hash-ID primary-keys, we only ever need to involve the "interned strings" table 1. as an initial lookup to find the first document to work with, and 2. to translate any content-hash-IDs in the result back to content hashes.

(If you think about it carefully, while "string interning" might be the strategy used, the problem is more like "keeping a bijective mapping between large expensive globally-unique public identifiers, and cheap small per-shard-unique private identifiers, where translation occurs between global and local identifiers at the system API boundary.")




Well, hashes aren't strings, they are binary blobs often represented as a hex string. Storing them as bytea may give better performance than dropping them all into a humongous table, even though it wastes slightly more disk space (if values indeed repeat that often).


I'm not sure you read what I wrote correctly. The technique is called "string interning" regardless of what exactly you're interning. In our case, we have a table assigning IDs to 32-byte bytea values.

(Also, to be pedantic, a Postgres BYTEA is a string; it's just what a programming language would call a "raw string" — i.e. a string of bytes. Postgres TEXT, meanwhile, is a string of characters, required to be valid in a given character encoding. The PG TEXT and BYTEA types are the same ADT with the same set of applicable operations. Of course, this puts the lie to the name "BYTEA" [i.e. "byte array"] — it's definitely not an array ADT of any kind. If it were, then you'd be able to use PG's array operations on it!)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: