If the goal is to count distinct users who have seen something, but also to efficiently retrieve the actual set of users who have seen something — then why wasn't a roaring bitmap (https://pgxn.org/dist/pg_roaringbitmap/) considered as an option? Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap should do everything hstore does, just a lot more efficiently.
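For instance, a minimal sketch of what that could look like with pg_roaringbitmap (the rb_* functions are from the extension's documented API; the table and column names here are just illustrative):

```sql
-- Assumes the pg_roaringbitmap extension is installed; names are illustrative.
CREATE EXTENSION IF NOT EXISTS roaringbitmap;

CREATE TABLE item_viewers (
    item_id bigint PRIMARY KEY,
    viewers roaringbitmap NOT NULL
);

-- Record that user 42 has seen item 7: create the row, or set the bit on conflict.
INSERT INTO item_viewers (item_id, viewers)
VALUES (7, rb_build(ARRAY[42]))
ON CONFLICT (item_id)
DO UPDATE SET viewers = rb_add(item_viewers.viewers, 42);

-- Distinct-viewer count plus the actual set of user IDs, in one cheap lookup.
SELECT rb_cardinality(viewers) AS seen_count,
       rb_to_array(viewers)    AS seen_user_ids
FROM   item_viewers
WHERE  item_id = 7;
```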
I just wanted to say thanks for linking. I feel like I'm always on the search for new PG storage extensions, even if I'm not able to use them everywhere (sad to see AWS RDS wasn't in the list of vendors that include/support it).
The docs/paper for roaring bitmaps say they're meant for large datasets, but I must have missed where "large" is defined. In practice, how large does a set have to be to take advantage of roaring bitmaps?
The essential consideration is that roaring bitmaps are compressed — trading increased CPU overhead per read/write, for lower size-on-disk => more rows fetched per IOP + better cache-coherency. So think of the point where the CPU overhead of compression/decompression for every read/write becomes worth those gains.
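As a rough way to eyeball the size half of that trade-off (assuming pg_roaringbitmap's rb_build_agg aggregate), you can compare the same dense set stored as a plain integer array versus a roaring bitmap:

```sql
-- Rough, not rigorous: in-memory sizes of the same 100k dense IDs
-- as a plain integer array vs. a roaring bitmap.
SELECT pg_column_size(array_agg(i))    AS int_array_bytes,
       pg_column_size(rb_build_agg(i)) AS roaring_bitmap_bytes
FROM   generate_series(1, 100000) AS i;
```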
Most developers make this kind of judgement every day: when deciding whether to e.g. gzip a file before SCPing it somewhere; or whether to apply e.g. lz4 to the data being written to a KV store like Redis. IMHO that same intuition applies here.
Mind you, the compression in roaring bitmaps is very efficient, has extremely low start-up costs, and is effectively "random access" rather than streaming; these low costs are why roaring bitmaps are preferred over other compressed-bitmap approaches.
And also, it's not like, at small data sizes, even a "large" (hundreds/thousands of ops) CPU penalty would really be much of a penalty, either — because, with small data, you wouldn't be paying it very often! And within a "heavyweight" RDBMS like Postgres, which does a bunch of work on every request you make to it, these sorts of small CPU costs usually get lost entirely under the static overhead of query planning. (Postgres takes advantage of this property in a lot of other ways as well.)
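One way to see that in practice (reusing the hypothetical item_viewers table from the sketch above): EXPLAIN ANALYZE reports planning time and execution time separately, and for small point lookups the planning line is often the same order of magnitude as everything else combined.

```sql
EXPLAIN (ANALYZE)
SELECT rb_cardinality(viewers)
FROM   item_viewers
WHERE  item_id = 7;
-- For a lookup this small, the reported "Planning Time" tends to be in the
-- same ballpark as the "Execution Time", so a few hundred extra cycles of
-- bitmap decompression are lost in the noise.
```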
It's only when you're trying to do something at scale, but where dataset size isn't the scaling factor (example: running a core DNS server — trillions of queries, but against only a small amount of data) that a roaring bitmap would be an explicit "wrong choice" vs. a sorted-list or b-tree or hashtable-backed presence set.
Not a good idea if you're frequently sending things that are already compressed, though. Which many formats today are.
(Also, I was using `scp` as a general abstraction; personally, I'm not often pushing things around between VMs directly, but rather between VMs and object storage; where part of the question is not just about the transport encoding, but rather whether it's worth it to have the data be compressed-at-rest in the object store. Usually depends on how many times you're going to be downloading it.)
Author here, a roaring bitmap could absolutely have been used!
The concept is similar and it absolutely should have been mentioned.
> Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are),
I personally wanted to build a method that worked easily for textual content, but with that constraint (which is valid, and is what I ended up doing in practice), a roaring bitmap definitely works! A sketch of how the textual case could still fit that constraint follows below.
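The mapping table and names here are mine, not from the post, and it builds on the hypothetical item_viewers table above: assign each textual identifier a dense integer once, then set bits on those integers.

```sql
-- Hypothetical mapping from textual identifiers to dense integer IDs,
-- so the roaring-bitmap approach still applies to text keys.
CREATE TABLE user_ids (
    user_key text PRIMARY KEY,
    user_id  serial UNIQUE
);

-- Get (or create) the dense ID for a textual key, then set its bit.
WITH ids AS (
    INSERT INTO user_ids (user_key)
    VALUES ('user@example.com')
    ON CONFLICT (user_key)
        DO UPDATE SET user_key = EXCLUDED.user_key  -- no-op update so RETURNING fires on conflict
    RETURNING user_id
)
UPDATE item_viewers
SET    viewers = rb_add(viewers, (SELECT user_id FROM ids))
WHERE  item_id = 7;
```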