If the goal is to count distinct users who have seen something, but also to efficiently retrieve the actual set of users who have seen something — then why wasn't a roaring bitmap (https://pgxn.org/dist/pg_roaringbitmap/) considered as an option? Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap should do everything hstore does, just a lot more efficiently.
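For instance, a minimal sketch of what that could look like with pg_roaringbitmap (the rb_* functions are from the extension's documented API; the table and column names here are just illustrative):

```sql
-- Assumes the pg_roaringbitmap extension is installed; names are illustrative.
CREATE EXTENSION IF NOT EXISTS roaringbitmap;

CREATE TABLE item_viewers (
    item_id bigint PRIMARY KEY,
    viewers roaringbitmap NOT NULL
);

-- Record that user 42 has seen item 7: create the row, or set the bit on conflict.
INSERT INTO item_viewers (item_id, viewers)
VALUES (7, rb_build(ARRAY[42]))
ON CONFLICT (item_id)
DO UPDATE SET viewers = rb_add(item_viewers.viewers, 42);

-- Distinct-viewer count plus the actual set of user IDs, in one cheap lookup.
SELECT rb_cardinality(viewers) AS seen_count,
       rb_to_array(viewers)    AS seen_user_ids
FROM   item_viewers
WHERE  item_id = 7;
```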
I just wanted to say thanks for linking. I feel like I'm always on the search for new PG storage extensions, even if I'm not able to use them everywhere (sad to see AWS RDS wasn't in the list of vendors that include/support it).
The docs/paper for roaring bitmaps say they're meant for large datasets, but I must have missed where "large" is defined. In practice, how large does a set have to be to take advantage of roaring bitmaps?
The essential consideration is that roaring bitmaps are compressed — trading increased CPU overhead per read/write, for lower size-on-disk => more rows fetched per IOP + better cache-coherency. So think of the point where the CPU overhead of compression/decompression for every read/write becomes worth those gains.
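As a rough way to eyeball the size half of that trade-off (assuming pg_roaringbitmap's rb_build_agg aggregate), you can compare the same dense set stored as a plain integer array versus a roaring bitmap:

```sql
-- Rough, not rigorous: in-memory sizes of the same 100k dense IDs
-- as a plain integer array vs. a roaring bitmap.
SELECT pg_column_size(array_agg(i))    AS int_array_bytes,
       pg_column_size(rb_build_agg(i)) AS roaring_bitmap_bytes
FROM   generate_series(1, 100000) AS i;
```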
Most developers make this kind of judgement every day: when deciding whether to e.g. gzip a file before SCPing it somewhere; or whether to apply e.g. lz4 to the data being written to a KV store like Redis. IMHO that same intuition applies here.
Mind you, the compression in roaring bitmaps is very efficient, has extremely low start-up costs, and is effectively "random access" rather than streaming; these low costs are why roaring bitmaps are preferred over other compressed-bitmap approaches.
And also, it's not like, at small data sizes, even a "large" (hundreds/thousands of ops) CPU penalty would really be much of a penalty, either — because, with small data, you wouldn't be paying it very often! And within a "heavyweight" RDBMS like Postgres, which does a bunch of work on every request you make to it, these sorts of small CPU costs usually get lost entirely under the static overhead of query planning. (Postgres takes advantage of this property in a lot of other ways as well.)
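One way to see that in practice (reusing the hypothetical item_viewers table from the sketch above): EXPLAIN ANALYZE reports planning time and execution time separately, and for small point lookups the planning line is often the same order of magnitude as everything else combined.

```sql
EXPLAIN (ANALYZE)
SELECT rb_cardinality(viewers)
FROM   item_viewers
WHERE  item_id = 7;
-- For a lookup this small, the reported "Planning Time" tends to be in the
-- same ballpark as the "Execution Time", so a few hundred extra cycles of
-- bitmap decompression are lost in the noise.
```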
It's only when you're trying to do something at scale, but where dataset size isn't the scaling factor (example: running a core DNS server — trillions of queries, but against only a small amount of data) that a roaring bitmap would be an explicit "wrong choice" vs. a sorted-list or b-tree or hashtable-backed presence set.
Not a good idea if you're frequently sending things that are already compressed, though. Which many formats today are.
(Also, I was using `scp` as a general abstraction; personally, I'm not often pushing things around between VMs directly, but rather between VMs and object storage; where part of the question is not just about the transport encoding, but rather whether it's worth it to have the data be compressed-at-rest in the object store. Usually depends on how many times you're going to be downloading it.)
Author here, a roaring bitmap could absolutely have been used!
The concept is similar and it absolutely should have been mentioned.
> Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are),
I personally wanted to build a method that worked easily for textual content, but with that constraint (which is valid, and is what I ended up doing in practice), a roaring bitmap definitely works! A sketch of how the textual case could still fit that constraint follows below.
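The mapping table and names here are mine, not from the post, and it builds on the hypothetical item_viewers table above: assign each textual identifier a dense integer once, then set bits on those integers.

```sql
-- Hypothetical mapping from textual identifiers to dense integer IDs,
-- so the roaring-bitmap approach still applies to text keys.
CREATE TABLE user_ids (
    user_key text PRIMARY KEY,
    user_id  serial UNIQUE
);

-- Get (or create) the dense ID for a textual key, then set its bit.
WITH ids AS (
    INSERT INTO user_ids (user_key)
    VALUES ('user@example.com')
    ON CONFLICT (user_key)
        DO UPDATE SET user_key = EXCLUDED.user_key  -- no-op update so RETURNING fires on conflict
    RETURNING user_id
)
UPDATE item_viewers
SET    viewers = rb_add(viewers, (SELECT user_id FROM ids))
WHERE  item_id = 7;
```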