Hacker News new | past | comments | ask | show | jobs | submit login

The essential consideration is that roaring bitmaps are compressed — trading increased CPU overhead per read/write, for lower size-on-disk => more rows fetched per IOP + better cache-coherency. So think of the point where the CPU overhead of compression/decompression for every read/write becomes worth those gains.

Most developers make this kind of judgement every day: when deciding whether to e.g. gzip a file before SCPing it somewhere; or whether to apply e.g. lz4 to the data being written to a KV store like Redis. IMHO that same intuition applies here.

See also: the heuristic Redis uses to decide whether to back a set with a hashtable, vs. a sorted list — sits around 512 elements: https://redis.com/ebook/part-2-core-concepts/01chapter-9-red...

---

Mind you, the compression in roaring bitmaps is very efficient, has extremely low start-up costs, and is effectively "random access" rather than streaming; these low costs are why roaring bitmaps are preferred over other compressed-bitmap approaches.

And also, it not like, at small data sizes, even a "large" (hundreds/thousands of ops) CPU penalty would really be much of a penalty, either — because, with small data, you wouldn't be paying it very often! And within a "heavyweight" RDBMS like Postgres, that does a bunch of things when you make a request to it, these sorts of small CPU costs usually get lost entirely under the static overhead of query planning. (Postgres takes advantage of this property in a lot of other ways as well.)

It's only when you're trying to do something at scale, but where dataset size isn't the scaling factor (example: running a core DNS server — trillions of queries, but against only a small amount of data) that a roaring bitmap would be an explicit "wrong choice" vs. a sorted-list or b-tree or hashtable-backed presence set.




> deciding whether to e.g. gzip a file before SCPing it somewhere

Off topic, but to not waste time on such trivia in the future, add

  Compression yes
to your ~/.ssh/config.


Not a good idea if you're frequently sending things that are already compressed, though. Which many formats today are.

(Also, I was using `scp` as a general abstraction; personally, I'm not often pushing things around between VMs directly, but rather between VMs and object storage; where part of the question is not just about the transport encoding, but rather whether it's worth it to have the data be compressed-at-rest in the object store. Usually depends on how many times you're going to be downloading it.)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: