danking00's comments | Hacker News

Feather appears to just be block-compressed Arrow IPC [1]. Lightweight compression techniques generally achieve random access roughly two orders of magnitude faster than block compression. That’s one of the benefits of formats like FastLanes, Vortex, DuckDB native, etc. DuckDB has a good blog post about it here: https://duckdb.org/2022/10/28/lightweight-compression.html

[1]: https://arrow.apache.org/docs/python/feather.html


cppreference says [1]:

    A prvalue of floating-point type can be converted to a prvalue of any integer type. The
    fractional part is truncated, that is, the fractional part is discarded.

    - If the truncated value cannot fit into the destination type, the behavior is undefined (even
      when the destination type is unsigned, modulo arithmetic does not apply).
Ugh. Like, surely the C++ stdlib could provide a "give me the best int approximation" function without triggering nasal demons. Sometimes I feel like C++ is not just difficult to use but actively trying to harm me.
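
What I want is something like the sketch below: clamp to the representable range before converting, so the cast is always well-defined. (This is my own helper, not anything from the standard library; it assumes a 32-bit int and IEEE-754 doubles, and it maps NaN to 0.)

  #include <cmath>
  #include <limits>

  // Hypothetical "best int approximation": saturate out-of-range values
  // instead of invoking undefined behavior.
  int to_int_saturating(double x) {
      if (std::isnan(x)) return 0;  // arbitrary choice for NaN
      // For 32-bit int and IEEE-754 double, both bounds are exactly representable.
      constexpr double lo = static_cast<double>(std::numeric_limits<int>::min());
      constexpr double hi = static_cast<double>(std::numeric_limits<int>::max());
      if (x <= lo) return std::numeric_limits<int>::min();
      if (x >= hi) return std::numeric_limits<int>::max();
      return static_cast<int>(x);  // in range, so truncation is well-defined
  }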

[1] https://en.cppreference.com/w/cpp/language/implicit_conversi...


I think one of my favorite examples of avoiding decompression for great profit (really: health) is PLINK. If you’re not aware, it’s a tool for doing statistical genetics (things like quality control but also inferring population structure, computing relatedness between individuals, and fitting linear models). The code is mind bending to read but it’s blazing fast thanks to “compressing” genotypes into two bits (homozygous reference, heterozygous, homozygous alternate, N/A) and implementing all its operations on this 2-bit representation.
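
Roughly, the trick looks like this (a toy C++ sketch of the idea, my own illustration rather than PLINK’s actual code, which is far more heavily optimized):

  #include <bit>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Toy 2-bit genotype codes: 0 = hom ref, 1 = het, 2 = hom alt, 3 = missing.
  // 32 genotypes fit in each 64-bit word.
  struct PackedGenotypes {
      std::vector<uint64_t> words;
      void set(std::size_t i, uint8_t code) {
          std::size_t word = i / 32, shift = 2 * (i % 32);
          if (word >= words.size()) words.resize(word + 1, 0);
          words[word] |= static_cast<uint64_t>(code & 0x3) << shift;
      }
      uint8_t get(std::size_t i) const {
          return (words[i / 32] >> (2 * (i % 32))) & 0x3;
      }
  };

  // Count heterozygous calls in one word without unpacking: a field is 01
  // exactly when its low bit is set and its high bit is clear.
  int count_hets(uint64_t w) {
      uint64_t lo = w & 0x5555555555555555ull;         // low bit of each field
      uint64_t hi = (w >> 1) & 0x5555555555555555ull;  // high bit of each field
      return std::popcount(lo & ~hi);
  }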

It’s exciting to see more work on compressed arrays make it out into the world. It would have been nice if a lot of PLINK’s bitfiddling functionality was in an accessible native library rather than wrapped up with the rest of that tool.

https://www.cog-genomics.org/plink/2.0/


Thanks for introducing me to these other formats! I hadn't heard of them yet. All three of fst, JDF, and Vortex appear to share the goal of high-throughput (de)serialization of tabular data and random access to the data. However, it is not clear to me how JDF and fst permit random access on compressed data, because both appear to use block compression (Blosc and LZ4/Zstd, respectively). While both Blosc and Zstd are extremely fast, accessing a single value of a single row necessarily requires decompressing a whole block of data. Instead of O(1) random access you get O(N_ROWS_PER_BLOCK) random access.

In Vortex, we've specifically invested in high throughput compression techniques that admit O(1) random access. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.

[1] https://duckdb.org/2022/10/28/lightweight-compression.html
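
As a concrete illustration of the idea (a toy frame-of-reference plus bit-packing sketch, not Vortex’s actual encodings): each value is stored as (value - reference) in a fixed number of bits, so element i can be decoded straight from bit offset i * bit_width.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct ForPacked {
      uint32_t reference;            // minimum value of the block
      uint32_t bit_width;            // bits per packed delta (1..32 here)
      std::vector<uint64_t> words;   // bit-packed deltas
  };

  // O(1) random access: element i lives at a fixed bit offset, so decoding it
  // touches at most two 64-bit words -- no block decompression required.
  uint32_t get(const ForPacked& col, std::size_t i) {
      uint64_t bit_pos = static_cast<uint64_t>(i) * col.bit_width;
      std::size_t word = bit_pos / 64, shift = bit_pos % 64;
      uint64_t bits = col.words[word] >> shift;
      if (shift + col.bit_width > 64)
          bits |= col.words[word + 1] << (64 - shift);
      return col.reference + static_cast<uint32_t>(bits & ((1ull << col.bit_width) - 1));
  }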


This paper compares the benefits of lightweight compression and other techniques:

https://blog.acolyer.org/2018/09/26/the-design-and-implement...


I see. Very nice. So it's a trade-off. I imagine the throughput of these lightweight compression schemes suffers a little. In analytical workloads, it's common to do things like compute the mean of a vector or the gradient for a batch of data, so random access appears to be less of an issue here.


We’ll post a blog post soon with specific, benchmarked numbers, but, in this case, you can have your cake and eat it too!

The compression and decompression throughputs of Vortex (and other lightweight compression schemes) are similar to or better than Parquet’s for many common datasets. Unlike Zstd or Blosc, the lightweight encodings are generally both computationally simple and SIMD-friendly. We’re seeing multiple gibibytes per second on an M2 MacBook Pro on various datasets in the PBI benchmark [1].

The key insight is that most data we all work with has common patterns that don’t require sophisticated, heavyweight compression algorithms. Let’s take advantage of that fact to free up more cycles for compute kernels!

[1] https://github.com/cwida/public_bi_benchmark


Cool, looking forward to it.


Yeah, we should be clearer in our description about how our footers differ from Parquet’s. Parquet is a bit more prescriptive; for example, it requires row groups, which Vortex does not. If you have a column with huge values and another column of 8-bit ints, they can be paged separately, if you like.


Yeah, we’re on the same page (heh). We don’t want the format to require row grouping. The file format has a layout schema written in a footer. A row-group-style layout is supported but not required. The specification of the layout will probably evolve, but currently the in-memory structure becomes the on-disk structure. So, if you have a ChunkedArray of StructArray of ChunkedArray, you’ll get row groups with pages within them. If you have a StructArray of ChunkedArray, you’ll just get per-column pages.

I’m working on the Python API now. I think we probably want the user to specify, on write, whether they want row groups or not and then we can enforce that as we write.


See also one of the author's twitter threads: https://x.com/cureffi/status/1806404950055862547


Unless I’m misreading, the first two references support the idea that vacancy rates are too low but don’t comment on why.

The last reference indeed argues in favor of too much financialization of housing units but that blog is also tripping a lot of my crackpot alarms.

Can you succinctly explain why you believe the low vacancy rates in major metros (which I think we agree is the cause of high rents / purchase costs) are caused primarily by units which are intentionally held empty despite demand?

In partial defense of your assertion, the FRED data does show that 1/4 to 1/3 of vacant units are for sale or rent at any given time.

https://fred.stlouisfed.org/graph/?g=1oeIe

Total vacancy rate seems to hover around 10%? Rentable and buyable unit rates are an order of magnitude lower.

I’m not convinced that the Fed owning a bunch of mortgages is evidence that private companies bought homes and aren’t renting them. Wasn’t that a bailout to prevent people from losing their homes? (Because the companies owning the homes weren’t solvent, and I guess if the company fails maybe you get foreclosed; I’m not sure why we did things the way we did in ’08.)

Could not the explanation also be that a lot of homes are in places that lack demand and therefore the owners don’t bother putting them up for sale or rent?

https://fred.stlouisfed.org/graph/?g=1oeII


If I’m following, this presents an alternative format and implementation to LEB128 which encodes and decodes substantially faster. Notably, the implementation is quite simple. Cool! And agreed that modern CPUs really suffer from branches.

Should I interpret the plot to mean the average elapsed wall-clock time per integer decoded/encoded? And can I conclude the throughput is the reciprocal? So about 100,000 integers per second, or around 1 MB/s of decompressed data.

I know this is a bit unfair because the implementation is much more complex, but my first thought is why I would use vu128 instead of Lemire’s Stream VByte: https://arxiv.org/abs/1709.08990

A slight tangent, but I stumbled on this library, which stores each float XOR’ed with the previous float in the stream (https://github.com/velvia/compressed-vec). It seems really clever to me! They reference “Gorilla: A Fast, Scalable, In-Memory Time Series Database”, which in turn references two 2006 papers: “Fast Lossless Compression of Scientific Floating-Point Data” and “Fast and Efficient Compression of Floating-Point Data”. Frustratingly, the Facebook paper doesn’t benchmark its XOR-based floating-point encoding, but the earlier two papers do.
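
The core of the XOR trick is tiny; here’s a toy sketch (my own illustration, not the compressed-vec or Gorilla implementation). XOR each double’s bits with its predecessor’s; slowly changing series produce words full of zero bits that then bit-pack well.

  #include <cstdint>
  #include <cstring>
  #include <vector>

  std::vector<uint64_t> xor_deltas(const std::vector<double>& values) {
      std::vector<uint64_t> out;
      out.reserve(values.size());
      uint64_t prev = 0;  // the first value is effectively stored as-is
      for (double v : values) {
          uint64_t bits;
          std::memcpy(&bits, &v, sizeof bits);  // reinterpret the double's bits without UB
          out.push_back(bits ^ prev);
          prev = bits;
      }
      // A real codec would now encode each word using its leading/trailing zero
      // counts (as Gorilla does) rather than storing full 64-bit words.
      return out;
  }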


The plot is the time to encode/decode a list of 10,000 random integers drawn from a uniform distribution. The throughput is 2-10 GiB/s[0] on my desktop, with portable scalar operations (no SIMD). Distributions skewed toward small values are slightly faster[1].

I believe a decoder that used SIMD would be even faster, since the format is friendly to the BMI instructions found on newer CPUs. You could also store the data as concat(prefix_bytes) + concat(payloads), which would really get things going.

For scientific data with lots of multi-byte f64 values, the single branch for < 0xF0 might be unnecessary and undesirable. There are other formats that can be decoded by completely branchless SIMD, at the cost of larger sizes for very small values. For example, Google has a format that uses the first byte as a bitmask: https://static.googleusercontent.com/media/research.google.c...

[0] vu128 is less affected by value size than VLQ/LEB128, so on fast hardware it decodes 10,000 8-bit values in about the same time as 10,000 64-bit values.

[1] In the discussion on lobste.rs[2], a user wondered what the results would look like with a distribution containing lots of small values. A Zipf distribution's results look like <https://i.imgur.com/WGFtz6f.png>.

[2] https://lobste.rs/s/qvoe7a/vu128_efficient_variable_length#c...


Thanks for the reply and clarification! That’s an impressive throughput. Am I interpreting the behavior correctly in thinking that your code has one branch per int, and that branch will be predicted well when your integers are all of similar magnitude?

I never fully grokked the Zipf distribution, but am I correct to conclude the worst case for vu128 would be alternating integers on either side of the branch condition?

Thanks for the link to the slides, also clever!


  > Am I interpreting the behavior correctly in thinking that your code has
  > one branch per int, and that branch will be predicted well when your
  > integers are all of similar magnitude?
Yep, that's generally correct. The branch seems to be harmless either way -- interleaving values on either side of it doesn't seem to affect throughput much on my machine (results may vary).

  > I never fully grokked the Zipf distribution, but am I correct to conclude
  > the worst case for vu128 would be alternating integers on either side of
  > the branch condition?
It depends on whether you're measuring throughput by values/sec or bytes/sec.

In terms of bytes/sec, the worst case is a set of integers exclusively in the range [0xF0, 0xFF] -- big enough to skip the fast path, small enough that each byte has to go through the full bit-masking.

In terms of values/sec, the worst is really big integers because writing 8 bytes (or 16, for `encode_u128`) to a buffer takes longer than writing one byte.

However, regardless of measurement, the worst case for vu128 is faster than VLQ/LEB128 -- except for distributions containing only [0x00, 0x7F] in which case they're all equivalent.

---

Also, after some comments on Lobsters about cases where space efficiency in the u16 range really matters, I'm experimenting with a version that tries to match or exceed VLQ/LEB128 for size efficiency at all bit lengths.

It has three more branches, but in practice this hasn't turned out to matter much -- it seems that indefinite loops are slower than a fixed set of branches, so it's still significantly faster than VLQ/LEB128 on most sets of integers.

  // like LEB128 but continuation bits are packed into the
  // leading byte -- sometimes called "prefixed varint"
  
  0xxxxxxx                            //  7 bits of payload
  10xxxxxx xxxxxxxx                   // 14 bits of payload
  110xxxxx xxxxxxxx xxxxxxxx          // 21 bits of payload
  1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // 28 bits of payload

  // For payloads > 28 bits, use (0xF0 | length) type prefix,
  // same as original post.
  1111____  xxxxxxxx xxxxxxxx [...]
It seems to be more fussy about compiler optimizations, though: https://github.com/rust-lang/rust/issues/125543
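
For concreteness, encoding the small-payload cases of that layout looks roughly like this (an illustrative C++ sketch of the format described above, not the actual Rust implementation; the payload byte order is my choice):

  #include <cstddef>
  #include <cstdint>

  // Encode v < 2^28 using the prefixed-varint layout above (the high payload
  // bits share the leading byte); returns the number of bytes written.
  std::size_t encode_prefixed(uint32_t v, uint8_t* buf) {
      if (v < (1u << 7)) {                  // 0xxxxxxx
          buf[0] = static_cast<uint8_t>(v);
          return 1;
      } else if (v < (1u << 14)) {          // 10xxxxxx xxxxxxxx
          buf[0] = 0x80 | static_cast<uint8_t>(v >> 8);
          buf[1] = static_cast<uint8_t>(v);
          return 2;
      } else if (v < (1u << 21)) {          // 110xxxxx xxxxxxxx xxxxxxxx
          buf[0] = 0xC0 | static_cast<uint8_t>(v >> 16);
          buf[1] = static_cast<uint8_t>(v >> 8);
          buf[2] = static_cast<uint8_t>(v);
          return 3;
      } else {                              // 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
          buf[0] = 0xE0 | static_cast<uint8_t>(v >> 24);
          buf[1] = static_cast<uint8_t>(v >> 16);
          buf[2] = static_cast<uint8_t>(v >> 8);
          buf[3] = static_cast<uint8_t>(v);
          return 4;
      }
  }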


I think it might help readers to include a narrative about an example application. Perhaps I’m in the minority but I tend to think of Bloom filters as a way to reliably know something isn’t in a set (e.g. so as to not run an expensive disk read). This data structure seems to view them the dual way: “this is maybe the right value for this key”.

I’ve seen that view work for visualizations like approximate CDFs and medians where I have some statement like “with probability p, the value differs from truth by less than e”. Is this data structure used in a similar way? My instinct is that visualizations having a low rate of being wrong is OK because the human will follow up that visualization with more tests. In the end you have lots of evidence supporting the conclusion.
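
For concreteness, the pattern I have in mind is something like this (a toy Bloom filter of my own, not the B-field API; all names are made up):

  #include <bitset>
  #include <cstddef>
  #include <functional>
  #include <string>

  // Toy Bloom filter: k = 2 hash functions over a fixed-size bit array.
  // "Definitely absent" answers are exact; "maybe present" can be wrong.
  class TinyBloom {
      std::bitset<1 << 16> bits_;
      static std::size_t h(const std::string& key, char seed) {
          return std::hash<std::string>{}(key + seed) % (1 << 16);
      }
  public:
      void insert(const std::string& key) {
          bits_.set(h(key, 1));
          bits_.set(h(key, 2));
      }
      bool maybe_contains(const std::string& key) const {
          return bits_.test(h(key, 1)) && bits_.test(h(key, 2));
      }
  };

  // Usage: skip the expensive disk read when the answer is "definitely absent".
  //   if (!filter.maybe_contains(key)) return not_found;  // no I/O needed
  //   return read_from_disk(key);                         // might still be a miss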


Ah, we need to clarify the language! The B-field will always return the correct value for an inserted key.

False positives are only returned for keys that have not been inserted. This is akin to a Bloom filter falsely reporting that a key is in the set.


I second that: stating "The B-field will always return the correct or Indeterminate value for an inserted key" before listing the classes of errors would clarify it a lot.

