New book-sorting algorithm almost reaches perfection (quantamagazine.org)
249 points by isaacfrond 2 days ago | 53 comments





"Kuszmaul remembers being surprised that a tool normally used to ensure privacy could confer other benefits."

It was surprising to me too! But reflecting on it more closely, most performance isn't about "faster" in a literal sense of "more instructions run per time", but about carefully choosing how to do less work. The security property here being "history independence" is also in a way stating "we don't need to, and literally cannot, do any work that tracks history".

It's definitely an interesting approach to performance, essentially using cryptography as a constraint to prevent extra work. What properties do we need, and what properties can we ignore? The question becomes: if we MUST ignore this property cryptographically, how does that affect the process and the related performance?

It certainly feels like it may be a useful perspective, a rigorous approach to performance that may be a path to more improvements in key cases.


I don't think what you're saying is accurate. Your statement would be correct if the measure of the algorithm's slowness were how much compute time it takes.

But the measurement they're actually using is how many books need to be moved. They're free to use infinite compute time AIUI.


I think the point is that having less information available can improve performance because it eliminates processing of extraneous data.

Said another way, stateless approaches are simpler than stateful. There can be an instinct to optimize by using more information, but in this case at least, that was a red herring and adding the constraint improved the results.


If there's no cap on computation time, processing extraneous data is free.

Also if your algorithm is perfect and will never be misled by extraneous info.

That's a good insight. I had always thought the key to good algorithm / data structure design was to use all the information present in the data set. For example, if you know a list is sorted, you can use binary search. But perhaps choosing how much of it to omit is key, too. It comes up less often, however. I can't think of a simple example.

> But perhaps choosing how much of it to omit is key, too. It comes up less often, however. I can't think of a simple example.

An example of that is a linked list with no particular sort order. By not caring about maintaining an order, the insertion appends or prepends a node and is O(1).

As soon as you have to maintain any additional invariant, the insertion cost goes up. Either directly or amortized.
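A quick Python sketch of the contrast (illustrative only; the helper names are made up):

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def prepend(head, value):
        # No order invariant to maintain: O(1) regardless of list length.
        return Node(value, head)

    def insert_sorted(head, value):
        # Maintaining sorted order forces a walk to the insertion point: O(n) worst case.
        if head is None or value <= head.value:
            return Node(value, head)
        cur = head
        while cur.next is not None and cur.next.value < value:
            cur = cur.next
        cur.next = Node(value, cur.next)
        return head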


A real world analogue is detective stories that work by adding irrelevant details and it’s up to the reader to filter down to the truly important details to solve the case.

To generalize it, if figuring out what information is important is more expensive than assuming none of it is, better to simplify.


So it's basically a matter of figuring out what problem context can and should be selectively hidden from the algorithm in order to make it work 'smarter' and not 'harder.' Weird.

The actual better algorithm does use history dependence though. So I found this part of the article misleading.

Am I the only one who went looking for the main papers described in the article, both the one introducing the original problem and the near-optimal algorithm paper [1], [2]?

Apparently both of them are linked deep inside the article. Perhaps Quanta could make it mandatory to list all the references towards the end of the article; it would be very helpful for readers.

[1] Nearly Optimal List Labeling:

https://arxiv.org/abs/2405.00807

[2] A sparse table implementation of priority queues:

https://link.springer.com/chapter/10.1007/3-540-10843-2_34


Both papers are linked very clearly in the article, and it was easy to find them just by skimming, without reading in depth:

> This problem was introduced in a 1981 paper

where "1981 paper" links to https://link.springer.com/chapter/10.1007/3-540-10843-2_34

and in the next paragraph:

> Last year, in a study that was presented at the Foundations of Computer Science conference in Chicago, a team of seven researchers

where "a study" links to https://arxiv.org/abs/2405.00807

Note that both of these are in the introductory section (third and fourth paragraphs), before the article goes into details, history and context. If this is what counts as “Apparently both of them are linked deep inside the article”, then it appears there exist very different ideas of what “deep inside” means.


I was actually just looking at this problem last week, when I realized I needed to keep items in a database table positioned arbitrarily, ideally without needing to manipulate the rest of the list. So if a user adds in a new element after element 5, that element becomes 6, without needing to update the original item that came after element 5. There are indeed very sophisticated algorithms to manage this problem and minimize theoretical bounds. But it also seemed like the simplest solution for this particular version of the problem is to just use fractional amounts and occasionally pay a penalty to relabel the list.

Wikipedia has a section on this algorithm under exponential labels: https://en.m.wikipedia.org/wiki/List-labeling_problem

Basically it works as long as your label space is large compared to the number of items. The more sophisticated methods are necessary when that isn’t the case. For example, say you have 4 bytes for the label and 1 billion items.
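A rough Python sketch of that simple approach (the gap size, the names, and the relabel trigger are all made up for illustration; this is the naive scheme, not anything from the paper):

    GAP = 1 << 20  # initial spacing between labels

    def relabel(items):
        # Spread the existing (label, value) pairs evenly across the label space again.
        return [(i * GAP, value) for i, (_, value) in enumerate(items, start=1)]

    def insert_after(items, index, value):
        # Insert `value` after position `index` (index == -1 means "at the front").
        lo = items[index][0] if index >= 0 else 0
        hi = items[index + 1][0] if index + 1 < len(items) else lo + 2 * GAP
        if hi - lo < 2:              # no room between the neighbours: pay the relabel penalty
            items = relabel(items)
            lo = items[index][0] if index >= 0 else 0
            hi = items[index + 1][0] if index + 1 < len(items) else lo + 2 * GAP
        items.insert(index + 1, ((lo + hi) // 2, value))
        return items

    shelf = []
    shelf = insert_after(shelf, -1, "austen")    # at the front
    shelf = insert_after(shelf, 0, "tolstoy")    # after "austen"
    shelf = insert_after(shelf, 0, "melville")   # between the two, no other labels touched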


This sort of problem also occurs when you're trying to do CRDTs, which can roughly be described also as "design something that does Git better."

So e.g. to frame this, one approach to a CRDT is to just treat the document as a list of facts, "line 1 is 'foo', line 2 is 'bar'", and each fact has a number of times it has been asserted, and to "merge" you just add together the assertion counts, and then you can detect conflicts when a fact has been asserted more than once or fewer than zero times. So a patch says "change line 2 to 'baz'", this becomes "unassert that line 2 is 'bar', assert that line 2 is 'baz'" and it conflicts with a patch that says "change line 2 to 'quux'" because the fact "line 2 is 'bar'" has an assertion count of -1.

But anyway, in this context you might want to allow inserting lines, and then you have the list-labeling problem, you don't want the patch to unassert lines 4,5,6 just to insert a new line after line 3. So then an obvious thing is to just use a broader conception of numbers, say "line 3.5 is <X>" when you insert, and then we hide the line numbers from the user anyways, they don't need to know that internally the line numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".

So then you need a relabeling step because you eventually have some line at 3.198246315 and you want to be able to say "yeah, that's actually line 27, let's have some sanity again in this thing."

This also maybe hints at the fun of adding randomization, consider that one person might add line 3.5, then add line 3.75, and then remove line 3.5; meanwhile someone else might add a different line 3.5, add a line 3.25, and then remove their line 3.5, and these patches would both amount to "assert line 3.25 is A, assert line 3.75 is B", and would merge without conflict. This means that in general if two people are messing with the same part of the same document asynchronously, this model is not able to consistently flag a merge failure in that case, but will sometimes instead just randomly order the lines that were added.

We can then just make that a feature rather than a fault: you don't insert at 3.5, which is 3 + (4 - 3) / 2, rather you insert at 3 + (4 - 3) * rand(). And then when two people both try to insert 12 lines between line 3 and 4 independently, when you merge them together, you get 24 lines from both, in their original orders but interleaved randomly, and like that's not the end of the world, it might be legitimately better than just declaring a merge failure and harrumphing at the user.
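A tiny Python sketch of that last idea, just to make it concrete (purely illustrative; a real CRDT would use structured identifiers rather than floats, which run out of precision):

    import random

    def position_between(prev_pos, next_pos):
        # Pick a random position between the two neighbours instead of the midpoint,
        # so concurrent inserts from different replicas interleave rather than collide.
        return prev_pos + (next_pos - prev_pos) * random.random()

    def merge(lines_a, lines_b):
        # Union the (position, text) facts from two replicas and order by position.
        return sorted(set(lines_a) | set(lines_b))

    doc = [(1.0, "foo"), (2.0, "bar"), (3.0, "baz"), (4.0, "quux")]
    a = doc + [(position_between(3.0, 4.0), "from A")]
    b = doc + [(position_between(3.0, 4.0), "from B")]
    print(merge(a, b))  # both inserted lines survive, interleaved in a random order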


> This sort of problem also occurs when you're trying to do CRDTs, which can roughly be described also as "design something that does Git better."

Aren't the goals of git and CRDTs different? With git you want the merged result to be semantically correct. With CRDTs you want to achieve convergence (so no merge conflicts); as far as I know, semantically correct convergence (not sure what the correct term is) is not really possible, though, as it is too difficult to encode for CRDTs. Isn't that why CRDTs are mostly used for multiplayer interactive applications where these kinds of mismatches are quickly seen?


The technically correct term -- at least in reduction systems -- would be confluence not convergence.

Ha, I had this exact problem asked as an interview question.

IIRC their real life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re-index when needed. Probably works well enough. What I came up with is (as you say) the idea of fractional indexing, but dealing with decimals is a pain, so you can instead represent them as vectors, which you can in turn represent as strings of digits that you sort lexicographically.

So an element inserted between elements 1 and 2 gets an index 11 (anything between 11-19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12-19). Note that these indexes are not numbers, they’re strings that are compared lexicographically.

I’m sure there are downsides, eg it takes a lot more memory to sort these indexes (strings are much larger than numbers). It feels a bit too smart not to have unexpected downsides.
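Something like this, in Python (a rough sketch of the string-midpoint idea; it reads keys as decimal fractions, assumes keys start with a nonzero digit as in the 1-9 scheme, and picks different concrete keys than my examples above, but with the same ordering property):

    def key_between(a, b):
        # Return a digit string strictly between a and b in lexicographic order,
        # treating key k as the fraction 0.k and assuming a < b.
        n = max(len(a), len(b)) + 1          # one extra digit guarantees room in between
        lo = int(a.ljust(n, "0"))
        hi = int(b.ljust(n, "0"))
        mid = (lo + hi) // 2                 # strictly between lo and hi
        if mid % 10 == 0:
            mid += 1                         # avoid a trailing '0' so the new key can be split again later
        return str(mid)

    print(key_between("1", "2"))    # "15":  "1" < "15" < "2"
    print(key_between("1", "11"))   # "105": "1" < "105" < "11"
    print(key_between("11", "2"))   # "155": "11" < "155" < "2"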


> IIRC their real life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re-index when needed. Probably works well enough. What I came up with is (as you say) the idea of fractional indexing

Those are the same thing. Leaving gaps is fractional indexing. It's just fixed-point rather than floating point.

> an element inserted between elements 1 and 2 gets an index 11 (anything between 11-19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12-19). Note that these indexes are not numbers, they’re strings that are compared lexicographically.

This reminds me of the most interesting method of generating random integers in an arbitrary range from random bits: interpret the bitstream as a string representing a real number (in binary) between 0 and 1. If, for example, you want to use bits to generate a number between 0 and 5, you divide the unit interval into sixths, and examine bits until you've determined conclusively that your number lies within one of those intervals:

     +---- 0 = 0.000000000... ---+
     |         0.000 -> 0        |
     |         0.00100 -> 0      |
     +-- 1/6 = 0.001010101...  --+
     |         0.0011 -> 1       |
     |         0.0100 -> 1       |
     +-- 2/6 = 0.010101010...  --+
     |         0.011 -> 2        |
     +-- 3/6 = 0.100000000...  --+
     |         0.100 -> 3        |
     +-- 4/6 = 0.101010101...  --+
     |         0.1011 -> 4       |
     |         0.1100 -> 4       |
     +-- 5/6 = 0.110101010...  --+
     |         0.11011 -> 5      |
     |         0.111 -> 5        |
     +---------------------------+
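A small Python version of that procedure, if anyone wants to play with it (exact arithmetic via Fraction; the function name and the default bit source are just illustrative):

    import math
    import random
    from fractions import Fraction

    def uniform_int(n, next_bit=lambda: random.getrandbits(1)):
        # Read random bits as the binary expansion of a real x in [0, 1) and stop
        # once the bits seen so far pin x inside a single one of the n sub-intervals.
        lo, width = Fraction(0), Fraction(1)   # x is known to lie in [lo, lo + width)
        while True:
            k = math.floor(lo * n)             # sub-interval containing the left endpoint
            if (lo + width) * n <= k + 1:      # the whole interval fits inside [k/n, (k+1)/n)
                return k
            width /= 2                         # consume one more bit of x
            if next_bit():
                lo += width

    print([uniform_int(6) for _ in range(10)])  # uniform draws from 0..5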

> solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2)

Reminds me of my days writing BASIC programs.


Very similar to my first intuition on how to approach this problem. An interesting question that comes to mind is how you should pick the index spacing and how often you should rebalance the whole thing. Make the spacing too large and you're wasting a lot of label space you'll never need; too small and you're forced to rebalance too often. I'm guessing an ideal spacing is such that you can rebalance at most once a night with a cron job and then avoid rebalances for the whole rest of the day.

To be fair, this sounds like one of those classic problems that someone for sure already figured out in the 50s or 60s, just under a different name and/or context. Hash chaining is a similar idea, but not quite the same.


The "real life" solution of leaving gaps & reindexing is actually my earliest programming memory (as a teenager)/lesson in cleverness, of feeling like I should have been able to come up with a more clever/optimal/~~scalable~~ solution but settling for this awkward 100 200 300 thing. Nowadays I've used that approach like...dozens of times over the decades of real world projects since, and I'm still maintaining that web app I made in 2003, so I'm very glad I never came up with something unduly clever.

That is how I implemented it at work around 9 years ago. Use strings for ordering and if you use the full byte range they end up fairly compact (rather than just 1-9 as in your example). There are some edge cases that can cause it to balloon in size so there is a separate reordering step but it almost never needs to be invoked.

Theoretically, using fractions as list labels requires unbounded amounts of memory to store the fractions; in practice memory is extremely limited. The difference really becomes a problem if you are not just assigning ordered labels to a collection but want to use these labels directly as array indices to store the elements. That would be a more literal modelling of the library-sorting problem.

like line numbers in an old BASIC program

Isn't that hash table chaining?

More like tree rebalancing.

A linked list? I guess you are forgetting to mention other requirements.

I recall presenting a problem to my students a few years ago based on the 'Library Sort' algorithm. I still clearly remember the title of the original paper: 'Insertion Sort is O(n log n)'.

Presumably this: https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaMo06-l...

Kind of a click-baity title.


It's sort of click bait, but also true, right? This is just insertion sort using non-fixed indexing?

This is a different problem though, despite the similar name.

Is there any reason to think that this algorithm will actually be faster than what is currently done in practice?

For arrays in B-tree nodes, which is the main place where I have encountered this problem, I doubt it will be faster than just using `memmove()`, and for truly large arrays, it would be easier to use a B tree.

If that is the case, this joins a number of algorithms that are asymptotically faster than algorithms in use and paradoxically slower than them. Examples of such algorithms include all of the fast matrix multiplication algorithms, which are slower than good implementations of the textbook O(n^3) algorithm (GEMM).


These are sometimes called Galactic Algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm

The very first example on the page has a pretty good quote describing their usefulness:

>>> An example of a galactic algorithm is the fastest known way to multiply two numbers, which is based on a 1729-dimensional Fourier transform. It needs O(n log n) bit operations, but as the constants hidden by the big O notation are large, it is never used in practice. However, it also shows why galactic algorithms may still be useful. The authors state: "we are hopeful that with further refinements, the algorithm might become practical for numbers with merely billions or trillions of digits."


One interesting thing about working on numbers so large is that it becomes harder to assume that memory access is constant time. It's already not that easy now (cache miss vs hit, and external memory), but when you operate on data sizes that large, memory accesses matter by at least the square or cube root of N. If you require largely sequential access it is possible to outperform an algorithm that requires conditional random access by some function of those factors, which can make some algorithms look much worse or better compared to the classical analysis. Of course, it still does not matter whether your algorithm is going to finish in a couple of trillion years or a trillion trillion years.

> lowering the upper bound to (log n) times (log log n)^3 — equivalent to (log n)^(1.000...1)

This is true! One of the things I find cool about big-O complexity using polynomial reference classes is that logarithms give you an infinitesimal value. Take that, "infinitesimals don't really exist" people!


Wait. What? Do you have a reference where I can learn about this?

I don't really understand what you're asking for. You can easily just verify the claim for yourself.

The limit as x goes to infinity of (ln x) / (x^k) is zero for all positive k. So if you want to approximate the behavior of a logarithm with a function of the form f(x) = x^k, k must be greater than 0 [because x^0 = o(ln x)], but less than every positive real number. This is the definition of an infinitesimal.

That's why it makes sense to say that x (log x)^3 is equivalent to x^{1.000000...[infinite 0s]...1}. Though in this analogy, it's more like x^{1.000000...3}.
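In symbols, the same claim written as limits:

    \lim_{x \to \infty} \frac{\ln x}{x^k} = 0 \quad \text{for every } k > 0,
    \qquad
    \lim_{x \to \infty} \frac{x^0}{\ln x} = \lim_{x \to \infty} \frac{1}{\ln x} = 0.

So ln x sits strictly between x^0 and x^k for every positive k, which is exactly the "greater than 0 but smaller than every positive real" exponent.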


>> Do you have a reference where I can learn about this?

> I don't really understand what you're asking for.

That comes across as rude and condescending; please don't communicate like that. Your English skills seem OK but in case you still need help, they're asking for a "reference" -- that's an online source of some sort probably, where they can "learn about" it -- that means that the reference would contain content helping them to understand the topic better. For example, the helpful content you gave after your rude and condescending initial comment would be an example of such content.


OK. What reference would you point them to? What makes you think it's what they want?

Believing things that you could easily reason through yourself only when you are presented with some external authority really is a cognitive antipattern, though.

I’d say that prologue was more medicinal than condescending. It’s good to remind people in the age of LLMs and ubiquitous search that “you can just think things.”


I was shocked to discover how the British Library (millions of books, many many new ones every week) manages this. The first book to arrive earlier this year went on the shelf at 2025.0000001. The next went at 2025.0000002, right next to it. The electronic catalogue does the rest. No need to re-shuffle books around, but not a solution supportive of book-browsing.

Reminds me of how Amazon doesn't arrange its items by similarity like a store does, so a model of vacuum cleaner might be next to a set of kitchen plates. They actually intentionally avoid similarities so pickers won't accidentally grab a similar but wrong thing.

I lose track of so many infrequently used objects in my home -- which storage bin in which closet did I put the x-acto blade refills in? -- and one bin will overflow while another is half-empty because I try to keep similar items together. Sometimes I fantasize about tracking every possession in a spreadsheet that points to which bin, so I'd never lose anything and could always use storage space with maximum efficiency. But then I know I'd be lazy and skip updating the spreadsheet when I put something new away... plus it just seems so inhumanly weird, like something a robot would do rather than a person.


I’m trying to figure out a key constraint. Does the problem statement assume a fixed-length pre-allocated array?

No, it doesn’t assume an array at all. It’s a data structure that maintains a totally ordered set with 3 operations:

insert(X), delete(X), label(X)

Where label extracts the label of element X (which had previously been inserted but not deleted). The label is a number from 0 to n-1, where n is the number of elements currently stored.
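For intuition, a deliberately naive Python sketch of just that interface (label is simply the current rank under Python's natural ordering, so a single insert implicitly shifts many labels, which is roughly the cost the algorithms in the article try to keep small):

    import bisect

    class OrderedLabels:
        def __init__(self):
            self._items = []                  # kept in sorted order

        def insert(self, x):
            bisect.insort(self._items, x)

        def delete(self, x):
            self._items.remove(x)

        def label(self, x):
            # Label of x is its current rank, a number from 0 to n-1.
            return bisect.bisect_left(self._items, x)

    s = OrderedLabels()
    for title in ["Moby-Dick", "Emma", "War and Peace"]:
        s.insert(title)
    print(s.label("Moby-Dick"))  # 1, since "Emma" < "Moby-Dick" < "War and Peace"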


Off topic, but now I want a screen saver of the animation at the top of the article...


This probably belongs to the post on subpixel snake, not this one on the library sort algorithm :)

Wrong post

Sounds promising! What's the key innovation behind it?

Cool. This is Lisa and Laura, my two retired teachers who are now helping in the library to sort shelves.

Can you explain to them what you want them to now do?

I read the paper and I would like to say, it's confusing at best. But I'm hoping you can help out.


A lot of things are confusing at first but then we discover intuitive ways to explain and teach them to others.

Also, don’t discredit the intelligence of retirees, teachers and librarians in the same post. It’s bad for karma and makes you sound like a douche. I’m sure they can figure it out.


Wasn't my intent. The article was "how to do this" with lots of great pictures. Linked page was full of high level math. I was mocking the "This one simple trick can make your shelves full" attitude.

So you answered and I'll ask, "what do they do, it does not seem simple" except for the Database people filling ....

Thanks (former teacher)



