Inspecting C's qsort Through Animation (nullprogram.com)
120 points by nayuki on Nov 21, 2016 | 24 comments



Frame Count != sort speed. Cache effects are going to dominate. How close together the red dots are will more closely represent how fast it is.


> Cache effects are going to dominate.

I see this stated frequently, but I rarely see evidence to back it up.

I grabbed his code, removed the drawing function and tested it with perf [1]. (Using glibc's qsort).

First test was with 1 million elements in the array. That's 4 MB of data, so it fits in L3 cache with room to spare.

Perf reports an average of 1.02 instructions/cycle with a cache miss rate of 2.829%.

Second test: 10 million elements (40 MB). Doesn't fit in cache.

Perf says: 0.95 instructions/cycle, cache miss rate of 37.352%.

Third test has 500 million elements (2 GB). Fits in RAM but not cache.

1.01 instructions/cycle and a cache miss rate of 66.555%.

If the CPU were stalling frequently, waiting for data, I would expect the number of instructions/cycle to drop dramatically as the cache miss rate increases. It drops slightly from 4 MB to 40 MB, but inexplicably increases from 40 MB to 2 GB.

So I see two possibilities:

1. I'm dumb and misinterpreting the perf data.

2. The out-of-order core can work around data stalls so effectively for this algorithm that the cache miss rate doesn't really matter.

[1] https://perf.wiki.kernel.org/index.php/Main_Page
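
For anyone who wants to try reproducing this, a minimal harness along these lines (not the original code; the array size, comparator, and program name are my own choices) could look like the sketch below, run under something like "perf stat -d ./bench 10000000":

  /* Hypothetical benchmark harness: fill an array of n random ints,
     sort it with glibc's qsort, and let perf do the measuring. */
  #include <stdio.h>
  #include <stdlib.h>

  static int cmp_int(const void *a, const void *b)
  {
      int x = *(const int *)a, y = *(const int *)b;
      return (x > y) - (x < y);   /* avoids the overflow of x - y */
  }

  int main(int argc, char **argv)
  {
      size_t n = argc > 1 ? strtoull(argv[1], NULL, 10) : 1000000;
      int *data = malloc(n * sizeof *data);
      if (!data || n == 0)
          return 1;
      for (size_t i = 0; i < n; i++)
          data[i] = rand();
      qsort(data, n, sizeof *data, cmp_int);
      printf("%d ... %d\n", data[0], data[n - 1]);  /* keep the result observable */
      free(data);
      return 0;
  }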


I wonder if the speculatively executed instructions that are not used are still counted as a part of the instructions/cycle number? Definitely interesting results though.


If your comparison is sufficiently expensive, the frame count is an accurate representation.


Sedgewick did animations of quicksort and other algorithms back in the '80s. He coincidentally did his PhD on quicksort under Knuth. Jon Bentley gave a Google Talk about quicksort and the inputs that drive it to O(n^2): https://www.youtube.com/watch?v=QvgYAQzg1z8

His production quicksort, written with McIlroy, is widely used. It's in BSD and glibc (a non-recursive version -- some here are saying that glibc's qsort is a mergesort, but at least in the code I've read that's not true; perhaps there's some confusion over the memory glibc's qsort allocates for the pointer stack it uses to avoid recursion). The paper by Bentley and McIlroy, "Engineering a Sort Function", fills in many details that Bentley omits in his Google Talk.


Surprised that musl is the slowest. Anyone have any insight into why its qsort was coded like this?


I can wager a guess: on paper, smoothsort appears to be the better sorting algorithm, with identical or better complexity than Quicksort and Mergesort in basically every category. In addition, smoothsort works better on data that's close to being sorted.

Why then is it slower in this case, and not as widely used? Part of the reason is that smoothsort is simply not very widely known, so not as many people know how it works compared to Quicksort or Mergesort. Note also that, while complexity is important to take into account and looks good on paper, it leaves out the 'constants', which for heapsort (and thus smoothsort) generally come down to a higher number of comparisons.

All that noted, smoothsort (and heapsort) have the big advantages of being in-place (mergesort is not), requiring O(1) extra space in the worst case (Quicksort can require up to O(n) extra space), and having a guaranteed upper bound of O(n lg n) in the worst case (Quicksort can degenerate to O(n^2) performance with bad data). This means that heapsort may not always outperform Quicksort, but it gives better guarantees about speed and memory usage than Quicksort (and much better than Mergesort, which requires a lot of extra allocated memory).
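
For reference, here's the simpler relative of smoothsort: a hedged textbook heapsort sketch (sift-down on a binary max-heap over ints), not musl's smoothsort, just to make the in-place / O(1) extra space point concrete:

  #include <stddef.h>

  static void sift_down(int *a, size_t root, size_t end)  /* end is inclusive */
  {
      while (2 * root + 1 <= end) {
          size_t child = 2 * root + 1;
          if (child + 1 <= end && a[child] < a[child + 1])
              child++;                                  /* pick the larger child */
          if (a[root] >= a[child])
              return;
          int tmp = a[root]; a[root] = a[child]; a[child] = tmp;
          root = child;
      }
  }

  void heapsort_int(int *a, size_t n)
  {
      if (n < 2) return;
      for (size_t i = (n - 2) / 2 + 1; i-- > 0; )       /* build the max-heap */
          sift_down(a, i, n - 1);
      for (size_t end = n - 1; end > 0; end--) {        /* repeatedly extract the max */
          int tmp = a[0]; a[0] = a[end]; a[end] = tmp;
          sift_down(a, 0, end - 1);
      }
  }

Everything happens inside the input array, with only a handful of scalar locals, which is exactly the O(1) extra space property being discussed.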


Couldn't the implementation use a cheap check for near-sorting and only use smoothsort for near-sorted sequences, or use an insertion sort to increase sorted runs like timsort?


The original post links to this page: http://calmerthanyouare.org/2013/05/31/qsort-shootout.html

There is an explanation provided that may be the reason, but it is not entirely convincing.

  Smoothsort is a variation of heapsort that performs especially well on nearly sorted inputs. It’s a pretty complicated beast and it does take some effort to wrap one’s head around it. Keith Schwarz has posted a nice walkthrough that the interested reader should check out.

  Considering that musl libc puts simplicity among its design goals, the choice of smoothsort could seem odd. The code is however clean and readable and getting an O(n log n) worst case with near linear complexity for nearly sorted inputs is a pretty sweet deal.


Apparently, to optimize for low memory usage (from reading the comments in the commit)

http://git.musl-libc.org/cgit/musl/commit/?id=22263709eda9f7...


Doesn't quicksort have the least memory usage?


Quicksort still requires O(lg n) memory because of its recursive nature. Since (lg n) is generally small, it's not that big of a deal. But with bad data, it's possible for Quicksort to degenerate to O(n) extra memory, which could be an issue (Especially for machines with little stack space).

Heapsort (And thus smoothsort) are not recursive, so they require a constant amount of memory for all inputs.


What if you always recurse into the smaller half, and use tail-recursion/looping to process the rest? Doesn't that keep the space usage bounded to O(log(n))?


This is correct, and is noted in the Wikipedia article's discussion of memory usage[1].

(Which I was only reading because I was erroneously thinking that quicksort was O(1) memory – I was completely forgetting stack space.)

[1]: https://en.wikipedia.org/wiki/Quicksort#Space_complexity
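
A minimal sketch of that technique, assuming a plain Lomuto partition (this is illustrative, not any particular libc's code): recurse only into the smaller side and loop on the larger, so the stack depth stays at most about log2(n) even when the pivots are bad.

  #include <stddef.h>

  static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

  /* Lomuto partition around the last element; returns the pivot's final index. */
  static size_t partition(int *a, size_t lo, size_t hi)
  {
      int pivot = a[hi];
      size_t i = lo;
      for (size_t j = lo; j < hi; j++)
          if (a[j] < pivot)
              swap_int(&a[i++], &a[j]);
      swap_int(&a[i], &a[hi]);
      return i;
  }

  /* Sort a[lo..hi] inclusive; call as quicksort_smallfirst(a, 0, n - 1) for n > 0. */
  void quicksort_smallfirst(int *a, size_t lo, size_t hi)
  {
      while (lo < hi) {
          size_t p = partition(a, lo, hi);
          if (p - lo < hi - p) {                      /* left side is smaller */
              if (p > 0) quicksort_smallfirst(a, lo, p - 1);
              lo = p + 1;                             /* loop on the right side */
          } else {                                    /* right side is smaller */
              quicksort_smallfirst(a, p + 1, hi);
              if (p == 0) break;
              hi = p - 1;                             /* loop on the left side */
          }
      }
  }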


https://www.youtube.com/user/udiprod/videos has some fun/pretty sort animations that illustrate clearly the comparison and swaps separately.


I'd love to see this with different implementations of C++ std::sort.


Any particular reason none of the implementations just do a random pick of a pivot? That's usually good enough to prevent the O(n^2) case that the diet implementation runs into.
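
For illustration, randomized pivot selection is usually just a pre-step before partitioning. A hedged sketch (rand() and the modulo are purely for demonstration, and the replies below explain why a real libc might not want to do exactly this):

  #include <stdlib.h>

  /* Swap a randomly chosen element of a[lo..hi] into the pivot slot a[hi]
     before partitioning. Modulo bias and rand()'s quality are ignored here. */
  static void choose_random_pivot(int *a, size_t lo, size_t hi)
  {
      size_t k = lo + (size_t)rand() % (hi - lo + 1);
      int tmp = a[k]; a[k] = a[hi]; a[hi] = tmp;
  }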


A standard library sort should have predictable performance, so "usually" is not good enough. Moreover, to pick a random pivot you need a good source of randomness, and if you don't have one, an attacker could often predict the sequence of pivot choices and prepare an input that makes the "random" choice of pivot result in quadratic performance.


I experimented with this for std::sort in g++. It turns out it's really annoying when your sort function produces different output on different runs.

While quick sort doesn't guarantee the ordering of equal elements, you at least want it to be the same each time you run a particular compile of your program.


Why does the BSD implementation compare with copies of the data outside the original array? Is there a performance benefit to doing this?


EDIT: I skimmed the article and my previous answer was completely wrong.

The BSD implementation is quicksort, which doesn't have a duplicate (AKA auxiliary) array. Their implementation of mergesort has an auxiliary/duplicate array for the actual merging - that's common. In-place mergesorts exist, but the effort isn't usually worth it.

You can, however, eliminate the need for copying to the duplicate array (which they may do) by alternating the roles of the two arrays - one level takes input from the actual array and puts the sorted output in the duplicate, and the next takes input from the duplicate and puts it back into the original. Somehow, you can combine the two to speed up the algorithm (I personally don't know how, but an implementation can be found here: http://algs4.cs.princeton.edu/22mergesort/MergeX.java.html)
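
That trick is essentially the textbook "ping-pong" mergesort: the roles of the array and its auxiliary buffer swap at each level of recursion, so each merge writes into the other buffer and no copy-back pass is needed. A hedged sketch of the general idea (not BSD's actual code):

  #include <stdlib.h>
  #include <string.h>

  /* Merge sorted src[lo..mid) and src[mid..hi) into dst[lo..hi). */
  static void merge(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
  {
      size_t i = lo, j = mid, k = lo;
      while (k < hi)
          if (i < mid && (j >= hi || src[i] <= src[j]))
              dst[k++] = src[i++];
          else
              dst[k++] = src[j++];
  }

  /* Sort src[lo..hi) and leave the result in dst[lo..hi); the buffers'
     roles swap at each level, so no explicit copy-back is needed. */
  static void msort(int *src, int *dst, size_t lo, size_t hi)
  {
      if (hi - lo < 2) {
          if (hi - lo == 1) dst[lo] = src[lo];
          return;
      }
      size_t mid = lo + (hi - lo) / 2;
      msort(dst, src, lo, mid);     /* halves end up sorted in src... */
      msort(dst, src, mid, hi);
      merge(src, dst, lo, mid, hi); /* ...then merge them into dst */
  }

  void mergesort_int(int *a, size_t n)
  {
      int *aux = malloc(n * sizeof *aux);
      if (!aux) return;                   /* real code would signal the failure */
      memcpy(aux, a, n * sizeof *aux);    /* both buffers must start identical */
      msort(aux, a, 0, n);                /* final result lands back in a */
      free(aux);
  }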


Why the hell does glibc use quicksort as a fallback with no apparent protection against the O(n^2) case? Aren't there easy in-place O(n log n) sorts? Like heapsort?


Glibc falls back to quicksort from merge sort (its default algorithm) when there's too little free memory available to accommodate merge sort's O(n) requirement. Quicksort only uses O(log n) additional memory, making it a good option in that sense. O(1) would of course be better, but even for huge inputs that logarithm keeps the memory consumption a non-issue in most cases. Heapsort does give you O(1) memory, but it also comes with poorer cache locality and overall performance than quicksort, making it a less compelling option.


This is really awesome! I love seeing code visualized!





