A lovely way to sort strings, especially strings that may have long shared prefixes, is a 3-way partition quicksort (also known as multi-key quicksort). It lets you avoid walking the same shared prefix over and over, the way repeated memcmp-based comparisons do.
Pick your pivot element and partition the strings into <, =, and > based on the first character only. Note this differs from classic quicksort in that we're maintaining three regions, not two. Now recurse into all three regions, except that in the = region you compare on the second character only, and so on.
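Here's a minimal sketch of the idea in C++ (illustrative only; the names, pivot choice, and structure are mine, not from the article or any particular library):

  #include <cstddef>
  #include <utility>
  #include <vector>

  // Sort v[lo, hi) by comparing characters starting at position d.
  // Invariant: all strings in the range share the same first d characters.
  void Sort3Way(std::vector<const char*>& v, size_t lo, size_t hi, size_t d) {
    if (hi - lo < 2) return;
    char pivot = v[lo + (hi - lo) / 2][d];
    size_t lt = lo, i = lo, gt = hi;  // [lo,lt) < pivot, [lt,i) == pivot, [gt,hi) > pivot
    while (i < gt) {
      char c = v[i][d];
      if (c < pivot)       std::swap(v[i++], v[lt++]);
      else if (c > pivot)  std::swap(v[i], v[--gt]);
      else                 ++i;
    }
    Sort3Way(v, lo, lt, d);                          // < region: same character position
    if (pivot != '\0') Sort3Way(v, lt, gt, d + 1);   // = region: advance to the next character
    Sort3Way(v, gt, hi, d);                          // > region: same character position
  }

  // Usage: Sort3Way(names, 0, names.size(), 0);

The = branch is the payoff: once a prefix has been matched, it is never examined again.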
It's probably a loss for filenames, which tend to be short, but for long strings this is a very easy-to-implement speedup. There are lots more optimizations possible with this technique, e.g. counting-sort-style bucketing.
The improvements in the article look pretty small between different versions because CPU time is dominated by system calls that are out of my control. If you look at userspace CPU only, the speedup between v1 and v5 is 15x. In other words, the final version uses 15 times less CPU time in userspace. It also uses less CPU in the kernel, but the improvement there isn’t as dramatic (just the openat vs open optimization).
If some clever sorting optimization could yield infinite speedup and optimize away all loop counters and pointer arithmetic, the overall performance of ListDir v5 would only increase by 3%. v5 is basically all kernel code; there is almost nothing left to optimize algorithm-wise.
That's a nice idea. Do you know of any programs that use this?
I know that CPython treats hash tables with string keys as a special case, for performance. I wonder if it might also make sense to treat sorting lists of strings as a special case (or maybe it already does, I will check ...).
Hm yeah it looks like Python's timsort is meant to minimize the number of comparison operations (see Objects/listsort.txt). But it doesn't try to make the comparisons themselves cheaper. That would break the interface to some extent, i.e. it doesn't really fit within the model.
Would reversing the strings or writing a custom comparer to compare in reverse be simpler and get to the more significant bits quicker? You could maybe even do the file extension comparison at a later point.
You can get faster than memcmp by rolling your own SSE/AVX compare.
memcmp typically spends a few branches and instruction cache lines just verifying that it's running on aligned memory and checking how aligned (i.e. 8-byte stride vs 64-byte stride). All of that can be skipped with a for-loop of intrinsics if you, the programmer, know alignment characteristics the compiler cannot infer.
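For example, here is a minimal sketch of an equality check (equality only, not memcmp's three-way result) where the caller guarantees both buffers are 16-byte aligned and the length is a multiple of 16 -- assumptions a general-purpose memcmp cannot make:

  #include <emmintrin.h>  // SSE2 intrinsics
  #include <cstddef>

  bool Equal16Aligned(const void* a, const void* b, size_t n) {
    const __m128i* pa = static_cast<const __m128i*>(a);
    const __m128i* pb = static_cast<const __m128i*>(b);
    for (size_t i = 0; i < n / 16; ++i) {
      // Aligned 16-byte loads; no prologue branches, no alignment checks.
      __m128i eq = _mm_cmpeq_epi8(_mm_load_si128(pa + i), _mm_load_si128(pb + i));
      if (_mm_movemask_epi8(eq) != 0xFFFF) return false;  // some byte differed
    }
    return true;
  }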
The final version of ListDir calls memcmp only when there are files in the same directory that have identical first 8 characters. Apparently, this is rare enough that memcmp doesn't show on the CPU profile. But if it ever does, I'll look into replacing it with something else.
The final implementation of ListDir has two worst-case scenarios. One is when all files have identical first 8 characters but then differ almost immediately. This is bad because I'm telling memcmp that I have 256 bytes of data, which causes it to use the vectorized loop only to exit it on the first iteration. Another is when all files are 255 characters long and differ at the very end. This is bad because string comparisons become very expensive. Even though I'm not showing these benchmarks in the article, this implementation performs very well even in these cases.
The only downside is that the correctness of your code relies on alignment that the compiler isn't able to infer, and for which there is deliberately no run-time check.
Don't you want the compiler to do this for you, with ARCH flags? A bit of foresight and a little checking seems like enough... excessive ASM is a mistake these days.
... meant to say: after rewriting the code to avoid this or that extra action, then let the compiler generate the optimized code... not saying "just re-run the old code with CFLAGS thus"...
A really important optimization when scanning directories is to avoid doing a stat on each object just to get its type. On Linux, struct dirent has a d_type field (not specified by POSIX and not supported by all filesystems) which forwards the type of the object from the inode to the directory entry. When portable programs that use stat to get basic type information (like "is this a directory or a regular file?") are converted to use d_type, the speedup is dramatic.
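A minimal sketch of the pattern (not the article's ListDir; the function name and output are mine): use d_type when the filesystem fills it in, and fall back to fstatat() only when it reports DT_UNKNOWN.

  #include <dirent.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>

  void PrintTypes(int dir_fd) {
    DIR* dir = fdopendir(dir_fd);  // takes ownership of dir_fd
    if (!dir) return;
    while (struct dirent* e = readdir(dir)) {
      bool is_dir;
      if (e->d_type != DT_UNKNOWN) {
        is_dir = (e->d_type == DT_DIR);  // type came from the directory entry: no extra syscall
      } else {
        struct stat st;                  // filesystem didn't fill d_type; fall back to a stat
        if (fstatat(dirfd(dir), e->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0) continue;
        is_dir = S_ISDIR(st.st_mode);
      }
      printf("%s%s\n", e->d_name, is_dir ? "/" : "");
    }
    closedir(dir);
  }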
> [...] every element in entries has d_type at offset -1. This can be useful to the callers that need to distinguish between regular files and directories (gitstatusd, in fact, needs this). Note how ListDir() implements this feature at zero cost, as a lucky accident of dirent64_t memory layout.
I think this is an awesome article and a really cool demonstration of how the author incrementally improved their code for performance. But is the API as easy to use for a programmer? I find that a complex API for a function makes it much harder to understand, even if it's your own code months later. I also feel like the parent_fd parameter is sort of glossed over; it's left as an exercise how to pass that into the final versions of the function, since it's just expected to be there in those versions.
The real implementation of ListDir accepts the descriptor of the directory it needs to list (not parent_fd plus dirname like it's done in the article) and doesn't close it. This is fairly straightforward. You still have to pass Arena as an extra parameter though, which adds inconvenience, and d_type is still at -1 offset -- a rather unusual thing for an API.
The biggest downside from the API perspective is that directory listing and sorting are bundled in a single function. The insight of v5 in the article is that this bundling allows us to achieve higher performance than what we can get if we have a separate API for listing which we can compose with sorting.
So it's a far cry from the cleanest API you can imagine. Levels of abstraction often have to give way when maximum performance is the goal.
The API at gitstatusd level is simple.
The ListDir API is also relatively simple. The only complication is passing in an arena, which is not a very high bar.
Every directory has entries "." and "..", which we aren't interested in. We filter them out with a helper function Dots().
bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }
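As a quick sanity check of that one-liner (illustrative only, using Dots() as defined above): it accepts exactly "." and "..", and lets everything else through, including other dotfiles.

  #include <cassert>

  int main() {
    assert(Dots("."));
    assert(Dots(".."));
    assert(!Dots("..."));   // three dots is a legal (if odd) filename
    assert(!Dots(".git"));  // hidden files are kept
    assert(!Dots("a"));
  }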
Why would you iterate over every single file entry? Both "." and ".." will land at one end of the sorted list anyway.
"." and ".." are filtered out before sorting. Sorting is O(N^2) in the common case (when there are fewer than 65 files in a directory), so it pays off to reduce the number of entries by 2 before we get there.
N log N? And aren't the "." and ".." entries always at the beginning of the list? You could just skip the first 2 elements (or stop comparing once "." and ".." have been filtered out). What about scandir?
From the article: Digging into the source code of std::sort() we can see that it uses Insertion Sort for short collections. Our 32-element vector falls under the threshold. Insertion Sort makes O(N^2) comparisons
> aren't "." and ".." entries always at the beginning of the list?
If you're not hitting the caches, there's a lot more to gain by optimizing IO patterns, either by traversing multiple directories in parallel (to fill SSD command queues) or by performing readaheads on the directories (to be friendly to HDD elevators). Sadly the latter is somewhere between difficult and impossible.
No, I did the opposite. I made sure the disk caches are warm before each benchmark. Since all versions of ListDir are identical in terms of IO demands, warming up caches is an effective way to reduce benchmark variability and to make performance differences of different code versions easier to detect without changing their order on the performance ladder.
I would agree with this sentence if "optimizing" were replaced with "making marketing claims". Optimization is the process of finding the fastest implementation. In the case of ListDir each successive version is faster than the last on a benchmark with warm IO caches, therefore it'll also be faster on the same benchmark with cold IO caches (this is not a general claim; I'm talking specifically about ListDir). Benchmarking with warm IO caches is easier (invalidating caches is generally very difficult) and yields more reliable and more actionable results, hence it's better to benchmark with warm caches. It has nothing to do with the average vs worst case.
In my experience, another important component of fast directory listings is a cron job that does “ls -R” every half hour. Because you can't control the eagerness of fs metadata caches even if your fileserver has plenty of free mem.
You can actually tune some VFS parameters to make the kernel less likely to evict the VFS caches (vfs_cache_pressure). However, it's not well documented which settings work best, other than setting it to '1' for 'mostly don't evict', which works well only if you have far more RAM than you need and tends to go quite badly if that's not the case. I think the only way to determine the right value is to benchmark it for your specific use case. But it may work better for you than a recurring "ls -R".
Yep, every string is potentially an allocation (unless it's short and std::string implements Small String Optimization) plus O(log N) allocations by the vector itself.
C++11 didn't make returning the vector in this function faster because it's written in a way to take advantage of RVO. It did make growing the vector faster though -- individual strings now get moved instead of copied.
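A minimal sketch of the pattern being discussed (not the article's code; the function name and path handling are mine): the vector itself comes back via RVO/NRVO with no copy, and since C++11 any reallocation during growth moves the std::string elements instead of copying their heap buffers.

  #include <dirent.h>
  #include <string>
  #include <vector>

  std::vector<std::string> ListNames(const char* path) {
    std::vector<std::string> names;        // each element is a potential heap allocation
    if (DIR* dir = opendir(path)) {
      while (struct dirent* e = readdir(dir)) {
        names.emplace_back(e->d_name);     // short names may stay inline via SSO
      }
      closedir(dir);
    }
    return names;                          // NRVO: the vector is constructed in the caller's slot
  }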
Also: last time I looked at fts in the context of glibc, it did not support 64-bit file offset builds (-D_FILE_OFFSET_BITS=64).
I was looking at fts because there exists a BSD-licensed implementation of nftw in terms of fts; I was researching the possibility of creating a semantically extended/enriched version of nftw without coding it entirely from scratch. So I plonked that implementation into my program and, lo and behold, got an error message from glibc's fts header file about not supporting 64-bit file offsets.
The job of gitstatusd is to put Git status in your shell prompt. See screenshot at the top of https://github.com/romkatv/gitstatus. This is an amazingly convenient tool.
When you are working on Chromium, on every command you type gitstatusd needs to list the contents of 25,000 directories. Low-level optimizations like the ones described here are what makes gitstatusd 10 times faster than `git status`, which in turn keeps the prompt responsive when it would otherwise be sluggish.
Isn't inotify meant for exactly this kind of thing, though? You are already introducing a hard dependency on Linux syscalls in "v4" of the optimisations, so it would seem advantageous to make use of that to avoid the full directory traversal for most of the time.
This directory-listing code only executes on a "cold" run of gitstatusd, which happens when you `cd` into a repo for the first time. Users obviously can tolerate higher prompt latency in this case but faster is still better.
inotify requires you to hold open fds for all the dirs and files you're watching iirc, so in repos that large, you'll be crushed by the file-descriptor-per-process limit.
It doesn't. You might need to tweak its limits via /proc/sys/fs/inotify/max_* on such repositories, but those limits are not the same as the much lower open-FD limits (as in ulimit -n ...).
I've posted some details explaining why gitstatusd doesn't use inotify in https://github.com/romkatv/gitstatus/commit/050aaaa04b652e15.... The short version is that the default max_user_watches limit is much too low to be useful for gitstatusd and I cannot ask or expect users to change their system settings.
Your documentation indicates that you've tried with core.untrackedCache, but have you tried core.fsmonitor with the watchman hook?
Obviously your "gitstatus" is still generally useful, but if the main itch you're trying to scratch is a faster "status" in a particular repo, you should be able to get that down to a few milliseconds with "git status" itself, if you're willing to have watchman sit in the background and monitor it.
> [...] have you tried core.fsmonitor with the watchman hook?
I read the docs but it seems like it's not something I can enable on users' machines. Or maybe there is a way to take advantage of it even if it's not enabled? I really haven't looked much into it.
To clarify, my main motivation isn't to make my own prompt latency low (I don't even work on large git projects) but to make Powerlevel10k as good a product as possible. Git latency is a pain point for the existing users, so I work on optimizing it.
I also cannot rely on users to enable untracked cache even if their filesystem allows it. So gitstatusd performs tests similar to `git update-index --test-untracked-cache` in the background and builds its own untracked cache if tests pass. This makes for a good user experience with no setup or configuration fiddling.
Because `git status` takes a really long time if the repository is large. And no, the Linux kernel is not considered a large repository; there are far larger ones around, mostly private ones at companies such as Microsoft and others.
Because you might want to display an up-to-date status permanently in a UI of some sort (e.g. an IDE or editor status bar or a shell prompt — the latter seems to be the use case here), and on big repositories it can be noticeable. Takes 150ms on the rust repository on my machine, with warm cache, that's a serious stutter when it runs on every shell prompt.
The tool's page specifically quotes Chromium which takes ~300ms to status on the author's system.
Why create a computer when an abacus counts just as effectively :) Time, precious time! We're all born with a limited number of breaths; let's not waste them sighing while waiting for a simple task to complete. In this way, faster tooling is a very real way for otherwise pure-tech projects to prolong people's lives. Fix a slow popular tool and the effect is multiplied.
Code that intentionally celebrates its unapproachability is not a badge of honor or pride, even as a joke. It might occasionally be an unavoidable necessity, in which case it needs enough documentation that the next person who has to deal with it has the best chance possible.
You are quoting a self-deprecating joke. I care deeply about my code being readable, which can be quite challenging when writing high-performance code close to the metal. I'm self-consciously aware of every tricky bit I put in my code. I have to be confident it's worth it (usually because it makes things faster in the real world) and offset the added complexity by writing documentation.
I mean, I think that was a bit of a joke. It's ironic that you rail against this dude for "code that's hard to understand" in reply to a long-form prose article about understanding the code and how it got there..
We'll probably have to agree to disagree... but that blog post doesn't really resonate with me, especially the end where it talks about keeping it to your "own language." I've never had patience for identity politics, especially in the case of a god-damn programming language.
> It's ironic that you rail against this dude for "code that's hard to understand"
I'm sure that it was a joke, and that doesn't make it better. I'm not suggesting that this code was hard to understand, and I enjoyed the article. I'm calling out this particular point, and suggesting that even as a joke, mocking "frontend developers" is very much from the same territory that promotes "Real Programming" and mocks people for their choice of programming language or technology.
I mean it is pretty true that most frontend developers do not have the inclination to go trodding into a C++ code base making syscalls without a wrapper. Even taken uncharitably seriously, that is all the comment was insinuating. Everything else is added by you the reader.
And anyway I'm fine with exclusively frontend devs staying out of systems stuff. Anyone that wants to read OS kernel and compiler source code has tons of choices. Most I've talked to find it dry and boring. They don't belong there... that's fine.
Where are the outrage pitchforks? The comment was extremely calm. Why is any mild, polite criticism or disagreement perceived as "railing against the author"??