Hacker News new | past | comments | ask | show | jobs | submit login
Fast Directory Listing on Linux (github.com/romkatv)
220 points by romka2 on April 7, 2019 | hide | past | favorite | 71 comments



A lovely way to sort strings, especially strings that may have long shared prefixes, is a 3-way partition quicksort. This allows you to avoid walking the same prefix over and over the way that memcmp does.

Pick your pivot element and partition the strings into <, =, and > based on the first character only. Note this differs from classic quicksort in that we're maintaining three regions, not two. Now recurse for all three regions, except for the = region you look at the second character only. etc.

It's probably a loss for filenames which tend to be short, but for long strings this is a very easy to implement speedup. There's lots more optimizations possible with this technique, e.g. counting-sort style bucketing.


That's a cool algorithm.

The improvements in the article look pretty small between different versions because CPU time is dominated by system calls that are out of my control. If you look at userspace CPU only, the speedup between v1 and v5 is 15x. In other words, the final version uses 15 times less CPU time in userspace. It also uses less CPU in the kernel, but the improvement there isn’t as dramatic (just the openat vs open optimization).

If some clever sorting optimization could yield infinite speedup and would optimize away all loop counters and pointer arithmetics, the overall performance of ListDir v5 would only increase by 3%. v5 is basically all kernel code, there is almost nothing left to optimize algorithm-wise.


That's a nice idea. Do you know of any programs that use this?

I know that CPython treats hash tables with string keys as a special case, for performance. I wonder if it might also make sense to treat sorting lists of strings as a special case (or maybe it already does, I will check ...).

Hm yeah it looks like Python's timsort is meant to minimize the number of comparison operations (see Objects/listsort.txt). But it doesn't try to make the comparisons themselves cheaper. That would break the interface to some extent, i.e. it doesn't really fit within the model.

Interesting to think about though!


> There's lots more optimizations possible with this technique, e.g. counting-sort style bucketing.

:) I was going to say that since you are half re-inventing bucket-sort anyway, why not go the whole hog.

Is the code three-part quicksort simpler and easier to understand than a bucket sort?


That's interesting, it's effectively doing a radix sort but in a more scalable way.


Would reversing the strings or writing a custom comparer to compare in reverse be simpler and get to the more significant bits quicker? You could maybe even do the file extension comparison at a later point.


what about symbol trees ?

/a/b/c/d/e => {a,b,c,d,e} symbol table + [a,b,c,d,e]

and you accumulate into a tree/trie


Tries/trees have pointers which can be slow, so I’d measure before using something like this.


You can get faster than memcmp by rolling your own SSE/AVX compare.

memcmp typically has a few branches and instruction cachelines of just trying to verify that it's running on aligned memory and checking how aligned (ie 8byte stride vs 64 byte stride). All that can be skipped with a for-loop of intrinsics if you the programmer know alignment characteristics the compiler cannot infer.


Thanks, that's good to know.

The final version of ListDir calls memcmp only when there are files in the same directory that have identical first 8 characters. Apparently, this is rare enough that memcmp doesn't show on the CPU profile. But if it ever does, I'll look into replacing it with something else.


You should try with the pathological but relatively common case of thousands of files named 'logname.YYYYMMDD.log.gz'


The final implementation of ListDir has two worse-case scenarios. One is when all files have identical 8 first character but then differ almost immediately. This is bad because I'm telling memcmp that I have 256 bytes of data, which causes it to use the vectorized loop only to exit it on the first iteration. Another is when all files are 255 characters long and differ at the very end. This is bad because string comparisons become very expensive. Even though I'm not showing these benchmarks in the article, this implementation performs very well even in these cases.


The only downside is that the correctness of your code is relying on alignment that the compiler isn't able to infer, and for which there is deliberately no run-time check.


dont you want the compiler to do this for you, with ARCH flags? a bit of foresight and a little checking seems like enough.. excessive ASM is a mistake these days

  CXXFLAGS -march=sandybridge -mtune=sandybridge


What part of the code do you expect to get faster with `-march=sandybridge -mtune=sandybridge`?

> excessive ASM is a mistake these days

There is no ASM in the article.


The point is that the compiler does not have enough information to do this automatically.


.. meant to say, after rewriting code to avoid (this,that) extra actions, then let the compiler make the optimized code.. not saying "just re-run the old code with CFLAGS thus" ..


A really important optimization when scanning directories is to avoid doing the stat on each object just to get its type. On Linux, the struct dirent has a d_type (not specified by POSIX and not supported by all filesystems) which forward the type of the object from the inode to the directory entry. When portable programs that use stat to get basic type information (like "is this a directory or a regular file") are converted to use d_type, the speedup is dramatic.


Yep, this is mentioned in the doc.

> [...] every element in entries has d_type at offset -1. This can be useful to the callers that need to distinguish between regular files and directories (gitstatusd, in fact, needs this). Note how ListDir() implements this feature at zero cost, as a lucky accident of dirent64_t memory layout.


Wow, TIL. Thanks for sharing!


I think this is an awesome article and a really cool idea on how the author incremetally improved their code for performance. But is the API as easy to use for a programmer? I find a complex API into a function makes it much harder to understand even if its your own code months later. I also feel like the parent_fd parameter is sort of glossed over and is an exercise on how to pass that into the final versions of the function since it's just expected to be there in those versions.


The real implementation of ListDir accepts the descriptor of the directory it needs to list (not parent_fd plus dirname like it's done in the article) and doesn't close it. This is fairly straightforward. You still have to pass Arena as an extra parameter though, which adds inconvenience, and d_type is still at -1 offset -- a rather unusual thing for an API.

The biggest downside from the API perspective is that directory listing and sorting are bundled in a single function. The insight of v5 in the article is that this bundling allows us to achieve higher performance than what we can get if we have a separate API for listing which we can compose with sorting.

So it's a far cry from the cleanest API you can imagine. Levels of abstractions often have to give way when maximum performance is the goal.


Since this is C++, what about using a thread_local Arena for the storage? Then you could simplify the API a bit and not re-allocate too!

I also agree, the parent_fd is trivial but it isn't as easy to use as no parameter.


The API at gitstatusd level is simple. The ListDir API is also relatively simple. The only complication is to pass in an arena which is not a very high bar.


>As an added bonus, those casts will fend off the occasional frontend developer who accidentally wanders into the codebase.

Beautiful lol


  Every directory has entries "." and "..", which we aren't interested in. We filter them out with a helper function Dots().

  bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }

why would you iterate over every single file entry? both . .. entries will land at one end of sorted list anyway.


"." and ".." are filtered out before sorting. Sorting is O(N^2) in the common case (when there are fewer than 65 files in a directory), so it pays off to reduce the number of entries by 2 before we get there.


n log n? and arent . .. entries always at the beginning of the list? could just skip first 2 elements (or stop comparing after filtering . ..). What about scandir?


> n log n?

From the article: Digging into the source code of std::sort() we can see that it uses Insertion Sort for short collections. Our 32-element vector falls under the threshold. Insertion Sort makes O(N^2) comparisons

> arent . .. entries always at the beginning of the list?

No.


Assuming that, branch predictor should make that check close to free. But the code still works even when that assumption is violated.


You forgot to quote/code block, and to add a second newline


I hope nobody ever uses code blocks. They're unreadable on mobile.


If you're not hitting the caches then there's a lot more to gain by optimizing IO patterns, either by traversing multiple directories in parallel (to fill SSD command queues) or by performing readaheads on the directories (to be friendly to HDD elevators). Sadly the latter is somewhere between difficult and impossible .


gitstatusd calls ListDir in parallel from multiple threads. At least with a fast SSD it's CPU bound. I don't have an HDD to test on.


have you dropped disk caches before each bench iteration?


No, I did the opposite. I made sure the disk caches are warm before each benchmark. Since all versions of ListDir are identical in terms of IO demands, warming up caches is an effective way to reduce benchmark variability and to make performance differences of different code versions easier to detect without changing their order on the performance ladder.


That's reasonable when optimizing the average case, but not for the worst case.


I would agree with this sentence if "optimizing" were replaced with "making marketing claims". Optimization is the process of finding the fastest implementation. In the case of ListDir each successive version is faster than the last on a benchmark with warm IO caches, therefore it'll be faster on the same benchmark with cold IO caches (this is not a general claim; I'm talking specifically about ListDir). Benchmarking with warm IO caches is easier (invalidating caches is generally very difficult) and yields more reliable and more actionable results, hence it's better to benchmark with hot caches. It has nothing to do with the average vs worst case.


In my experience, another important component of fast directory listings is a cron job that does “ls -R” every half hour. Because you can't control the eagerness of fs metadata caches even if your fileserver has plenty of free mem.


You can actually tune some VFS parameters to make it less likely to evict the VFS caches (vfs_cache_pressure), however, it's not well documented which settings work the best other than to set it to '1' for 'mostly don't' which mostly works well only if you have way more RAM than you need and will tend to go quite badly if that is not the case. I think the only way to determine that well would be to benchmark it for your specific use case. But it may help you over a recurring "ls -R"


> returning vector<string> is an unaffordable convenience.

Not anymore since C++11.

Reusing vector storage is good, but you can still use move on parameter and not the reference.

Although, in this case, reference is probably easier to write.


I think he was concerned about the cost of allocating separate strings that go into the vector.


Yep, every string is potentially an allocation (unless it's short and std::string implements Small String Optimization) plus O(log N) allocations by the vector itself.

C++11 didn't make returning the vector in this function faster because it's written in a way to take advantage of RVO. It did make growing the vector faster though -- individual strings now get moved instead of copied.


I wonder how this compares to filesystem traversal APIs like fts/ftw?


fts and ftw are glibc wrappers over glibc wrappers that I had to bypass for better performance. They are made for convenience, not for speed.


Also: last time I looked at fts in the context of glibc, it did not support 64 bit file offset builds (-D_FILE_OFFSET_BITS=64).

I was looking at fts because there exists a BSD licensed implementation of nftw in terms of fts; I was researching the possibility of creating a semantically extended/enriched version of nftw, without coding it entirely from scratch. So I plonked that implementation into my program and, lo and behold, error message from glibc's fts header file about not supporting 64 bit file offsets.


What about using nftw directly instead of opendir/readdir


nftw is a wrapper over opendir/readdir, not the other way around.


But why? Why create a tool that does `git status` 10x faster?


The job of gitstatusd is to put Git status in your shell prompt. See screenshot at the top of https://github.com/romkatv/gitstatus. This is an amazingly convenient tool.

When you are working on chromium, on every command you type gitstatusd needs to list the contents of 25,000 directories. Low level optimizations like the ones described here are what makes gitstatusd 10 times faster than `git status`, which in turn makes prompt responsive when otherwise it would be sluggish.


Isn't inotify meant for exactly this kind of thing, though? You are already introducing a hard dependency on Linux syscalls in "v4" of the optimisations, so it would seem advantageous to make use of that to avoid the full directory traversal for most of the time.


This directory-listing code only executes on a "cold" run of gitstatusd, which happens when you `cd` into a repo for the first time. Users obviously can tolerate higher prompt latency in this case but faster is still better.


inotify requires you to hold open fds for all the dirs and files you're watching iirc, so in repos that large, you'll be crushed by the file-descriptor-per-process limit.


It doesn't, you might need to tweak its limits via /proc/sys/fs/inotify/max_* on such repositories, but those limits are not the same as the much lower open FD limits (as in ulimit -n ...).


I've posted some details explaining why gitstatusd doesn't use inotify in https://github.com/romkatv/gitstatus/commit/050aaaa04b652e15.... The short version is that the default max_user_watches limit is much too low to be useful for gitstatusd and I cannot ask or expect users to change their system settings.


Your documentation indicates that you've tried with core.untrackedCache, but have you tried core.fsmonitor with the watchman hook?

Obviously your "gitstatus" is still generally useful, but if your main itch you're trying to scratch is to have a faster "status" in a particular repo then you should be able to get that down to a few milliseconds with "git status", if you're willing to have watchman sit in the background and monitor it.


> [...] have you tried core.fsmonitor with the watchman hook?

I read the docs but it seems like it's not something I can enable on users' machines. Or maybe there is a way to take advantage of it even if it's not enabled? I really haven't looked much into it.

To clarify, my main motivation isn't to make my own prompt latency low (I don't even work on large git projects) but to make Powerlevel10k as good a product as possible. Git latency is a pain point for the existing users, so I work on optimizing it.

I also cannot rely on users to enable untracked cache even if their filesystem allows it. So gitstatusd performs tests similar to `git update-index --test-untracked-cache` in the background and builds its own untracked cache if tests pass. This makes for a good user experience with no setup or configuration fiddling.


You could use it if available, and fall back to existing behavior if not. (Not sure if it would be faster, but probably worth experimenting with.)


Because `git status` takes a really long time if the repository is large. And no, the linux kernel is not considered a large repository, there are far larger ones around, mostly private at companies, such as Microsoft and others.


Because you might want to display an up-to-date status permanently in a UI of some sort (e.g. an IDE or editor status bar or a shell prompt — the latter seems to be the use case here), and on big repositories it can be noticeable. Takes 150ms on the rust repository on my machine, with warm cache, that's a serious stutter when it runs on every shell prompt.

The tool's page specifically quotes Chromium which takes ~300ms to status on the author's system.


Why create a computer when an abacus counts just as effectively :) Time, precious time! We're all born with a limited number of breaths, let's not waste them sighing while waiting for a simple task to complete. In this way faster tooling is a very real method otherwise pure-tech projects can prolong the lives of people. Fix a slow popular tool and the effect is multiplied


Why be slow when you can be fast?


- So you can be fast at changing direction.

- So you mitigate the damage if you crash.


> As an added bonus, those casts will fend off the occasional frontend developer who accidentally wanders into the codebase.

https://blog.aurynn.com/2015/12/16-contempt-culture

Code that intentionally celebrates its unapproachability is not a badge of honor or pride, even as a joke. It might occasionally be an unavoidable necessity, in which case it needs enough documentation that the next person who has to deal with it has the best chance possible.


You are quoting a self-deprecating joke. I care deeply about my code being readable, which can be quite challenging when writing high-performance code close to the metal. I'm self-consciously aware of every tricky bit I put in my code. I have to be confident it's worth it (usually because it makes things faster in the real world) and offset the added complexity by writing documentation.


I mean, I think that was a bit of a joke. It's ironic that you rail against this dude for "code that's hard to understand" in reply to a long-form prose article about understanding the code and how it got there..

We'll probably have to agree to disagree.. but that blog post doesn't really resonate for me... especially the end where it talks about keep it to your "own language." I've never had patience for identity politics especially in the case of a god-damn programming language.


> It's ironic that you rail against this dude for "code that's hard to understand"

I'm sure that it was a joke, and that doesn't make it better. I'm not suggesting that this code was hard to understand, and I enjoyed the article. I'm calling out this particular point, and suggesting that even as a joke, mocking "frontend developers" is very much from the same territory that promotes "Real Programming" and mocks people for their choice of programming language or technology.


I mean it is pretty true that most frontend developers do not have the inclination to go trodding into a C++ code base making syscalls without a wrapper. Even taken uncharitably seriously, that is all the comment was insinuating. Everything else is added by you the reader.

And anyway I'm fine with exclusively frontend devs staying out of systems stuff. Anyone that wants to read OS kernel and compiler source code has tons of choices. Most I've talked to find it dry and boring. They don't belong there... that's fine.


This comment is off topic and trying to bait an argument off of a single throwaway line in the article.


That was clearly tongue-in-cheek. No need to rail against the author. Lay down the outrage pitchforks man.


Where are the outrage pitchforks? The comment was extremely calm. Why is any mild, polite criticism or disagreement perceived as "railing against the author"??




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: