Why does musl make my Rust code so slow? (andygrove.io)
154 points by andygrove on May 5, 2020 | 72 comments




Thanks! That really does seem to be the issue, and I wouldn't have known about this had I not asked. I will try this out and will update the blog post in ~8 hours' time.


IME allocation is one of the main things making Rust programs slow without diving into the more arcane stuff. So looking into unnecessary allocations and/or the performance of the allocator would be one of the first things to do (right after checking that you're compiling with optimisations).
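
As a trivial sketch of the kind of thing to look for (names and numbers made up), compare allocating inside a hot loop with pre-sizing and reusing a single buffer:

    // Hypothetical helper: builds one pre-sized String instead of allocating
    // a fresh String (or intermediate Vec) on every iteration of the loop.
    fn join_lines(lines: &[&str]) -> String {
        let mut out = String::with_capacity(lines.iter().map(|l| l.len() + 1).sum());
        for line in lines {
            out.push_str(line);
            out.push('\n');
        }
        out
    }

    fn main() {
        let lines = ["alpha", "beta", "gamma"];
        print!("{}", join_lines(&lines));
    }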

Given your CPU graphs, and the large number of cores, I expect musl's allocator simply has very poor behaviour with respect to multithreading (e.g. limited or no threadlocal arenas, size-classing, etc…) leading to a lot of crosstalk, extreme contention on allocations, etc...
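
For anyone unfamiliar with the jargon: size-classing just means rounding each request up to a fixed menu of chunk sizes, so freed chunks within a class are interchangeable and no split/merge bookkeeping is needed. A toy sketch with made-up class sizes:

    // Toy size classes (not any real allocator's values).
    const CLASSES: &[usize] = &[16, 32, 48, 64, 96, 128, 192, 256];

    // Round a request up to the smallest class that fits; None means a "large"
    // allocation, which real allocators hand off to mmap or similar.
    fn size_class(request: usize) -> Option<usize> {
        CLASSES.iter().copied().find(|&c| c >= request)
    }

    fn main() {
        for req in [1, 17, 100, 200, 300] {
            println!("{req:>3} bytes -> class {:?}", size_class(req));
        }
    }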


Tbh allocations (and related memory management things) are often the low hanging fruit in big picture optimization in many languages, incl c++.


Without familiarity with rust, I wasn't sure what they meant by "system allocator". Apparently that means libc's malloc. (Or HeapAlloc on Windows)

So I guess they statically link jemalloc but can optionally use libc malloc.


It's the other way around now, though the post that you linked is from an earlier period when things worked differently. By default today, Rust programs use the same allocator that C programs use, which I think is provided by libc on Linux, and you have the option of using a custom allocator like jemalloc. Historically, however, all Rust programs used jemalloc by default.
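
For reference, opting into jemalloc is only a few lines today; a minimal sketch, assuming the third-party jemallocator crate is added to Cargo.toml:

    use jemallocator::Jemalloc;

    // Route every heap allocation in this binary through jemalloc instead of
    // the default libc allocator.
    #[global_allocator]
    static GLOBAL: Jemalloc = Jemalloc;

    fn main() {
        let v: Vec<u64> = (0..1_000_000).collect();
        println!("{}", v.len());
    }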


I suppose there's no option to link statically against glibc because of the implications with LGPL (static linking triggers additional license clauses).


I think it's mainly that glibc has poor support for statically linking.


Ah, okay. Searched a bit, and it apparently requires you to find your own DNS client lookup library, avoid dlopen(), set GCONV_PATH, and so on.


For a long time glibc was maintained by someone who was of the opinion that if you wanted to statically link 3rd party libraries, then you shouldn't be allowed to build binaries.


It's interesting because there is an opposite cult (I believe some prominent plan9 and go people are adherents) that believes dynamic linking should literally not be a thing. I even recall reading some essays that introduction of a dynamic linker was some kind of tragic downfall in Unix history.

I think the truth is that both have costs and benefits. Dynamic linking is good for security patches, memory and disk usage. But it creates new opportunities for problems; for example, ABI breakage becomes a more significant problem and needs a lot of care to avoid. People distributing code, be it to end users or app stores or on servers with things like chroots, jails and containers, need to carry their dependencies anyway, negating some of the benefits.


Note also that while it probably is a net-win for security to use dynamic linking, it's not 100% a win. I've seen security vulnerabilities introduced because of upgrades to dependencies.


Well, and LD_LIBRARY_PATH type vulnerabilities.


In general, do not expect dlopen() to work in statically linked programs (in any libc).


A key phrase I found googling "rust system allocator" is "provided by the operating system". On Unix, a dynamically linked libc would certainly fit the bill.

On Windows the C runtime is technically provided by the compiler and its runtime libraries, so it also makes sense that they use HeapAlloc from Win32 there. (It's also common on Windows for different DLLs in the same process to have separate, incompatible libc mallocs, so I guess it might sidestep that too.)
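
Rust does expose that choice by name: std::alloc::System is the "system allocator". Pinning it explicitly is redundant today since it's already the default, but it makes the choice visible; a minimal sketch:

    use std::alloc::System;

    // libc malloc on Unix, HeapAlloc on Windows.
    #[global_allocator]
    static GLOBAL: System = System;

    fn main() {
        println!("{:?}", vec![1, 2, 3]);
    }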


Also, it’s extremely common to leave libc dynamically linked for operational reasons. It’s my understanding that by default Go statically links everything but libc.


Go doesn't link libc at all, generally.


Not on Linux, with some exceptions (it wants to play nice with nsswitch, after all). It does/has to on macOS, especially if you want your binaries to work on more than a single system release.


On Linux it has avoided libc for nsswitch in the common case since 2015: https://github.com/golang/go/issues/10485

Golang has only very recently (in its lifetime) and grudgingly admitted that operating systems that provide ABI stability at the DSO layer, rather than the Syscall layer, exist. It has been Linux-first for most of its life, and on Linux syscalls are the ABI stability layer. Not so pretty much anywhere else.

On MacOS, Go links libSystem rather than libc. And this was new in 2018: https://golang.org/doc/go1.11#runtime despite having a Mac port since ~2012. Prior to that they ignored the system ABI stability layer and just did raw Darwin syscalls. They still do so on the BSDs, despite this explicitly not being the supported stability interface.


I stand corrected.


Sure. Wasn't suggesting it as the default. Just curious why it wasn't an option.


I always assumed that that's done for portability reasons, since libc deals with a lot of OS-specific details.


Symbol versioning for backwards compatibility is only possible with dynamic libraries, for example. Static linking is of course always backwards compatible for a given binary artifact, but is significantly bloatier and makes it harder to apply security fixes.


I wonder why glibc isn't just using jemalloc when it seems to perform much better than what it currently has.


Under most workloads jemalloc will use much more memory than ptmalloc (glibc).


I had the opposite experience. Jemalloc uses a lot less (virtual) memory, especially with multithreaded applications. The glibc allocator wastes quite a lot of memory mappings per thread and never cleans them up, even if a thread only had a burst of allocations (there is an upstream bug open about it; they don't consider it a leak).
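
A rough way to see the effect (a sketch; thread and allocation counts are made up): spawn a burst of threads that each allocate and exit, then compare what the process still holds according to /proc/self/status under different allocators.

    use std::thread;

    fn main() {
        // Each thread does a short burst of allocation, then exits.
        let handles: Vec<_> = (0..64)
            .map(|_| {
                thread::spawn(|| {
                    let v = vec![0u8; 8 * 1024 * 1024];
                    std::hint::black_box(&v);
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        // Per-thread arenas that never get trimmed show up as retained
        // mappings even though the threads are gone.
        let status = std::fs::read_to_string("/proc/self/status").unwrap();
        for line in status.lines().filter(|l| l.starts_with("Vm")) {
            println!("{line}");
        }
    }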


I was about to say nobody cares about virtual memory usage, but I can see how those who are extremely sensitive to runtime performance might care about page table entries or VMAs. I think most people are more concerned about RSS.


Just wanted to add that I strongly suggest thinking about the correct allocator for your applications and never using "the system allocator". In a production context there should never be "a system allocator" anyway. Uninstall libc in prod. Force yourself to make a rational choice of memory allocator.


> Uninstall libc in prod.

When you start to miss the thrill of it :D


For those curious, Musl's malloc implementation is currently being re-written for higher performance and robustness, see https://github.com/richfelker/mallocng-draft


Curiously, it doesn't adopt the now-standard approach for multithreaded support: per-thread memory pools, allowing one thread allocating and deallocating the same memory to avoid synchronization. This uses one lock guarding allocation, which means that it can be a bottleneck in a multithreaded workload.
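
The kind of workload where a single allocation lock hurts is easy to sketch (thread and iteration counts are arbitrary): many threads doing nothing but allocating and freeing. Building the same program against musl and glibc, or with jemalloc linked in, makes the difference visible.

    use std::thread;
    use std::time::Instant;

    fn main() {
        let start = Instant::now();
        let handles: Vec<_> = (0..16)
            .map(|_| {
                thread::spawn(|| {
                    for i in 0..1_000_000u64 {
                        // Every iteration is one malloc plus one free.
                        let v = vec![i; 32];
                        std::hint::black_box(&v);
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
        println!("elapsed: {:?}", start.elapsed());
    }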


The justifications are partly the same as what Daniel Micay has written extensively on in the rationale for hardened_malloc (https://github.com/GrapheneOS/hardened_malloc) - unsynchronized per-thread state inherently sacrifices global consistency for performance and makes it impossible to detect a lot of types of memory usage errors (DF/UAF, etc.) that could otherwise be caught.

However musl has the additional constraint of being compatible with small/very-low-memory environments. Lack of global consistency inherently means you will end up using memory less efficiently and requesting significantly more from the system. The new malloc about to go upstream in musl is, to my knowledge, the first/only advanced hardened allocator using slab-type design rather than traditional dlmalloc type split/merge, but also designed for extremely low overhead/waste at low to moderate usage rather than extreme performance. And in the vast majority of applications, this is perfectly reasonable. Even Firefox for example does very well with it.

With that said, new malloc is expected to be somewhat faster than old on lots of workloads (and considerably faster than old would be if we fixed the flaws in old that motivated it), but it's not a performance-oriented allocator. If you really want/need that you should probably link jemalloc or similar (and accept all the tradeoffs that come with that). In Rust programs without "unsafe", it may make sense to do that by default.


Thanks for the clear explanation. Looking at the source code, it looks similar to modern allocators, just without the per-thread heaps. (I think all modern allocators use size-class slab allocators for small objects.) Curiously, I don't think the academic community has much literature on hardened allocators. It's been a while since I've worked in the area, but I wasn't aware of any other than DieHard from 2006 [1]. I did some searches on the ACM Digital Library (I love that it's all free right now so I can easily provide links in forums), and the only other thing I could find was FreeGuard from 2017 [2]. Maybe the issue there is that academics who design memory allocators tend to be on the systems side of CS, and such people tend to use raw performance as part of the evaluation. Better security for a new thing does not show up in a graph. (Even that FreeGuard paper from 2017 claims security with better performance.)

In the non-academic world, I found the one we're discussing, but also Scudo (https://llvm.org/docs/ScudoHardenedAllocator.html). And that's it. If I still worked in the area, I would try to go after scalable hardened allocators. I wonder if there's still some clever stuff we haven't thought of there.

[1] https://github.com/emeryberger/DieHard, https://dl.acm.org/doi/abs/10.1145/1133981.1134000

[2] https://github.com/UTSASRG/FreeGuard, https://dl.acm.org/doi/abs/10.1145/3133956.3133957


> However musl has the additional constraint of being compatible with small/very-low-memory environments.

How many threads do these have?

If they only have one thread, they'll use 72x less memory than if they had 72 threads.

The thing is that if you are using 72 threads, you probably would like your application to be 72x faster than if you are using only one. So synchronizing all allocations and killing scalability doesn't solve these users' problems.

Most allocators, including jemalloc, tcmalloc, mimalloc, etc., have a "hardened" mode that people can opt into if they want.

If I'm using Rust like the user in the blog post, double frees are caught at compile-time, so I'd rather not pay for them at run-time.
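
For example (illustrative only), the second free is rejected before the program ever runs:

    fn main() {
        let data = vec![1, 2, 3];
        drop(data);       // first (and only possible) free
        // drop(data);    // error[E0382]: use of moved value: `data`
        // Ownership turns the double free into a compile error, so a runtime
        // double-free check in the allocator buys nothing here.
    }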


Less than two weeks ago, the DragonFly kernel allocator made related improvements for very-high-core-count CPUs.

https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/018...


Do you have any extra readings on the rationale of building their own malloc rather than integrating mimalloc or jemalloc?


Not any first hand, but reading their principles suggests that simplicity and ease of deployment are probably relevant. https://musl.libc.org/about.html


Thanks. Taking on malloc and being competitive with much less code (mimalloc is around 6k, and it is the smallest I know that is still competitive) would be a great feat. Would be interesting to follow the development.


well one of the stated goals of musl is to be simple and correct, and all those mallocs are anything but simple


That's true for jemalloc, but mimalloc is pretty simple. The reference paper is pretty short and really accessible, and IIRC the implementation is around 5kloc. I doubt musl's implementation would be much simpler than this.


5kloc is about 10x larger than musl's existing (old) malloc in source lines. I suspect lots of that is low code density, comments, etc.

I have to look up what exactly mimalloc is/does every time someone mentions it, because the readme/documentation isn't very descriptive except for discussing extensions outside the normal API. I didn't have time to dig through this again today. But I did look at it in some depth on several occasions in the past, and it really wasn't suitable for or comparable to what we're doing in musl.


Thanks for the clarification, I didn't imagine that musl's malloc was that minimalist.

You should definitely have a look at the paper [1], it's only ten pages long! (Excluding benchmarks and references)

[1]: https://www.microsoft.com/en-us/research/uploads/prod/2019/0...


This is a libc implementation. The null hypothesis is that it implements libc, rather than porting a different libc implementation.


We'd be happy to address specific problems on the mailing list. I believe it's a known issue that the Rust compiler is making really heavy use of rapid allocation/freeing cycles, and would benefit from linking a performance-oriented malloc replacement. Doing so is inherently a tradeoff between many factors including performance, memory overhead, safety against erroneous usage by programs, etc.

One statement in your post, which some readers pointed out was apparently added later, "Others have suggested that the performance problems in musl go deeper than that and that there are fundamental issues with threading in musl, potentially making it unsuitable for my use case," seems wrong unless they just meant that the malloc implementation is not thread-caching/thread-local-arena-based. The threads implementation in musl is the only one I'm aware of that doesn't still have significant bugs in some of the synchronization primitives or in cancellation. It's missing a few optional and somewhat obscure features like priority-ceiling mutexes, and Linux doesn't even admit a fully correct implementation in some regards like interaction of thread priorities with some synchronization primitives, but all the basic functionality is there and was written with extreme attention to correctness, and musl aims to be a very good choice in situations where this matters.


...Not the Intel guy, if anyone else had to pause for a second.


I get that a lot!


You still being alive probably helps for disambiguation. :-)


Are you familiar with SwiftOnSecurity on twitter?

Do you have any hobbies that would be out of character for Intel's Andy Grove? I think the world has room for a fictionalized Andy Grove talking about how to cook French pastries, train bonsai, do intermittent fasting, or prepare for a marathon.


One SwiftOnSecurity is already too many.


It's a free country. You're allowed to be wrong.


Swapping out the allocator for jemalloc would be my first try. It's easy to do and often results in better performance. 30x requires some kind of pathological case though.


In a commercial product I worked on, I went against the vendor's advice to try out jemalloc. It took a process whose memory usage grew to 100GB (which took 48 hours to happen) down to staying steady at around 2-4GB, only peaking near 100GB for a few seconds a day.

Same exact code, just with jemalloc swapped in at the command line.


.. where's the profile output?


run a perf trace on both and see what jumps out


You may not even need to go that deep.

Just strace (follow forks) and look at what commands get exec'd.

"Why does musl make my Rust code so slow?" But he's measuring mostly the compiler performance in "cargo build". Is he writing the same amount of data to disk in the same experiments? Seems like there's a lot of opportunity for some shallow investigation to find out more.


If he's benchmarking on docker, I'm not sure that perf works in docker.


Then benchmark outside docker?


Depends on the kernel.perf_event_paranoid sysctl.


Post was not very illuminating. Very little content. It's pretty much an "if musl is slow, it may be the allocator (eom)", which fits in the headline and would have saved me the click.


This is actually someone asking, not an investigation and explanation. There isn't even a lot of due diligence in figuring it out - no profiling or resource usage other than CPU graphs. Also, it's musl combined with Docker causing a 30x slowdown.

If something is running 30x slower from linking in a different libc, I'm guessing it should not be that difficult to narrow down the cause at least a little bit.


Yeah, upvoted. 30x slower on a 48-core (?) system sounds suspiciously like excessive lock contention (or some other shared resource). A non-NUMA-aware allocator (or other code) might also contribute to the issue.

Should be fairly easy to investigate.


He has so many layers in there it's going to be tough to find the problem.

"Ballista is an experimental distributed compute platform, powered by Apache Arrow, with support for Rust and JVM (Java, Kotlin, and Scala)."

Plus he's got Docker, the Rust library, musl, and jemalloc sometimes. There's no application. All this is just infrastructure.

Musl doesn't do much on its own. But it does do stdio buffering. Could it be that the buffering system is making too many I/O calls, like flushing on every write?


Excessive I/O is indeed another common issue. The number of times I've seen a slow system and discovered that an excessive number of flushes, often in something like a logging system, was the root cause... Or a gazillion unbuffered 1-byte writes or reads.
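
In Rust terms the classic shape of this looks like the sketch below (line counts made up). Stdout is line-buffered, so the commented-out variant reaches the OS once per line, while the BufWriter version batches the writes.

    use std::io::{self, BufWriter, Write};

    fn main() -> io::Result<()> {
        let stdout = io::stdout();

        // One write to the OS per line:
        // for i in 0..100_000 {
        //     writeln!(stdout.lock(), "{i}")?;
        // }

        // Buffered: writes reach the OS only when the buffer fills or on flush.
        let mut out = BufWriter::new(stdout.lock());
        for i in 0..100_000 {
            writeln!(out, "{i}")?;
        }
        out.flush()?;
        Ok(())
    }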


Exactly, not sure why you're downvoted. His blog-post doesn't explain why it is slow but instead 'switches the allocator' and says he has "fixed" the issue. This is just a workaround, not a fix.

I'm quite surprised that there's no mention of profiling the actual allocator causing this regression to properly narrow the fault down to the source. Instead this blog post encourages cargo-cult development to "fix slow code".


Yeah, you should be upvoted.

The author put zero effort into figuring things out.


Benchmarking in Docker in general is a mistake I believe.


Why? The only overhead you have in Docker is on syscalls (due to permission checks, namespaces, ...); everything else runs at 100% native speed - unlike assisted virtualization (at least IOMMU overhead plus overhead for anything involving the filesystem) or emulated virtualization (obvious overheads there).


If both sides of the comparison run inside Docker that is still a valid comparison. With Docker the benchmark should be a bit more reproducible by anyone.


There are serious and undocumented problems in Docker. For instance, I spent multiple days investigating an issue in my tests that, I eventually realized, was due to accessing memory returned by an `mmap` syscall on a Docker-mounted FS causing a SIGBUS, but only some of the time.

That is super dangerous and shook my confidence in Docker.


I think it makes sense for software that is intended to run in Docker and frameworks like Kubernetes that use Docker.


You should probably be measuring your app's performance in a production-like environment though.


You should load test in a realistic environment. Comparative benchmarks should be done in a consistent environment.



