Mimalloc Cigarette: Losing one week of my life catching a memory leak (Rust)

hinkley · 2024-08-21T18:04:41 1724263481

We had learned helplessness on a drag and drop bug in jquery UI. I had like three hours every second or third Friday and would just step through the code trying to find the bug. That code was so sketchy the jquery team was trying to rewrite it from scratch one component at a time, and wouldn’t entertain any bug discussions on the old code even though they were a year behind already.

After almost six months, I finally found a spot where I could monkey patch a function to wrap it with a short circuit if the coordinates were out of bounds. Not only fixed the bug but made drag and drop several times faster. Couldn’t share this with the world because they weren’t accepting PRs against the old widgets.

I’ve worked harder on bug fixes, but I think that’s the longest I’ve worked on one.

giancarlostoro · 2024-08-21T18:59:17 1724266757

One of my favorite most elusive bugs was a one liner change. I didn't understand the problem because nobody could reproduce it, or show it. Months later, after my boss told his boss it was fixed, despite never being able to test that it was fixed, I figured it out and fixed it. We had a gift card form, and we stored it in localStorage, if for any reason the person left the tab, and came back months later, it would show the old gift card with its old dated balance, it was a client-side bug. The fix was to use sessionStorage.

arghwhat · 2024-08-21T19:41:27 1724269287

For web, my favorite is JIT miscompilations. A tie between a mobile Safari bug that caused basic math operations to return 0 regardless of input values (basic, positive Numbers, no shenanigans), or a mobile Samsung browser bug where concatenating a specific single-character string with another single-character string would yield a Number.

Debugging errors in JS crypto and compression implementations that only occur at random, after at least some ten thousand iterations, on a mobile browser back when those were awful, and only if the debugger is closed/detached as opening it disabled the JIT was not fun.

It taught me to go into debugging with no assumptions about what can and cannot be to blame, which has been very useful later in even trickier scenarios.

hinkley · 2024-08-21T21:11:36 1724274696

I think you might use “favorite” the way I mean “fun” (if I say fun at work, it’s because we are having none)

A lot of my opinions on code and the human brain started in college. My roommate was washing out and didn’t know it yet. The rules about helping other people were very clear, I was a boy scout but also grade-a bargainer and rationalized so I created a protocol for helping him without getting us expelled. Other kids in the lab started using me the same way.

There were so many people who couldn’t grasp that your code can have three bugs at once, and fixing one won’t make your code behave. Some of those must have washed out too.

But applying the scientific method as you say above is something that I came to later and it’s how I mentor people. If all of your assumptions say the answer should be 3 but it’s 4, or “4” or “Spain”, one of your assumptions is wrong and you need to test them. Odds of being the flaw / difficulty of rechecking. Prioritize and work the problem.

(Hidden variable: how embarrassed you’ll be if this turns out to be the problem)

giancarlostoro · 2024-08-21T20:19:28 1724271568

There's a weird one I ran into, and I for the life of me do not remember which project it was under, but if I open dev tools, the style changes for an element, then if I close dev tools, the style goes back to normal. I never could figure out what the heck is going on. I almost want to blame the viewport size changing just slightly, but I couldn't find a single CSS rule that would make it make sense, and I think even popping it out, it behaved exactly the same. It was frustrating, but I felt like I had to ignore it since no normal user would ever see it, it just made debugging with dev tools confusing.

Edit: In your case, that's where I start print debugging LOL

bch · 2024-08-22T00:48:33 1724287713

This reminds me of finding out (the Hard Way) that Perl’s sort used internally[0] the same $a or $b that I happened to be using.

[0] https://perldoc.perl.org/functions/sort

If the subroutine's prototype is ($$), the elements to be compared are passed by reference in @_, as for a normal subroutine. This is slower than unprototyped subroutines, where the elements to be compared are passed into the subroutine as the package global variables $a and $b (see example below).

hinkley · 2024-08-21T21:17:34 1724275054

println(“1”);

…

println(“2”);

…

println(“wtf”);

contingencies · 2024-08-21T19:33:42 1724268822

It seems in the context of your story the old adage that organizations reproduce software in their own architecture again rings true, with multilayered bureaucracy, lies and promises resulting in "client state".

giancarlostoro · 2024-08-21T20:22:03 1724271723

When I tried to explain to him that I fixed the thing he claimed to have fixed, I heard him hesitantly say it wasn't the same bug. Not sure what he told his boss this time around the fix was for, but I was able to fully reproduce the bug with this fix.

If you can't reproduce a bug, you cannot in my opinion say that it is fixed. If you have to reproduce it via local debugging and changing a value, or hard coding a value, I think you're possibly close, but there's a chance it might not be the case!

hinkley · 2024-08-21T21:19:21 1724275161

The thing with saying “I don’t know” is if you use it judiciously, people believe you more when you say you can do something they think can’t be done.

If he didn’t know he would just say. But he says he does.

giancarlostoro · 2024-08-21T23:32:37 1724283157

He just did not want to keep his boss waiting longer than a week without a "solution" so he forced something to wash his hands of it.

WalterBright · 2024-08-21T20:33:50 1724272430

My longest one was an uninitialized declaration of a local variable, which acquired ever-changing values.

This is why D, by default, initializes all variables. Note that the optimizer removes dead assignments, so this is runtime cost-free. D's implementation of C, ImportC, also default initializes all locals. Why let that stupid C bug continue?

Another that repeatedly bit me was adding a field, and neglecting to add initialization of it to all the constructors.

This is why D guarantees that all fields are initialized.

hinkley · 2024-08-21T21:22:39 1724275359

The first bug I remember writing was making native calls in Java to process data. I didn’t understand why in the examples they kept rerunning the handle dereference in every loop.

If native code calls back into Java, and the GC kicks in, all the objects the native code can see can be compacted and moved. So my implementation worked fine for all of the smaller test fixtures, and blew up half the time with the largest. Because I skipped a line to make it “go faster”.

I finally realized I was seeing raw Java objects in the middle of my “array” and changing the value of final fields into illegal pairings which blew everything the fuck up.

ckocagil · 2024-08-21T22:19:17 1724278757

Valgrind didn't catch it?

WalterBright · 2024-08-22T00:39:46 1724287186

It was before valgrind. Besides, valgrind isn't always available, it requires all code paths to be tested, and it can be really slow making it impractical for some code.

Default initialization, on the other hand, gives 100% coverage. Experience with it in D is a satisfying success.

lttlrck · 2024-08-22T00:19:13 1724285953

May have been impractical to run valgrind eg embedded. A decent compiler should catch this anyway? -Wall?

WalterBright · 2024-08-22T00:47:16 1724287636

Testing all code paths means all code paths must be available to the compiler. This is not always practical.

kibwen · 2024-08-21T16:44:03 1724258643

Level 1 systems programmer: "wow, it feels so nice having control over my memory and getting out from under the thumb of a garbage collector"

Level 2 systems programmer: "oh no, my memory allocator is a garbage collector"

matklad · 2024-08-21T17:16:24 1724260584

The answer is clear: just don’t have a malloc implementation in your process' address space!

thebruce87m · 2024-08-21T17:34:13 1724261653

Welcome to embedded! It’s no heaps of fun!

eschneider · 2024-08-21T19:25:30 1724268330

I'm always surprised how much I don't miss dynamic allocation. :)

hinkley · 2024-08-21T18:21:46 1724264506

> no heaps

Angry upvote

sgt · 2024-08-21T19:40:48 1724269248

pg needs to build that. Hold upvote icon for 5 secs=angry upvote

hinkley · 2024-08-21T21:00:26 1724274026

“I’m doing this as hard as I can”

poikroequ · 2024-08-21T17:23:39 1724261019

A bump allocator is all anyone really needs

mcguire · 2024-08-21T18:47:45 1724266065

"Eh, it'll crash before it runs out of memory."

zokier · 2024-08-21T18:54:42 1724266482

In some cases in a very literal sense (cue story about missiles)

seanthemon · 2024-08-21T16:55:48 1724259348

At the very bottom of everything is a garbage collector..

hinkley · 2024-08-21T18:24:10 1724264650

Soil is just the biggest swap meet in the world. Where every microbe, invertebrate and tree is just looking for someone else’s trash to turn into treasure.

riwsky · 2024-08-21T17:34:42 1724261682

Market forces: the ultimate garbage collector

ckocagil · 2024-08-21T22:06:40 1724278000

"stackoverflow please help me how do i fix memory fragmentation"

amelius · 2024-08-21T16:57:34 1724259454

Level 3 system programmer: "get me out of this straight jacket and give me my garbage collector back so I can get stuff done"

ComputerGuru · 2024-08-21T17:23:06 1724260986

That's not how system programmers think..

troutwine · 2024-08-21T17:41:18 1724262078

I agree. If we were to try and pin a thought process to an additional level of systems programmer it’d involve writing an allocator that’s custom to your domain. The problem with garbage collection for the systems’ case is you’re opting into a set of undefined and uncontrolled runtime behavior which is okay until it catastrophically isn’t. An allocator is the same but with less surface area and you can swap it at need.

amelius · 2024-08-21T19:00:11 1724266811

Meanwhile an OS uses the filesystem for just about everything and it is also a garbage collected system ...

Why should memory be different?

troutwine · 2024-08-21T19:16:28 1724267788

I'm not tracking how your question follows. If by garbage collection you mean a system in which resources are cleaned up at or after the moment they are marked as no longer being necessary then, sure, I guess I can see a thread here, although I think it a thin connection. The conversation up-thread is about runtime garbage collectors which are a mechanism with more semantic properties than this expansive definition implies and possessing an internal complexity that is opaque to the user. An allocator does have the more expensive definition I think you might be operating with, as does a filesystem, but it's the opacity and intrinsic binding to a specific runtime GC that makes it a challenging tool for systems programming.

Go for instance bills itself as a systems language and that's true for domains where bounded, predictable memory consumption / CPU trade-offs are not necessary _because_ the runtime GC is bundled and non-negotiable. Its behavior also shifts with releases. A systems program relying on an allocator alone can choose to ignore the allocator until it's a problem and swap the implementation out for one -- perhaps custom made -- that tailors to the domain.

amelius · 2024-08-21T20:54:15 1724273655

An OS has the job of managing resources, such as CPU, disk and memory.

It is easy to understand how it has grown historically, but the fact that every process still manages its own memory is a little absurd.

If your program __wants__ to manage its own memory, then that is simple: allocate a large (gc'd) blob of memory and run an allocator in it.

The problem is that the current view has it backwards.

201984 · 2024-08-22T01:32:45 1724290365

An OS would have a very hard time determining whether a page is "unused" or not. Normal GCs have to know at least which fields of a data structure contain pointers so it can find unreachable objects. To an OS, all memory is just opaque bytes, and it would have no way to know if any given 8 bytes is a pointer to a page or a 64-bit integer that happens to have the same value. This is pretty much why C/C++ don't have garbage collectors currently.

amelius · 2024-08-22T09:40:13 1724319613

> To an OS, all memory is just opaque bytes, and it would have no way to know if any given 8 bytes is a pointer to a page or a 64-bit integer that happens to have the same value.

This is like saying to an OS all file descriptors are just integers.

201984 · 2024-08-22T11:23:53 1724325833

That's because they are :P

I doubt GC would work on file descriptors either. How could an OS tell when scanning through memory if every 4 bytes is a file descriptor it must keep alive, or an integer that just happens to have the same value?

Not to mention that file descriptors (and pointers!) may not be stored by value. A program might have a set of fds and only store the first one, since it has some way to calculate the others, eg by adding one.

gpderetta · 2024-08-22T15:21:37 1724340097

A gargbage collector need not be conservative. Interestingly linux (and most posix compliant unices I guess) implements, as last resort, an actual tracing file descriptor garbage collector to track the lifetime of file descriptors: as they can be shared across processes via unix sockets (potentially recursively), arbitrary cycles can be created and reference counting is not enough.

LegionMammal978 · 2024-08-21T21:38:25 1724276305

The OS already does that, though? Your program requests some number of pages of virtual memory, and the OS uses a GC-like mechanism to allocate physical memory to those virtual pages on demand, wiping and reusing it soon after the virtual pages are unmapped.

It's just that programs tend to want to manage objects with sub-page granularity (as well as on separate threads in parallel), and at that level there are infinitely many possible access patterns and reachability criteria that a GC might want to optimize for.

PaulDavisThe1st · 2024-08-21T22:09:15 1724278155

AFAIK, no OS uses a "GC-like mechanism" to handle page allocation.

When a process requests additional pages be added to its address space, they remain in that address space until the process explicitly releases them or the process exits. At that time they go back on the free list to be re-used.

GC implies "finding" unused stuff among something other than a free list.

LegionMammal978 · 2024-08-22T00:23:29 1724286209

I was mainly thinking of the zeroing strategy: when a page is freed from one process, it generally has to be zeroed before being handed to another process. It looks like Linux does this as lazily as possible, but some of the BSDs allegedly use idle cycles to zero pages. So I'd consider that a form of GC to reclaim dirty pages, though I'll concede that it isn't as common as I thought.

the-smug-one · 2024-08-21T23:04:25 1724281465

> An OS has the job of managing resources, such as CPU, disk and memory.

The job of the OS is to virtualize resources, which it does (including memory).

jcelerier · 2024-08-23T01:37:35 1724377055

> Meanwhile an OS uses the filesystem for just about everything and it is also a garbage collected system ...

so many serious applications end-up reimplementing their own custom user-space / process-level filesystem for specific tasks because how SLOW can OS filesystems be though

amelius · 2024-08-21T17:30:40 1724261440

A new generation of system programmers is tired of solving the same old boring memory riddles over and over again and no borrow checker is going to help them because it only brings new riddles.

__s · 2024-08-21T17:54:38 1724262878

gc replaces riddles with punchlines

pjmlp · 2024-08-21T18:44:57 1724265897

Those of us that actually used systems programming languages with automatic resource management, do think that way.

Unfortunately science only evolves one funeral at a time.

mike_hearn · 2024-08-21T20:20:51 1724271651

It's how some think. Graal is a full compiler written in Java. There's a long history of JVMs and databases being written in GCd Java. I think you could push it a lot further too. Modern JVM GCs are entirely pauseless for instance.

pebal · 2024-08-22T21:55:43 1724363743

They are not. All Java GC introduce pauses, only short ones. SGCL for C++ is a true pauseless GC.

mike_hearn · 2024-08-23T09:14:59 1724404499

It depends how you define "pause" but no, modern GCs like ZGC and Shenandoah are universally agreed to be pauseless. The work done at the start/end of a GC is fixed time regardless of heap size and takes less than a millisecond. At that speed other latencies on your system are going to dominate GC unless you're on a hard RTOS like QNX.

forrestthewoods · 2024-08-21T17:05:48 1724259948

No. Just no.

For as painful as the debugging story was I have spent vastly more amounts of time working around garbage collectors to ship performant code.

0x457 · 2024-08-21T18:06:19 1724263579

What, you don't like doing GC only N requests (ruby web servers), disabling GC completely during working hours (java stock trading), fake allocating large buffers (go's allocate and don't use trick)?

mike_hearn · 2024-08-21T20:22:46 1724271766

The Java shops you're thinking of didn't disable GC during working hours, they just sized the generations to avoid a collection given their normal allocation rates.

But there were / are also plenty of trading shops that paid Azul for their pauseless C4 GC. Nowadays there's also ZGC and Shenandoah, so if you want to both allocate a lot and also not have pauses, that tech is no longer expensive.

0x457 · 2024-08-21T22:22:36 1724278956

> The Java shops you're thinking of didn't disable GC during working hours, they just sized the generations to avoid a collection given their normal allocation rates.

Well, I just trivialized it. However, in one case in mid 00s, I saw it disabled completely to avoid any pauses during trading hours.

pton_xd · 2024-08-21T18:23:43 1724264623

Ain't nothin' wrong with configuring V8 to have unbounded heap growth, disabling the memory reducer, and then killing the process after a while.

wpollock · 2024-08-21T20:26:44 1724272004

Used to do this in C decades ago. Worked on Unix but I doubt it works on Linux today, unless you disable memory overcommit completely.

gnuvince · 2024-08-21T19:11:50 1724267510

I need to find a pithy way to express "we use a garbage collector to avoid doing manual memory management because that'd require too much effort; but since the GC causes performance problems in production, we have spent more effort and energy working around those issues and creating bespoke tooling to mitigate them than the manual memory management we were trying to avoid in the first place would've required."

habibur · 2024-08-21T19:17:31 1724267851

RAII <-- best of both worlds.

chubot · 2024-08-21T19:22:12 1724268132

If you are talking about C++, it’s nice when RAII works. But if it does work, then in some sense your problem was easy. Async code and concurrent code require different solutions

neonsunset · 2024-08-21T18:51:39 1724266299

I'd wager it was an issue with the language of choice (or its GC) being rather poorly made performance-wise or a design that does not respect how GC works in the first place :)

forrestthewoods · 2024-08-21T20:06:14 1724270774

I’m sure you would! GC is like communism. Always some excuse as to why GC isn’t to blame.

> or a design that does not respect how GC works in the first place

It’s called shipping a 90 Hz VR game without dropping frames.

neonsunset · 2024-08-21T20:27:24 1724272044

Aside from finding the analogy strange and unfortunate, I assume you're talking about Unity, is that correct?

(if that is the case, I understand where the GC PTSD comes from)

o11c · 2024-08-21T23:26:20 1724282780

> I’m sure you would! GC is like communism. Always some excuse as to why GC isn’t to blame.

To be fair, there are about 4 completely independent bad decisions that tend to be made together in a given language. GC is just one of them, and not necessarily the worst (possibly the least bad, even).

The decisions, in rough order of importance according to some guy on the Internet:

1. The static-vs-dynamic axis. This is not a binary decision, things like "functions tend to accept interfaces rather than concrete types" and "all methods are virtual unless marked final" still penalize you even if you appear to have static types. C++'s "static duck typing" in templates theoretically counts here, but damages programmer sanity rather than runtime performance. Expressivity of the type system (higher-kinded types, generics) also matters. Thus Java-like languages don't actually do particularly great here.

2. The AOT-vs-JIT axis. Again, this is not a binary decision, nor is it fully independent of other axes - in particular, dynamic languages with optimistic tracing JITs are worse than Java-style JITs. A notable compromise is "cached JIT" aka "AOT at startup" (in particular, this deals with -march=native), though this can fail badly in "rebuild the container every startup" workflows. Theoretically some degree of runtime JIT can help too since PGO is hard, but it's usually lost in the noise. Note that if your language understands what "relocations" are you can win a lot. Java-like languages can lose badly for some workflows (e.g. tools intended to be launched from bash interactively) here, but other workflows can ignore this.

3. The inline-vs-indirect-object axis - that is, are all objects (effectively) forced to be separate allocations, or can they be subobjects (value types)? If local variables can avoid allocation that only counts for a little bit. Java loses very badly here outside of purely numerical code (Project Valhalla has been promising a solution for a decade now, and given their unwieldy proposals it's not clear they actually understand the problem), but C# is tolerable, though still far behind C++ (note the "fat shared" implications with #4). In other words - yes, usually the problem isn't the GC, it's the fact that the language forces you to generate garbage in the first place.

4. The intended-vs-uncontrollable-memory-ownership axis. GC-by-default is an automatic loss here; the bare minimum is to support the widely-intended (unique, shared, weak, borrowed) quartet without much programmer overhead (barring the bug below, you can write unique-like logic by hand, and implement the others in terms of it; particularly, many languages have poor support for weak), but I have a much bigger list [1] and some require language support to implement. try-with-resources (= Python-style with) is worth a little here but nowhere near enough to count as a win; try-finally is assumed even in the worst case but worth nothing due to being very ugly. Note that many languages are unavoidably buggy if they allow an exception to occur between the construction of an object and its assignment to a variable; the only way to avoid this is to write extensions in native code.

[1] https://gist.github.com/o11c/dee52f11428b3d70914c4ed5652d43f... - a list of intended memory ownership policies. Generalized GC has never found a theoretical use; it only shows up as a workaround.

neonsunset · 2024-08-22T00:18:31 1724285911

re 1. C# dispatch strategy is not Java-like: all methods are by default non-virtual unless specified otherwise. In addition, dispatch-by-generic-constraint for structs is zero-cost, much like Rust generics or C++ templates. As of now, neither OpenJDK nor .NET suffer from virtual and interface calls to the same extent C suffers from manually rolled vtables or C++ suffers from virtual calls. Because both OpenJDK/GraalVM and .NET have compilers that are intimately aware of the exact type system they are targeting which enable advanced devirtualization patterns. Notably, this also works as whole-program-optimization for native binaries produced by .NET's nativeaot.

re 4. there is some understanding gap in programming community to the kind of constraints imposed by lifetime analysis on dynamicity allowed by JIT compilation, which comes at a tradeoff of being able to invalidate previous assertions about when the object or struct truly no longer referenced, whether it escapes or else - you may be no longer able to re-JIT the method, attach a debugger or introduce some other change. There is still also lack of understanding where the cost of GC comes from and how it compares to other memory management techniques, or how it interacts with escape analysis (which in many ways resembles static lifetime analysis for linear and affine types), particularly so when it is inter-procedural. I am saying this as a response to "GC-by-default is an automatic loss" which sounds overly generalized "GC bad" you get used to hearing from audience who never looked at it with a profiler.

And lastly - latency-sensitive gamedev and predictability tends to come with completely different set of constraints to regular application code, and tends to require comparable techniques regardless of the language of choice provided it has capable compiler and GC implementations. It greatly favours low or schedulable STW pause GC (pause-less-like and especially non-moving designs tend to come with very ARC-like synchronization cost and low throughput (Go) or significantly higher heap sizes over actively used set (JVM pauseless GC impls. like Azul, maybe ZGC?), ideally with some or most collection phases being concurrent that performs best at moderate allocation rates. In the Unity case, there are quite a few poor quality libraries, as well as constraints of Unity specifically in regards to its rudimentary non-moving GC, which did receive upgrades for incremental per-frame collection but still would cause issues in scenarios where it cannot keep up. This is likely why the author of the parent comment is so up and arms about GC.

However, for complex frequently allocated and deallocated object graphs that do not have immediately observed lifetime constrained to a single thread, good GC is vastly superior to RC+malloc/free and can be matched by manually managing various arenas at much greater complexity cost, which is still an option in a GC-based language like C# (and is a popular technique in this domain).

forrestthewoods · 2024-08-22T01:48:20 1724291300

> I assume you're talking about Unity, is that correct

That particular project was Unity. Which, as you know, has a notoriously poor GC implementation.

It sure seems like there are a whole lot more bad GC implementations than good. And good ones are seemingly never available in my domains! Which makes their supposed existence irrelevant to my decision tree.

> good GC is vastly superior to RC+malloc/free

Ehhh. To be honest memory management is kind of easy. Memory leaks are easy to track. Double frees are kind of a non-issue. Use after free requires a modicum of care and planning.

> and can be matched by manually managing various arenas at much greater complexity cost, which is still an option in a GC-based language like C# (and is a popular technique in this domain).

Not 100% sure what you mean here.

I really really hate having to fight the GC and go out of my way to pool objects in C#. Sure it works. But it defeats the whole purpose of having a GC and is much much more painful than if it were just C.

Arnavion · 2024-08-21T19:47:03 1724269623

jemalloc also has its own funny problem with threads - if you have a multi-threaded application that uses jemalloc on all threads except the main thread, then the cleanup that jemalloc runs on main thread exit will segfault. In $dayjob we use jemalloc as a sub-allocator in specific arenas. (*) The application itself is fine in production because it allocates from the main thread too, but the unit test framework only runs tests in spawned threads and the main thread of the test binary just orchestrates them. So the test binary triggers this segfault reliably.

( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what the title says, it's not Windows-specific.)

(*): The application uses libc malloc normally, but at some places it allocates pages using `mmap(non_anonymous_tempfile)` and then uses jemalloc to partition them. jemalloc has a feature called "extent hooks" where you can customize how jemalloc gets underlying pages for its allocations, which we use to give it pages via such mmap's. Then the higher layers of the code that just want to allocate don't have to care whether those allocations came from libc malloc or mmap-backed disk file.

CraigJPerry · 2024-08-21T19:54:08 1724270048

Tangent: what’s the ideal data structure for this problem?

If there were 20million rooms in the world with a price for each day of the year, we’d be looking at around 7billion prices per year. That’d be say 4Tb of storage without indexes.

The problem space seems to have a bunch of options to partition - by locality, by date etc.

I’m curious if there’s a commonly understood match for this problem?

FWIW with that dataset size, my first experiments would be with SQL server because that data will fit in ram. I don’t know if that’s where I’d end up - but I’m pretty sure it’s where I’d start my performance testing grappling with this problem.

jrpelkonen · 2024-08-21T22:51:28 1724280688

I think your premise is somewhat off. There might be 20 million hotel rooms in a world, but surely they are not individually priced, e.g. all king bed rooms in a given hotel have the same price per given day.

loeg · 2024-08-21T17:03:15 1724259795

Sort of tl;dr: mimalloc doesn't actually free memory in a way that it can be reused on threads other than the one that allocated it; the free call marks regions for eventual delayed reclaim by the original thread. If the original thread calls malloc again, those regions are collected (1/N malloc calls). Or (C) you can explicitly invoke mi_collect[1] in the allocating thread (the Rust crate does not seem to expose this API).

[1]: https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...

Arnavion · 2024-08-21T19:40:04 1724269204

The mimalloc crate just provides the GlobalAlloc impl that can be registered with libstd as the global allocator using the `#[global_allocator]` attr.

The underlying sys crate provides the binding for mimalloc API like `mi_collect`: https://docs.rs/libmimalloc-sys/0.1.39/libmimalloc_sys/fn.mi...

rurban · 2024-08-22T07:04:37 1724310277

The Annotated C++ Reference Manual:

“C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”

IceTDrinker · 2024-08-21T18:18:32 1724264312

PSA: do not use floating point for monetary amounts

SAI_Peregrinus · 2024-08-21T20:07:12 1724270832

MS Excel uses floating point, and it's used a ton in finance. Don't use floating-point for monetary amounts if you don't know what rounding mode you've set.

koverstreet · 2024-08-21T20:33:23 1724272403

It's somewhat acceptable with double precision floats - never single precision floats.

But far better to just use integer cents.

IX-103 · 2024-08-21T23:06:38 1724281598

Integer cents implies a specific rounding mode (truncation). That's probably not what you should be using. Floating point cents gets the best of both worlds (if you set the right rounding mode).

nurettin · 2024-08-21T21:25:13 1724275513

I have used single precision floats in my latest project just to disprove this baloney.

smh · 2024-08-22T11:34:50 1724326490

You are using 32 bit floats to represent money?

Does your project correctly calculate $300,000.00 + $0.01, (or even just correctly represent the value $300,000.01) and if so, how?

nurettin · 2024-08-22T15:38:12 1724341092

Obviously you can't accumulate cent by cent. You can't even safely accumulate by quarter. Epsilon is too large to do that. I calculate cumulative pnl using std::fma, then multiply AUM with that and round to cents. It's good enough for backtesting, and it shaves a bunch of seconds off the clock.

smh · 2024-08-22T16:47:57 1724345277

I see - I guess it's a financial modelling program or similar where the quantities don't represent precise values of money. I was imagining some kind of accounting-like app that would need to be reconciled with real-world balances.

nurettin · 2024-08-22T19:37:11 1724355431

Yes, we work with futures and the largest transaction is around 1000 qty. It is so rare that most banks have a fat finger alert for such quantities.

zokier · 2024-08-21T18:52:56 1724266376

I wonder if there is something that could be done on language design level to have better "sympathy" to memory allocation, i.e. built upon having mmap/munmap as primitives instead of malloc/free; where language patterns are built around allocating pages instead of arbitrarily sized objects. Probably not practical for general high-level languages, but for e.g. embedded or high-performance stuff might make sense?

PaulDavisThe1st · 2024-08-21T22:13:41 1724278421

This seems to fail to understand that we already have both levels.

Every OS will provide some mechanism to get more pages. But it turns out that managing the use of those pages requires specialized handling, depending on the use case, as well as a bunch of boilerplate. Hence, we also have malloc and its many, many cousins to allocate arbitrary size objects.

You're always welcome to use brk(2) or your OS's equivalent if you just want pages. The question is, what are you going to do with each page once you have it? That's where the next level comes in ...

eschneider · 2024-08-21T19:28:44 1724268524

In general for embedded, you don't page memory even if you're running something like embedded linux.

For high performance stuff where you need low, predictable latency, you're probably not going to want to use dynamic memory at all.

loeg · 2024-08-21T19:31:54 1724268714

Not exactly what you're getting at, but you could maybe imagine an explicit version of malloc where allocations are destined either for thread-local only use, or shared use. Then locally freeing remote thread-local memory is an invalid operation and these kinds of assume-locality optimizations are valid on many structures. I think you can imagine a version of mmap that allows for thread-local mappings to help detect accidental misuse of local allocation.

bsder · 2024-08-21T23:11:38 1724281898

Zig passes allocators around explicitly. There is no implicit memory allocator.

The downside is that it makes things like "print" a pain in the ass.

The upside is that you can have multiple memory allocators with hugely different characteristics (arena for per frame resources, bump allocator for network resources, etc.).

dathinab · 2024-08-21T20:19:37 1724271577

most modern memory allocators use internally mmap, this is why it most times makes sense to not use the system allocate for long running programs

Generally given that page size isn't something you know at compiler (or even install size) and it can vary between each restart and it being between anything between ~4KiB and 1GiB and most natural memory objects being much less then 4KiB but some being potentially much more then 1GiB you kind don't want to leak anything related to page sizes into your business logic if it can be helper. If you still need to most languages have memory/allocation pools you can use to get a bit more control about memory allocation/free and reuse.

Also the performance issues mentioned have not much to do with memory pages or anything like that _instead they are rooted in concurrency controls of a global resource (memory)_. I.e. thread local concurrency syncronization vs. process concurrency synchronization.

mainly instead of using a fully general purpose allocator they used an allocator whiche is still general purpose but has a design bias which improves same-thread (de)allocation perf at cost of cross thread (de)allocation perf. And they where doing a ton of cross thread (de)allocations leading to noticeable performance degradation.

The thing is even if you hypothetically only had allocations at sizes multiple of a memory page or use a ton of manual mmap you still would want to use a allocator and not always directly free freed memory back to the OS as doing so and doing a syscall on every allocation tends to lead to major performance degradation (in many use cases). So you still need concurrency controls but they come at a cost, especially for cross thread synchronization. Even just lock-free controls based on atomic have a cost over thread local controls caused often largely by cache invalidation/synchronization.

PaulDavisThe1st · 2024-08-21T22:16:50 1724278610

A perfect demonstration of how many of harder problems we face writing (especially non-browser-based) software are in fact not addressed by language changes.

The concept of memory that is allocated by a thread and can only be deallocated by that thread is useful and valid, but as TFA demonstrates, can also cause problems if you're not careful with your overall architecture. If the language you're using even allows you to use this concept, it almost certainly will not protect you from having to get the architecture corect.

the-smug-one · 2024-08-21T23:03:08 1724281388

I think Rust's language design is in part to blame, as it does not force the programmer to think sufficiently of the layout of the memory, instead allowing them to defer to a "global allocator".

PaulDavisThe1st · 2024-08-21T23:44:16 1724283856

This identical problem could easily occur in a C or C++ codebase.

the-smug-one · 2024-08-22T04:22:04 1724300524

I never said that C and C++ doesn't suffer from the same design problem? I'd say that Zig is the best in class here, typically forcing you to pass along an allocator to each data structure. C is a bit better than C++, as it uses an allocator explicitly, while C++ relies on new/delete with a default impl calling malloc/free.

Still a language design issue: C++ and Rust doesn't put allocation concerns front and center, when they very much are. Not encouraging thinking about these things is very bad for systems languages.

PaulDavisThe1st · 2024-08-22T18:25:38 1724351138

This issue isn't really about per-structure allocators at all.

It's about the idea that you are using per-thread allocators, and one of your threads allocates a lot of memory, then goes to sleep for a long time.

Per-thread allocators are orthogonal to per-structure allocators.

the-smug-one · 2024-08-31T11:43:59 1725104639

I think you misunderstand what I'm saying, as what I'm describing allows for per-thread allocators.

znpy · 2024-08-21T18:26:31 1724264791

> Allocators have different characteristics for a reason - they do some things differently between each other. What do you think mimalloc does that could account for this behavior?

Interestingly, it would seem that Java programmers play with garbage collectors while Rust programmers play with memory allocators.

sbt567 · 2024-08-23T06:55:38 1724396138

> Rust programmers

*system

malkia · 2024-08-21T21:55:28 1724277328

In C++, your https://en.cppreference.com/w/cpp/memory/new/new_handler should call mi_collect.

Exuma · 2024-08-21T19:40:12 1724269212

I really love the design of this blog

bsder · 2024-08-21T23:06:48 1724281608

Welcome to systems programming. Allocators are invisible--until they aren't.

om8 · 2024-08-21T20:18:27 1724271507

TLDR: use shitty allocators, win shitty memory leaks

akira2501 · 2024-08-21T18:57:05 1724266625

[flagged]

Patryk27 · 2024-08-21T19:12:55 1724267575

Not sure why all the hostility here - you haven't seen the code, know nothing about the domain and yet you're certain that our performance requirements are false and that "code base got sacrificed" (apparently by adding two lines of rather self-explanatory code?)

Feels like you've just read grugbrain.dev and decided to shoot your golden tips at everybody without actually trying to understand the situation.

Anyway, there's one good point here:

> why is your allocator in this path, then?

Because those prices change 24/7/365, million times a day, and so refreshing happens pretty much all the time in the background, eating CPU time. What's more, calculating prices is much more complicated than a hashmap lookup - hotels can have dynamic number of discounts, taxes etc., and they can't all be precomputed (too many combinations).

You know, not all complexity is made up, a little trust in others won't hurt.

Brentward · 2024-08-21T20:54:36 1724273676

I agree that the comment sounded pretty hostile, but I also agreed with the assessment that it might be better to avoid allocation in general. I know the code in the post is highly simplified, but you aren't exactly fully using "lazy iterators," at least in the post. refresh/load_hotels is still allocating new Hotels, each of which contains a one-million-element Vec. If you could reuse the existing self.hotels[id].prices Vec allocations, that might help, again at least in the example code.

On second glance, I guess this is what you're getting at with the ArcSwap comment in the post. It sounds like you really do want to reallocate the whole State and atomically swap it with the previous state, which would make reusing the prices Vec impossible, at least without keeping around two States and swapping between them.

Anyway, tone aside, I still think the comment had validity in a general optimization sense, even if not in this specific case.

Patryk27 · 2024-08-21T21:42:02 1724276522

Yeah, the example is not representative - in reality there's a couple of smaller vectors (base prices, meal prices, supplements, discounts, taxes, cancellation policies, ...) + lots of vectors containing indices (`for 2018-01-01, match prices #10, #20, #21`, `for 2018-01-02, match prices #11, #30`, ...), and they can't be actually updated in-place, because that would require using `RwLock`, preventing the engine from answering queries during refreshing.

(which is actually how it used to work a couple of years ago - ditching `RwLock` for `ArcSwap` and making the entire engine atomic was one of the coolest changes I've implemented here, probably worth its own post)

scottlamb · 2024-08-21T22:21:23 1724278883

Makes perfect sense to me for the updates to happen atomically and avoid causing lock contention, even if that makes the loader more allocation-happy than it'd be otherwise. I've done similar things before.

What about the query path? Your post talked about 10% improvement in response latency by changing memory allocators. That could be due to something like one allocator making huge page use more possible and thus vastly decreasing TLB pressure...but it also could be due to heavy malloc/free cycles during the query getting sped up. Is that happening, and if so why are those allocations necessary? Ignoring the tone, I think this is more what akira2501 was getting at. My inclination would be to explore using per-request arenas.

Patryk27 · 2024-08-22T10:16:29 1724321789

Per-request arenas are a nice idea, yeah - we haven't pursued the optimizations more, because we're satisfied with the current speed, but certainly there are some things that could be improved.

urbandw311er · 2024-08-21T19:42:22 1724269342

Out of interest, is there no way to rearchitect the whole thing to be event-based, ie more like a producer-consumer situation? Or do you have to loop back through all the source data and poll every hotel to fetch its current prices?

Patryk27 · 2024-08-21T19:47:53 1724269673

Sure, it is producer-consumer - we use Postgres' LISTEN/NOTIFY mechanism (mostly because we have no other use cases for queueing, so "exploiting" an already existing feature in Pg was easier).

The example in article shows all hotels getting refreshed, but that's just because that's the quickest way to reproduce the problem locally. In reality we refresh/reindex only those hotels which have changed since the last refresh - over day(s) this accumulates and the OOM was actually happening not immediately on the first refresh, but after a couple of days (which is part of what made it difficult to catch).

akira2501 · 2024-08-21T20:31:17 1724272277

> Feels like you've just read grugbrain.dev and decided to shoot your golden tips at everybody without actually trying to understand the situation.

This feels like you've taken this personally or are projecting.

I'm not sure if you posted this to Hacker News as some sort of marketing exercise, but this is _Hacker_ News, there should be an expectation that some people are going to take a critical view at your post.

> Because those prices change 24/7/365, million times a day, and so refreshing happens pretty much all the time in the background, eating CPU time. What's more, calculating prices is much more complicated than a hashmap lookup - hotels can have dynamic number of discounts, taxes etc., and they can't all be precomputed (too many combinations).

I get that you have to recalculate things, I'm still not entirely sure how you ended up with 10% of your overhead being in malloc while doing it. That's pretty unusual, and almost everywhere, would be considered a code smell.

> a little trust in others won't hurt.

That's not why I'm here and I'm very sure that's not why you've posted this here either.