My experience with out-of-memory conditions is that in every language and environment I've worked in, once an application hits that condition there is very little hope of continuing reliably.
So if your aim is to build a reliable system, it is much easier to plan to never get there in the first place.
Alternatively, make your application restart after OOM.
I would actually prefer the application to just stop immediately without unwinding anything. It makes it much clearer as to what possible states the application could have gotten itself to.
Hopefully you have already designed the application to move atomically between known states and have a mechanism to handle any operation getting interrupted.
If you did it right, handling OOM by having the application drop dead should be viewed as just exercising these mechanisms.
You can increase the probability of being able to survive an OOM condition long enough to emit diagnostic info in most memory-managed languages (Java, Python, etc.) fairly easily. Allocate (and write data to) a global fixed-size chunk of ballast memory at program startup, and whenever an allocator failure is detected, immediately free that block and emit whatever diagnostics you need. simcop2387 pointed out that Perl even supports this as an interpreter setting; in other languages I've done it manually.
This is more useful than OOM'd programs silently disappearing, but isn't foolproof (that's why I said "increase the probability" rather than "guarantee"): if the OOM killer gets invoked before your telemetry, if you've forked, or if whatever your language does in order to even detect an allocator error is itself memory-costly (looking at you, Python--why on earth would you allocate in order to construct a MemoryError object?!), you will still go down hard.
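To illustrate the shape of it, a minimal sketch in Rust (the names and the size are made up; unlike Java/Python you'd also need a fallible allocation path or an alloc-error hook to actually observe the failure):

    use std::sync::Mutex;

    // Ballast we can release to make room for diagnostics on the way down.
    static BALLAST: Mutex<Option<Vec<u8>>> = Mutex::new(None);

    fn init_ballast(bytes: usize) {
        // Write non-zero data so the pages are actually committed, not just reserved.
        *BALLAST.lock().unwrap() = Some(vec![1u8; bytes]);
    }

    fn on_alloc_failure(context: &str) {
        // Free the ballast first, then emit whatever diagnostics we still can.
        BALLAST.lock().unwrap().take();
        eprintln!("allocation failure in {context}; dumping diagnostics before exit");
    }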
Other than going back in time and reversing Linux's original sin (allocation success is a lie), swap is the "solution" to these situations, but for many people that cure is worse than the disease.
I often wish there was a generally-available way to map memory to files (rather than mmaping files to memory) selectively in my programs. For cases like this--handling OOM conditions and doing cleanup/reporting before turning out the lights, when the OOM condition was caused either by code in my program outside of my control or by other programs on the system outside of my control--having an "all allocations after this instant should occur in a file on disk, not memory" bit to flip would be nice. I don't need my cleanup to be fast; if this is happening I'm already on a sad enough path that I'm happy to trade shutdown/crash performance for fidelity of diagnostic data and increased probability of successful orderly cleanup.
Of course, not being an OS developer, I assume this is impossible for reasons outside of my understanding; the few complicating factors I can think of (copy-on-write pages, for example) are hard enough, and I'm sure there are others.
Strongly disagree. First of all, we're talking about minuscule (usually KB or less) amounts of memory here. Secondly, I'm not talking about over-allocating so programs can run in perpetuity; rather, I'm talking about keeping some small ballast around so cleanup and reporting can happen right before my program crashes/halts (think: the kind of code you'd put in atexit hooks). That's a better experience for operators and product maintainers than "your app disappeared without a trace because the OOM killer came for it" or "your app disappeared without a trace because its ENOMEM behavior was abort()".
This is absolutely not why Java had/has a bad rap memory-wise--that has more to do with a combination of code that allocates irresponsibly on the happy (not memory-error-anticipating) path, and the JVM's preference to preallocate as much as possible for every purpose, not just OOM handling.
> Just make use of proper libraries that will invalidate caches if needed
What if I've dumped every cache I can and I'm still getting allocation (or spawn, or whatever) errors because the system is out of memory? It's useful to have a contingency in place to turn out the lights room-by-room rather than cutting power to the whole building, as it were.
Nope, just crash ballast. I understand what you're condemning though: it's super regrettable when ordinary utilities, daemons, or desktop apps preallocate like they're database servers on dedicated hardware.
I have the same experience in C applications, but not in Rust. Rust's problems are IMHO purely self-imposed from unnecessarily nihilistic assumptions.
• There's no problem of "untested error paths". Rust has automatic Drop. Cleanup on function exit is written by the compiler, and thus quite dependable. Drop is regularly exercised on normal function exits too.
• The overwhelming majority of Drop implementations only free data and don't need to allocate anything. Rust is explicit about allocations, so it's quite feasible to avoid them where necessary (e.g. if you use a fat error type that collects backtraces, that can bite on OOM. But you can use enums for errors, and they are plain integers).
• The probability of hitting OOM is proportional to allocation size, so your program is most likely to hit OOM when allocating its largest buffers. That leaves a lot of RAM free to recover with. It's basically impossible to exhaust memory down to the last byte.
Crashing may be fine for tiny embedded software or small single-threaded utilities. However, crashing servers is expensive, and may not even work. When you crash, you kill all the requests that were in progress. This wastes work already done by other threads, and you're not making progress. When clients retry, you're likely to get the same set of requests that led to OOM in the first place, so you may end up crashing and restarting in a loop forever. OTOH if you detect OOM and reject the one offending request, you can keep making progress with other, smaller requests.
The compiler can only do a limited number of things, and I am not talking about it.
Think in terms of trying to save the file you are working on before the application exits, or notifying the cluster map that the node is not going to be available to process requests.
You don't exit. You work with fallible functions, and keep trying to handle errors for as long as possible.
Each function can fail, and when it fails, you propagate the error up to a point where the whole failing task can be gracefully cancelled.
e.g. if the user invokes a "Print" command, and it runs out of memory, then instead of immediately crashing and burning, you can try to report "Sorry, print failed". That too will be handled fallibly, so if `message("Sorry…")?` fails, then you proceed to plan B, which may be log, then save and quit. If these fail, then finally crash and burn. But chances are that maybe print preview needed to allocate lots of memory, and other functions don't need as much, so your program will survive just by aborting that one operation.
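Concretely, the shape of it in Rust (just a sketch; the render/report names are made up, and on Linux with overcommit the reservation may of course "succeed" anyway):

    use std::collections::TryReserveError;

    // Hypothetical render step: asks for its big buffer fallibly instead of
    // going through the aborting allocation path.
    fn render_pages(len: usize) -> Result<Vec<u8>, TryReserveError> {
        let mut buf: Vec<u8> = Vec::new();
        buf.try_reserve_exact(len)?; // Err on allocation failure, no abort
        buf.resize(len, 0);
        Ok(buf)
    }

    fn on_print_command() {
        match render_pages(2_usize * 1024 * 1024 * 1024) {
            Ok(pages) => println!("printing {} bytes", pages.len()),
            // The huge buffer was never committed, so this small message will
            // almost certainly succeed; if even that fails, plan B is log,
            // then save and quit.
            Err(e) => eprintln!("Sorry, print failed: {e}"),
        }
    }

    fn main() {
        on_print_command();
    }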
Build perl with -DUSEMYMALLOC and -DPERL_EMERGENCY_SBRK, then you can preallocate a buffer by doing $^M="0" x 65536; then you can trap the out-of-memory condition with the normal facilities in the language and handle it appropriately (mostly letting the big data get deallocated, or exiting). Then you can continue on just like normal. It's a weird setup and I don't think I've run into any other language with that built in.
The interesting thing is that the OOM killer doesn’t always go for the program that triggered the OOM. It may also decide to kill another memory-hungry process (cough database) on the machine unless you explicitly tweaked it.
I’ve had the OOM killer kill my work queue because someone wrote a shell script that leaked small amounts of memory. The entire purpose of that machine was to run the queue, so most of the memory was allocated to that process - making it the prime target for the OOM killer. Yes, please, kill everything else on that machine before touching that process. (Indeed, the OOM killer has a setting for that.)
Sounds bad, but it doesn't justify the proposed heuristic to kill every new process that fails to alloc without looking at the whale that's responsible for the OOM condition. Actually, it sounds like you didn't properly configure that machine.
You might be happy to learn that the OOM killer can be configured[1] to specifically protect certain processes. If the entire point of a machine is to run a single process, then you should definitely use that feature.
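For reference, the knob is /proc/<pid>/oom_score_adj; a minimal sketch of a process opting itself out (lowering the score needs CAP_SYS_RESOURCE or root):

    use std::fs;

    fn main() -> std::io::Result<()> {
        // -1000 effectively exempts this process from the OOM killer;
        // 0 is the default and +1000 makes it the preferred victim.
        fs::write("/proc/self/oom_score_adj", "-1000")?;
        Ok(())
    }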
I mean, the OOM killer's heuristics are byzantine to be sure. However, if your program is not likely to be the "true" culprit of memory exhaustion, there are better tools at your disposal than ballast pages--cgroups and tunable oomkillers like earlyoom (https://github.com/rfjakob/earlyoom).
On the other hand, if you are likely to be identified as the culprit, I think the best you can hope for is getting some cleanup/reporting in before you're kill-9'd.
I think you've identified the primary exception to the parent comment's rule. Obviously we would not want to kill PID 1, no matter the reason for the OOM.
In this case (a cybersecurity application), there is a complicating factor: you cannot trust that the environment is trustworthy. The out-of-memory condition may have been caused by malware to disable or degrade the antivirus sensor. For example, the malware allocates just enough memory that further OS allocations fail, but it has enough memory to accomplish its task, then terminates to allow normal system processing to resume.
Under that scenario, the antivirus sensor should be able to take some action (log, likely) that a malloc failed, and possibly even try to recover memory and identify the risk.
I did it at least once for a credit card terminal.
The application saved state to internal database, and even if you cycled power it would just come back to the same screen and same state it was before power cycle.
This wasn't to deal with memory problems (in fact, it had no dynamic memory allocation at all) but rather to deal with some crappy external devices and users that would frequently power cycle things if it didn't progress for more than a couple of seconds.
Why do so many developers assume that they are the perfect ones who never make mistakes? They think that their programs have no bugs despite the overwhelming evidence that all programs have bugs. But no. Not theirs. Literally all other programmers are idiots and only they write perfect code.
I am not sure how you can make a user-friendly application like that.
How about an image viewer that tries to open too large an image: should it just crash when it OOMs? I would much prefer an error dialog and for the program to continue running.
For big infrequent allocations (like your example of loading a huge image) it is easy to use a non-default allocator that returns an error rather than aborting.
As the article notes, this is all moot if you are running on Linux though. The allocation will always succeed but if it's too big the kernel will start killing processes, quite possibly including your image viewer.
On Linux allocations can fail in some situations, for example when a memory rlimit is set for the process, or when the cgroup that contains the process runs out of memory (so if you run the program in a container with constrained memory).
You can look at the SQLite codebase for good examples of this. Essentially, you will have to introduce a fixed upper bound for the file size you are willing to handle, given your minimum system requirements. This way, you can actually test that your program gracefully handles OOM conditions.
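As a trivial sketch of such a bound (the constant and the error handling are just placeholders):

    use std::io::{Error, ErrorKind, Result};
    use std::path::Path;

    // A bound derived from the minimum system requirements, so the rejection
    // path can actually be exercised in tests regardless of the host's RAM.
    const MAX_INPUT_BYTES: u64 = 256 * 1024 * 1024;

    fn check_input_size(path: &Path) -> Result<()> {
        let len = std::fs::metadata(path)?.len();
        if len > MAX_INPUT_BYTES {
            return Err(Error::new(
                ErrorKind::InvalidInput,
                format!("{} is {len} bytes, limit is {MAX_INPUT_BYTES}", path.display()),
            ));
        }
        Ok(())
    }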
Fixed upper limits are helpful but do not prevent OOM. Perhaps the system has very little free memory due to other programs, limited hardware, or other reasons, but your program is intended to load large files on systems that do have the memory, so the upper limits are pretty high.
It takes very little memory to show the user an error message; it will almost always succeed even if the operation that triggered it failed due to OOM.
Of course, it's not a guarantee. But if your program is under heavy pressure from other programs, it won't continue to run in any reasonable sense either, even without Linux's OOM-killer. Yes, you can show an error message, but then the only safe thing to do is quit. You won't even be able to clean up any on-disk swap files properly, let alone handle user interactions.
Handling this is hard in desktop applications. For servers, you can have known workloads with better limits on each process.
That is what Linux's OOM killer does, yes. In Linux you can handle the problem by spawning off a subprocess and watching to see why it dies. Not an acceptable result for your antilock braking system firmware.
Why would users be aware of what happened? Did you tell them?
This is very reliable and easy for the user and other developers to understand:

    try
    {
        image = LoadImage(path);
    }
    catch (OutOfMemoryException e)
    {
        Msgbox("Cannot load image: it is too large for available memory");
    }
That is way more friendly than a program crash, and it allows the user to try again, perhaps with a smaller version of the image because they accidentally picked the high-res version or something.
Either way you may get 100% CPU: either the program crashes and memory gets reclaimed, or the program continues to run and the garbage collector reclaims it.
In a GC'd language, which is what I am used to, the GC will pause execution and reclaim memory on the next allocation if there isn't any available, cleaning up whatever was allocated during the attempted image load.
After that, if there is still not enough memory for the simple message box, then you have an uncaught exception, the program crashes, and you're just back to the no-catch approach; nothing lost. Most likely, though, there will be enough memory for the message and you get a much more friendly result.
I have used this approach before and it is way, way more friendly than a crash, with users potentially losing work because they accidentally picked the wrong file.
> After that if there is still not enough memory for the simple message box (...)
That's exactly the point. Once memory is exhausted, you can't take any action reliably.
You don't build reliable applications with mechanisms like: "OK, if memory runs out, let's design it to show a box to the user; we have a 50% chance this succeeds."
It would be better to wrap your application in a script that detects when the application quit and only then shows a message to the user.
People designing things like you drive me crazy. They come up with a huge number of contingencies that just don't work in practice when push comes to shove.
Stuck in a loop trying to show a widget, using 100% of the CPU and preventing me from doing anything?
>That's exactly the point. Once memory is exhausted, you can't take any action reliably.
Nothing is 100% reliable, that's not realistic, and in this kind of situation it is not a 50/50 shot; it's more like 10,000 to 1 that you will be able to show a message, which should be obvious.
>It would be better to wrap your application in a script that detects when the application quit and only then shows a message to the user.
That's simpler than wrapping the specific function in a "script" that shows a message to the user but allows the program to continue to function in almost all situations?
>People designing things like you drive me crazy. They come up with a huge number of contingencies that just don't work in practice when push comes to shove.
People like you who do not value the user experience over pure code drive me crazy. These things actually do work in practice; I guarantee you any sufficiently complex GUI program will have code like this to try and gracefully handle as many contingencies as possible before simply crashing. Do you think your browser should just crash, losing other tabs, if a web site loads too much data? Does Photoshop just crash if it runs out of memory on an operation, losing all your work?
>Stuck in a loop trying to show a widget, using 100% of CPU and preventing me to do any action?
It's no more stuck than your process crashing while the kernel reclaims memory; they both take similar CPU time. One, however, results in a message that informs you of what happened and leaves you with a running program; the other tells you nothing and your program is now gone.
A program that can handle an OOM and continue to function normally is more reliable to the user than one that crashes and must be restarted, potentially losing work.
Memory allocation can fail, network connections are not reliable, opening a file may fail, writing a file may fail, and so on. If your program simply crashed because some operation failed, it would only be reliable at crashing.
If you try an operation that causes an OOM by, say, allocating a large amount of memory as in the example given above, then once that memory is freed you are functioning normally again.
If you write a large file to disk, filling it to capacity and failing, then delete the file, you have free space again and everything is normal.
This is the kind of advice that I put in a category of "easier said than done".
The trouble is that you may not know beforehand what exactly is going to be needed. You might need a library call, and the library does dynamic allocation internally, and you either have no idea about it (until you find out the hard way) or no way to help it.
So in the end maybe you can take some extremely simple action like writing something to log or show a widget, but that's about it.
Just code as normal and when you run into a problem, reassign the default allocator for the struct/object? Not possible for many PLs, hard for some but basically trivial (1 LOC per struct/object) for others.
Some PLs come with test suites that give you a failing allocator, so you can even easily test to make sure this looping condition resolves sanely.
Preallocate that? If it's modal, you should never have more than one, and the allocator associated with the msgbox should know to use that scratch space instead of fetching new memory.
It depends on the application. On Android, for example, I don't want my app to crash with an OOM while trying to load an image. I want to instead clear my in-memory cache and try again with more memory available. So I wrapped the code that loads images into a try-catch that catches OutOfMemoryError and it worked wonderfully.
The ability to "handle" OOM in userspace exists for a particular class of software — software that is usually configured to 1. use unbounded and unpredictable amounts of memory, and where 2. requests are entirely sandboxed in their resource usage, but within the software itself rather than at the OS level.
Basically there's only one kind of software in this class: DBMS software. DBMSes want to be able to try to process ridiculous queries if users ask them to, and then fail in a way that only affects the processing of that query rather than the stability of the DBMS as a whole. And they also mostly can't afford the overhead of pre-calculating just how ridiculous a query will be before attempting it, because that calculation often requires effectively 90% of the work involved in actually running the query.
For every other type of software, letting the OS handle the OOM (by killing your process) — and setting up your higher-level inter-process / inter-node architecture to be resilient to that — is the sensible approach.
Why are you ignoring the obvious example of operating systems? An operating system doesn't and shouldn't crash if you run out of memory.
Additionally I have personal experience with network appliances (firewalls/deep packet inspection/etc). If such a device runs out of memory, it degrades gracefully (unless there's a bug) and starts shedding connections rapidly to avoid running out of memory. Such devices can't just restart if they run out of memory as that would be a network outage and can often be exploited to cause a more serious denial of service attack.
Rust is a systems language. Handling out-of-memory conditions is par for the course for systems programming.
FWIW, this is a little outdated now. If you're willing to bite off nightly, there's cfg(no_global_oom_handling) which will just remove access to the aborting allocation API. It came out of the Linux kernel work and their very similar concerns IIRC.
If you cfg(no_global_oom_handling) lots of nice things go away. Which is both appropriate - those nice things did in fact depend on global OOM handling - and hopefully likely to keep more people from erroneously believing they can't afford global OOM handling.
As an example, ordinarily in Rust this does what it looks like it does:

    let mut lyrics = "Never was".to_owned();
    lyrics += " a cornflake girl";
With cfg(no_global_oom_handling) this can't work, most importantly because the second line is an arbitrary concatenation and therefore allocates, but there is no possible way to signal if it fails.
As an embedded C guy this makes me so happy. Yes, that second line allocates and the fact that it is hidden makes me nervous. I want a programming environment that exposes all of the complexity and makes me deal with the consequences.
You can't push_str() under cfg(no_global_oom_handling) either
I mean, they did warn you: No Global OOM Handling. So if something mustn't fail, yet it might not succeed, it has to be eliminated: a String under cfg(no_global_oom_handling) can't push_str(), can't reserve(), and can't do a lot of other things. Their definitions are conditionally removed from the type.
I didn't check, but it's possible "something".to_owned() is similarly forbidden; it seems like that would allocate memory.
Basically if you live in a world where allocating memory for strings, vectors, and other growable types feels extravagant, cfg(no_global_oom_handling) is for you, and if not then maybe you should re-evaluate why you are worried about allocation failures when you are wasting precious heap memory on such data structures.
rust-analyzer could probably do it. That information isn't exposed in type signatures, though, so it'd have to have full access to the source of everything and a good understanding of the standard library (and would fail if you added your own “might allocate” by going around the back of the global allocator).
If you're allowed to have false positives, it's not so hard. Like "this function allocates if A or B is true" just becomes "this function can allocate". It's not so different from finding functions that can panic.
I’m fairly certain that this does not require solving the halting problem. Fundamentally, every line of code can be transformed to the underlying code constructs, and those could be checked for allocations. Doing this in an efficient manner is probably hard, but certainly not impossible.
I think if we wanted a concrete answer you’d be right.
But I feel like a static analyzer could clearly identify “this line can allocate” just by knowing what language features can allocate. And for third party library methods “does this function have anything inside of it that could allocate?”
The zig approach seems really well suited to low-level libraries where you really need to handle these errors (and custom allocators, etc.) but it's kind of annoying for general purpose programs.
"where the C++ pieces can all recover from OOM and the Rust component cannot"
I would like to read about their successful test of their C++ pieces using the same (randomly failing small allocations) strategy. My impression from Herb Sutter is that on the popular C++ standard library implementations (for MSVC, Clang, GCC) this actually doesn't work, but of course Herb has a reason to say that - so it'd be interesting to hear from somebody who has a reason to believe the opposite.
Why not budget how much memory your Rust component needs, and allocate a pool ahead of time? This is how it's sometimes done in embedded systems. It depends on what your Rust code does of course, but if its runtime memory use depends on untrusted data it is handling, that is a concern in itself.
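Roughly what I mean, as a deliberately naive sketch in Rust (a bump pool with no alignment handling; the budget figure would come from your provisioning):

    // Carve the whole budget out of the heap at startup, then hand out slices
    // until it runs dry instead of touching the global allocator per request.
    struct Pool {
        buf: Vec<u8>,
        used: usize,
    }

    impl Pool {
        fn with_budget(bytes: usize) -> Pool {
            Pool { buf: vec![0u8; bytes], used: 0 }
        }

        // Returns None when the budget is spent, so the caller can reject just
        // that request instead of taking the whole process down.
        fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
            let start = self.used;
            let end = start.checked_add(len)?;
            if end > self.buf.len() {
                return None;
            }
            self.used = end;
            Some(&mut self.buf[start..end])
        }
    }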
I'm not sure if this can apply to the most commonly used software. A browser can't know in advance how much memory it would need. Same thing applies for audio, image or video editing software. Probably office suites too.
This might be feasible for apps with a limited feature set, like an audio/video player, an IM client, etc...
It won’t. I was thinking more about backend components where you allocate a fixed per-request memory budget. It would work best if you could apply the budget per in-flight request. If a request goes over budget, only the affected (presumably nefarious) request gets terminated; other requests and other processes run uninterrupted.
Most commonly used software such as browsers can abort on OOM, this isn't about those cases at all.
In situations where you do not want to abort, you usually can, and would want to, allocate a pool beforehand.
All this is moot for me since I only use systems with overcommit enabled.
Is there a way to disable it for a single program or cgroup, to enable it to deal with out-of-memory conditions? Maybe changing/hooking the standard library?
mlockall(2) seems overkill, since it will also force all code and mapped files to be resident.
It's bonkers to allow programs to destabilize the whole machine to the point that the kernel has to start killing off processes just to survive.
I use https://lib.rs/cap which self-imposes a memory limit on the process. Setting that limit below the cgroup limit allows programs to actually handle OOM before they get killed by the OS (if only Rust's libstd wasn't so eager to self-abort anyway).
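If anyone's curious, the setup is just a global-allocator swap; roughly this, going from memory of the crate's README (the 512 MiB figure is arbitrary):

    use std::alloc;
    use cap::Cap;

    // Wrap the system allocator so every allocation is counted against a limit.
    #[global_allocator]
    static ALLOCATOR: Cap<alloc::System> = Cap::new(alloc::System, usize::MAX);

    fn main() {
        // Keep the self-imposed limit below the cgroup/container limit so
        // allocation failures surface in-process before the OOM killer fires.
        ALLOCATOR.set_limit(512 * 1024 * 1024).unwrap();
        // ... rest of the program; fallible APIs like try_reserve now see this limit.
    }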
In the case of servers/VMs/containers you probably know how much RAM you've provisioned, so set it to that minus whatever other processes need to live.
For desktop applications it's tough. It may be just an arbitrarily high amount you don't expect to hit during normal operation. If you need to work with variable-size data, then it could be `size_of_file_being_opened * x` if you can predict the `x`.
I suppose I could use a custom allocator that calls mlock() on each page. I will look at whether the ones you mention have Rust integrations and whether they expose a setting like that.
I understood wyldfire's post to mean counting & imposing your own artificial memory limit, no mlock involved. But I wouldn't know how to pick that limit :/.
Oh I see. Well, I'm not trying to add a limit to my RSS, I just want to be notified when allocations fail in my process, on my machine where overcommit is enabled.
Maybe what I'm asking for makes no sense though, because even if my process handles out of memory errors gracefully, it might still get OOM-killed when another process allocates some more.
Yes that is what I meant. And it would avoid the latency hit of a system call to satisfy an allocation. However doing an mlock would avoid the latency hit of paging, so that's good too.
In a way, it's counterproductive to limit VSZ. It's perfectly fine to mmap() gigabytes of files to virtual addresses, but you neither expect nor want to have that counted against your "memory usage" limit.
The entire problem is, in a way, unsolvable with current OS APIs. AFAIK there is no preexisting, good, actually usable, and universal memory usage meter. Some things work for a lot of cases, but I don't think there's anything that would work universally.
Coincidentally, in my eyes the best way to handle memory pressure in applications would be proactively rather than reactively. Sometimes you can unload more things, like pages in a document that aren't being edited right now. And if you need to fail, you can fail on good boundaries (e.g. refusing entire requests in server-like things) rather than in the middle of something that you need tons of work to unwind correctly.