Non-uniform memory access meets the OOM killer (rachelbythebay.com)
109 points by r4um on March 31, 2018 | 69 comments



The OOM killer uses a heuristic to figure out what to kill. If the primary purpose of your system is to run some process that hangs on to a lot of RAM, that heuristic is exactly the opposite of what you need, so it would be a good idea to disable it or exempt that process.
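
For the "exempt that process" part, a minimal sketch (the -1000 value is the "never kill" setting for oom_score_adj; lowering the score requires root or CAP_SYS_RESOURCE):

    /* Sketch: exempt the current process from the OOM killer by writing
     * -1000 to /proc/self/oom_score_adj (valid range -1000..1000;
     * -1000 means "never kill this process"). */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (!f) {
            perror("fopen /proc/self/oom_score_adj");
            return 1;
        }
        if (fprintf(f, "-1000\n") < 0 || fclose(f) != 0) {
            perror("write oom_score_adj");
            return 1;
        }
        /* ... now exec or continue as the memory-hungry workload ... */
        return 0;
    }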

Also, while I'm talking prophylactics: if you have monitoring and alerting in your production environment (which you should), it seems like there should be an alert for whenever the OOM killer activates. Assuming you are allocating resources carefully enough that you expect everything to fit, if it fires, it's almost always a sign that things are not going according to plan and need to be investigated sooner rather than later.


Yes, OOM killer activation is on its face evidence of an incident; as comments elsewhere in this thread put it: “If your process ran out of RAM, you get to quit. Why offload it on some other random process? This is how your database process runs out of memory, and your web workers get killed (or vice versa).” That scenario describes a capacity shortage regardless of what the system decides to do in response.


A lot of time seems to go into tricking the watchdogs on single-purpose machines. I heard a story once of a guy who wanted to get some computation done, but the process was being deprioritized by the scheduler because it looked like a hung process that kept asking for CPU time. The solution he came up with was voluntarily relinquishing compute access right before anyone would check up on it, making it appear as if the process was great at sharing time with others. By doing this, he could get that one process’s instructions running something like 99% of the time.


That's a pretty odd workaround, considering that you can reserve cores to the point where not even kernel tasks run on them, and then pin a single userspace thread to such a core so it can run without ever being preemptively descheduled.
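
Something like this, assuming the kernel was booted with isolcpus (and ideally nohz_full) so nothing else gets scheduled on that core; CPU 3 here is just an example:

    /* Sketch: pin the calling thread to CPU 3, which was isolated at boot
     * (e.g. isolcpus=3), so it runs essentially uninterrupted. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... the hot loop runs here ... */
        return 0;
    }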


For anyone like me who hadn't heard of this possibility before, I'll save you a few seconds of googling: https://stackoverflow.com/questions/13583146/whole-one-core-...


This wouldn't even work with the Completely Fair Scheduler, which is the default Linux scheduler.

https://en.m.wikipedia.org/wiki/Completely_Fair_Scheduler


CFS obeys core isolation and task sets, so it would work


I was speaking of the "odd workaround", not about using cpu isolation.


> The solution he came up with was voluntarily relinquishing compute access right before anyone would check up on it

I can't parse this; could you clarify? What exactly do you mean by "relinquish compute access right before anyone would check up on it"?


I can’t figure out whether it was me who wrote the words “compute access” or whether it was meant to be something else and got autocorrected. Either way, what I meant was that the scheduler would normally come around at some fixed time, say after 1 ms, and task-switch to something else if it found the original process still executing. Instead of letting this happen, he paused execution at 0.99 ms and told the scheduler he was done with his work. Then he immediately resumed the process, which caused the scheduler to return control back to him.
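
For illustration, a rough sketch of the trick as I understood it, assuming a scheduler that rewards voluntary yielding (the numbers are made up to match the story):

    /* Sketch: do ~0.99 ms of work, then voluntarily yield just before the
     * 1 ms quantum would expire, so the task never looks like a CPU hog. */
    #include <sched.h>
    #include <time.h>

    static long elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            /* ... a small slice of the real computation goes here ... */
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (elapsed_ns(start, now) > 990000L) {   /* ~0.99 ms */
                sched_yield();   /* "I'm done", then immediately resume */
                clock_gettime(CLOCK_MONOTONIC, &start);
            }
        }
    }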


That's a pretty bad scheduler, then: normally a task that yields voluntarily would be moved to the tail of the 'ready' list for its priority level rather than to the front.


IIRC there are schedulers that dynamically adjust priority based on recent CPU usage. When a task is removed from the CPU due to its time quantum running out (as opposed to, say, making a blocking system call), that is taken as a sign that the task is relatively CPU-intensive, and its priority is reduced a little.

The idea is to give IO-bound tasks a little more priority in comparison. IO-bound tasks generally run faster if they can get another IO operation started ASAP after the last one completes, and since they're not going to hog the CPU anyway, giving them relatively higher priority is one way to do this.

Anyway, the point is that the trick works by gaming the system of priority levels. If you're at the tail of one priority level, you may still be better off than at the head of another.


Ohh huh wow, I'm surprised the scheduler would fall for that!


Setting niceness to minimum wasn't enough?


>This new version also had this wacky little "feature" where it tried to bind itself to a single NUMA node.

This is 100% a feature. If you care at all about memory access latency, you want to remain local to the NUMA node. Foreign memory access is significantly slower. If you have NUMA enabled, your applications are not NUMA aware, and there are shared pages being accessed by applications running on both nodes, the NUMA rebalancing can actually cause even worse performance, as it constantly moves the pages from one node to the other.

Any application that cares about memory access latency should 100% be written to be NUMA aware, and if it is not, you should be using numactl to bind the application to the proper node.

This also goes for PCIe devices (including NVMe drives!) as they are going to be bound to a NUMA node as well. If you have an application that is accessing an NVMe volume, or using a GPU, you should 100% make sure that it is running on the same node as the PCIe bus for that device.
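
For the non-NUMA-aware case that's just numactl --cpunodebind=0 --membind=0 ./app. For code that wants to handle it itself, a minimal libnuma sketch (node 0 picked arbitrarily; link with -lnuma):

    /* Sketch: keep this thread and its allocations on one NUMA node. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        numa_run_on_node(0);      /* restrict this thread's CPUs to node 0 */
        numa_set_preferred(0);    /* prefer node 0 for future allocations  */

        /* or allocate explicitly on the node: */
        void *buf = numa_alloc_onnode(1 << 20, 0);
        /* ... use buf ... */
        numa_free(buf, 1 << 20);
        return 0;
    }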


Time to re-up this classic from Andries Brouwer:

https://lwn.net/Articles/104185/


It's now commonplace for even medium-size companies to run dozens of servers. Memory resources (as well as disk and CPU) are always being stretched. The OOM killer may have been sufficient for single-server environments, where you could always provision an extra 40%, but it's far too blunt a tool.

Most environments I've worked with have to define an instance size (in memory and CPU) and determine how many parallel threads/processes will run on it. Plus you need to determine when and how to scale up to more instances. To reduce costs, the goal is 100% utilization, but also with the capability to deal with spikes in traffic and workload, and all with an acceptable error rate.

Unfortunately, doing this kind of sizing/scaling analysis is incredibly difficult. The opaque effects of the OOM killer make it even more difficult. I'm sure the OOM killer uses a deterministic algorithm, but it's complex enough that most people don't know it or account for it. In a server environment, if the OOM killer kills a service, your app and all the other services are likely hosed. It would be far more preferable if the OOM killer had a straightforward, consistent, and deterministic method of dealing with low memory. That way programmers would know to look out for it, and could handle it more consistently.


The OOM killer was a misfeature when it was designed. Why is it still in the kernel? Solaris solved this problem 20 years ago.


Overcommit is desirable because of the fork/exec model of creating processes: if a big process wants to spawn children, then without overcommit you need to be able to commit its entire VM size a second time, regardless of how lightweight the desired child is.

Sadly there is no good (and simple) solution under most Unixes (it's a problem that only affects long-running and big processes, but those are often the main applications of some embedded systems). I'm not really fond of vfork now that we use threads more often (well, at least on Linux it seems other threads are not suspended, but that doesn't appear to be a POSIX guarantee)... posix_spawn could be a solution if implemented with some dedicated kernel support (or always implemented through well-behaved existing primitives).
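
For what it's worth, the call site is simple enough; a sketch using posix_spawnp (the echo child is just a placeholder):

    /* Sketch: spawn a child in a way that can avoid committing the
     * parent's address space twice, unlike a plain fork(). */
    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>

    extern char **environ;

    int main(void) {
        pid_t pid;
        char *child_argv[] = { "echo", "hello from the child", NULL };
        int rc = posix_spawnp(&pid, "echo", NULL, NULL, child_argv, environ);
        if (rc != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(rc));
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }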


> if a big one wants to spawn children

Should this be considered poor behavior or design?

It's been quite a while since I wrote explicit fork/exec code, but wouldn't a better approach be to have a small master process that spawns off the necessary children and then either links them up or mediates communication?

I mean, on a unix-like system, init is ultimately the spawner of everything else and it's not a particularly large process.


Yes, on Unix it can be worthwhile, if you need fork/exec to spawn children, to pre-spawn a helper child process that does just that. But that's only necessary because the fork/exec model is insane: with a posix_spawn (or CreateProcess) centric approach, you don't have to add that extra layer, and you have very few drawbacks. Also, the helper won't save you when libraries try to spawn on their own.

As for init being small, I'll leave that claim to you, now that we live in a systemd world :p


You can do this:

vm.overcommit_memory=2

And then the kernel will no longer do overcommit.


Indeed. But it's mysterious that the kernel does it at all, and that on most systems the misfeature is enabled by default...


The reason it is there is that the C memory allocation model tends to ask for rather more memory than the program actually needs. This leads to inefficiency if all that unused memory is backed by physical memory. So this overcommit trick was created to allow other processes to 'steal pages' from processes that have allocated memory but have decided not to use them for now. In practice this works reasonably well, but there are some nasty edge cases when one or more of those programs wake up and then decide to actually use the memory they think they already have.

For systems where deterministic behavior is more important than flexibility (for me: all of them, for others: their choice) you're better off disabling overcommit.


I don't think there is such a thing as a "C memory allocation model". There are various ways of requesting memory from the OS. But not requesting anything in advance is going to be very slow in any setting that requires a context switch or virtual memory.


A typical memory manager will scale up the size of the requests to the OS whenever it runs out of space.


Pardon my ignorance: How did Solaris solve this?


I just read http://unix.derkeiler.com/Newsgroups/comp.unix.solaris/2008-... and it seems that they did not solve any problem at all; they simply don't allow overcommit. But they don't do anything else to compensate for the disadvantages this implies (the approach has advantages too, but not only advantages).

You can tune Linux to behave this way if you like it. On a Unix system, I don't think this is the most desirable behavior in the general case...


I can't remember but I did sit in on a presentation about 15 years ago where they explained it. I lent the notes to a senior developer and never got them back.


It doesn't have an OOM killer. Even more remarkably, a call to allocate memory can't fail, but it may not return either. When Solaris (well, SmartOS in my case) runs completely out of memory, all hell breaks loose.


Not every system has virtual memory; in particular, Android uses the OOM killer pretty aggressively to manage transient applications.

In that case OOM killer is a pretty nice solution compared to the alternatives.


That's why one needs to be aware of the memory and other load characteristics of the system, particularly if it is an enterprise system. Various processes should be put in different cgroups with defined resources. cgroups also provide memory pressure notifications and other goodies too. If it is an embedded system, it is probably best to turn off overcommit. Finally, for critical processes, adjust the OOM priority (e.g. oom_score_adj) so the process can be excluded from being killed.


This is the reason I am a big fan of running any software with separate users and setting ulimit to a low value, so that something stupid like this cannot impact the production service. I would be super keen to try to replicate this scenario on my test cluster and see if my settings catch it. Does anybody know if the software in question is an open-source tool?


This is the approach I take also. I'm also looking at totally disabling the OOM killer because it seems to be pretty useless. Any time I see stuff killed by OOM, the culprit is usually and obviously some runaway Java process, but the OOM killer inevitably picks the SSH daemon to kill, which doesn't help anything, and the box continues to swap so badly that it just seems unrecoverable. I'd rather just have the box panic and reboot if it's truly out of memory.


I have not looked into it at all, but can you not exempt sshd from the OOM killer?


I looked into it a little bit. There are ways to tune it but I didn't see a way to exempt processes by name. It may be possible.

The scenario I described above is HPC clusters in a university environment. The problem is students running programs that are poorly written. I'd rather reboot the node and tell them to fix their code than deal with trying to accommodate their careless / naive programming.


At my last job I wrote a build system that built maybe 30 or 40 executables from several hundred source files. Sometimes when I'd run make -j with no constraint, my desktop environment would crash.

It turned out that the OOM killer was triggering because I was filling up memory with compiler invocations.

I was really proud of that bug.


I seem to end up building a lot of stuff on memory constrained devices for some reason. The OOM killer is always a problem, but it's easily avoided by provisioning an excessive amount of swap. It's slow, but slow is faster than never. Did you have any swap at the time?


Yeah, probably. It was just a desktop Ubuntu system.


This might sound really obvious but you know you can do "make -j12" or some other number, right?


Yeah, I do. I just found it hilarious that my build system could take down my window manager.



What happened to good old returning NULL when no memory is available?

No let's do overcommit (malloc always works) and OOM-kill some random process when under memory pressure!


It's not the case that malloc always works under Linux. It may return NULL if the memory is not available. Try it out! Write a program that allocates half your memory, fills it with 0xFF, then sleeps. Run two instances of it. The malloc will fail in the second one.
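
A rough sketch of that experiment, with the size hard-coded as a placeholder (adjust HALF_RAM_BYTES for your machine):

    /* Allocate ~half of physical RAM, touch every page, then sleep.
     * Run two copies: the second one's malloc (or its page-touching)
     * is the one that runs into trouble. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define HALF_RAM_BYTES (8UL * 1024 * 1024 * 1024)   /* e.g. half of 16 GB */

    int main(void) {
        unsigned char *p = malloc(HALF_RAM_BYTES);
        if (!p) {
            fprintf(stderr, "malloc failed\n");
            return 1;
        }
        memset(p, 0xFF, HALF_RAM_BYTES);   /* force every page to be backed */
        sleep(600);
        free(p);
        return 0;
    }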

According to [1], the OOM killer is a consequence of the fork syscall, and can't be removed without breaking backward compatibility.

[1] https://drewdevault.com/2018/01/02/The-case-against-fork.htm...


The man page for fork indicates that it can fail and return ENOMEM just fine. Technically, it's only supposed to be returned in special conditions where the kernel doesn't have enough memory, but this seems like much less of a breakage than going around killing random processes.


It certainly is permitted to fail by spec. The goal of the entire overcommit / OOM killer business is to yield better real-world outcomes.

If I have overcommit off, and a program that allocates more than half of my physical memory (RAM + swap), it can't fork - even if it's going to immediately exec a tiny program. If I have overcommit on, it can fork and exec, and nobody gets OOM killed.

The tradeoff is that, if I have overcommit on and the child process starts modifying every page instead of exec'ing, then the OOM killer triggers. The bet is that a) probably the child process will be killed, since it has the most physical memory in use (assuming the parent process has not touched every page it has allocated), so this isn't worse than preventing the child process in the first place, and b) this situation is rare.


The point is more that processes holding on to a lot of memory often fork and exec and don’t actually need 2X their current allocation in-between.

Yes, there are alternative APIs that avoid the need to temporarily hold on to that memory in both processes, but the idiom is still incredibly common. And it’s not unusual for server-style processing to fork without an exec and largely share their parent’s allocation via copy-on-write. That defers allocation until there’s a page fault on a simple memory write, which doesn’t have a particularly helpful mapping to C if you want to return failure to the copying process.


You can turn off overcommit.

The basic idea isn’t that having malloc return NULL on failure is worse than killing a random process when any process runs out of memory. The idea is that programs often ask for memory that they don’t need, or don’t need right away.

When a program actually uses too much memory, there isn't a way to signal failure, so the OOM killer gets involved. If that tradeoff is wrong for you, turn off overcommit. Unfortunately, overcommit is a system-wide setting, so individual programs can't opt in or out.


Oh but there actually are ways to signal failure (flushing caches, etc.), and there probably could and should be even more, but still the OOM killer is "needed" as a last resort measure.

I recently saw an interesting Windows VM feature, where a process can indicate that some pages can be sacrificed to the OS if needed, and another function to try to get them back unmodified if by chance they have not been. I don't know if similar Linux syscalls exists, but I found the principle interesting.


You can get a similar effect on recent Linux versions with madvise() with MADV_FREE, although the interface for determining whether the page was freed is a bit awkward: the page will appear as zeroed if the OS decided to free it. Depending on what you are storing there and whether all-zero data is valid, it may be awkward to detect.

You can cancel the free request by performing a write, which is also a bit of an awkward API.
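
A sketch of that dance; note that the "did it come back zeroed" check really has to be done per page, and only works if all-zero is not valid data for whatever you cached there:

    /* Mark a cache region as reclaimable with MADV_FREE (Linux >= 4.5),
     * then later check whether the kernel actually took it back. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1 << 20;
        unsigned char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (cache == MAP_FAILED) { perror("mmap"); return 1; }
        memset(cache, 0xAB, len);        /* fill with a nonzero marker */

        madvise(cache, len, MADV_FREE);  /* kernel may reclaim under pressure */

        /* ... later, when we want the cache back ... */
        if (cache[0] == 0) {
            /* page was reclaimed and came back zeroed: regenerate it */
        } else {
            cache[0] = cache[0];         /* a write cancels the pending free */
        }
        return 0;
    }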


We collectively decided that this is less annoying than introducing thousands of difficult error cases to handle in every application.


How is it difficult to handle?

    void* xmalloc(size_t s) {
        void *p;
        if (!(p = malloc(s))) {
            panic("OOM");
        }
        return p;
    }
If your process ran out of RAM, you get to quit. Why offload it on some other random process? This is how your database process runs out of memory, and your web workers get killed (or vice versa). In either case your system isn’t usable, but one is harder to debug.

In fact, having the above in the stdlib would have been just as effective at fixing thousands of little bugs, and would have collectively saved us all thousands of man-hours (if not more) spent developing the OOM killer and then raging online about it.


Because it's not necessarily your process that ran out of RAM. Imagine you have background process A which has allocated 1 GB of RAM just in case and never used it, and then process B tries to allocate maybe 100 MB, and the kernel says "No, I can't do that." Then process B quits. Is this really a better outcome than killing process A?

On my current laptop, pulseaudio has 2.7 GB virtual memory and 17 MB physical memory. If I open a new Firefox tab, I would much prefer for pulseaudio to die than for Firefox to die.

(You might say, "Well, pulseaudio shouldn't allocate 2.7 GB virtual memory," to which I'd say, "Sure, if you'd rather spend the thousands of hours fixing pulseaudio and every other program like it, go for it.")

Also, "panic" isn't a standard C function - are you envisioning a panic that unwinds or one that doesn't? If it unwinds, what happens if you need to allocate memory during the unwind to save state? If it doesn't, why is it a better state of the world for Firefox to crash periodically and require me to recover my work, without even giving it the option to synchronize its state?


> pulseaudio has 2.7 GB virtual memory

heh, anything compiled with GHC allocates a TERABYTE of virtual memory, good luck running that w/o overcommit


Malloc may very well succeed when there is not enough free memory in the system. You'll only know it when you try to write to memory and the system runs out of real memory to back your virtual memory.

This can happen much later in time.

Memory allocated by malloc is not necessarily there by the time you need it.

Another issue is that a fork will simply copy the task state and set all the pages to 'copy on write' in the clone. That way the fork can execute very rapidly and only when the child process starts writing pages is there some actual memory overhead.


Well, what about adding a new signal (SIGXMEM) with a default action of ignore? If the system is running low on memory it can send this to some or all processes and wait for a little bit to see if things get better.

This is how iOS has handled things since version 2.0: https://developer.apple.com/documentation/uikit/uiapplicatio...

> It is strongly recommended that you implement this method. If your app does not release enough memory during low-memory conditions, the system may terminate it outright.


See section 11 "Memory Pressure" of https://www.kernel.org/doc/Documentation/cgroup-v1/memory.tx... - there's a way to get notified via eventfd() if your current cgroup's memory gets low. I believe you can just do this on the root memory cgroup (/sys/fs/cgroup/memory/memory.pressure_level) if you're not setting up actual cgroups for your application.

(Signals for asynchronous conditions are an awkward interface because they can interrupt you between any two assembly instructions. You're not able to release memory in the handler itself; you have to set a flag that gets handled by the main loop. So eventfd makes sense here. I'm assuming iOS is doing something similar by queueing an Objective-C method call. Signals make a lot more sense for segfaults and the like, where you're being interrupted at the exact instruction that isn't working and you need to handle it before executing any more instructions.)
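
A sketch of that registration against the root memory cgroup (assuming cgroup v1 mounted at /sys/fs/cgroup/memory; the level can be "low", "medium" or "critical"):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void) {
        int efd  = eventfd(0, 0);
        int plfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
        int ecfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
        if (efd < 0 || plfd < 0 || ecfd < 0) { perror("setup"); return 1; }

        /* register: "<eventfd> <pressure_level fd> <level>" */
        char line[64];
        snprintf(line, sizeof line, "%d %d low", efd, plfd);
        if (write(ecfd, line, strlen(line)) < 0) { perror("event_control"); return 1; }

        uint64_t count;
        while (read(efd, &count, sizeof count) == sizeof count) {
            /* memory is getting tight: drop caches, shrink pools, etc. */
            fprintf(stderr, "memory pressure event\n");
        }
        return 0;
    }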


It's a good idea.


That doesn't work for the stack.

Consider a recursive function. The amount of memory it could need is essentially infinite, though obviously stuff breaks when the address space wraps around or when the stack collides with a different allocation.

So, do we just prohibit recursion? Do we need to have a privileged compiler (making signed executables) that refuses to generate code unless the stack usage can be proven at compile time?

Pretty much, if recursion is supported, you have overcommit on some level. For example, if we force developers to specify a stack size, we still have a form of overcommit. We've just moved it to the developer making promises that he probably can't verify to be safe.


I don't think I'd call that "overcommit." Overcommit is when there's more virtual memory (allegedly) available than actual backing physical memory (RAM+swap). You don't need overcommit to implement a fixed-size stack: just put a guard page (a page table entry that is intentionally invalid and reserved) at the end of the stack. If the stack grows into the guard page, you get a segfault. The actual stack pages for each thread can still be physically reserved.
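
A sketch of that layout, carving the guard page out of an ordinary anonymous mapping (with overcommit disabled, the writable part gets charged against the commit limit up front, which is the point):

    /* Build a fixed-size downward-growing stack with a PROT_NONE guard
     * page at the bottom, so overflowing it faults instead of silently
     * corrupting a neighboring allocation. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *make_stack(size_t stack_size) {
        long page = sysconf(_SC_PAGESIZE);
        size_t total = stack_size + (size_t)page;
        unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        if (mprotect(base, (size_t)page, PROT_NONE) != 0)  /* guard page */
            return NULL;
        return base + total;   /* stacks grow down: hand back the top */
    }

    int main(void) {
        void *top = make_stack(64 * 1024);
        printf("stack top at %p\n", top);
        return 0;
    }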

It still does result in killed processes by default - but you can set up a segfault handler with an alternate (preallocated and non-overcommitted) stack which can handle it gracefully, unlike the OOM killer's SIGKILL. Or you can write your code so that each function call checks that stack space is available before using it. Go and Rust both used to do this, via a function called __morestack, because they both used to support segmented stacks. GCC and LLVM both support this scheme. (And for a while Rust still called __morestack for the purpose of printing a nicer error message without having to set a segfault handler: https://ldpreload.com/blog/stack-smashes-you )

But "overcommit" isn't a synonym for "something that kills processes when they exceed their memory allocation."


"Overcommit" is when the processes running on the system have been promised more memory than is available.

If you run an ordinary Linux program, exactly how much memory should be allocated to the stack? That question is essentially unanswered. At the ABI level, considering just the ELF binary itself, there is no limit in place. (no fixed size) This is a promise to provide unlimited memory.

It is obvious that giving unlimited memory to every running process is not possible.

If we were to eliminate overcommit and attempt to boot the system, the init process could not legitimately start. That process implicitly requests an unlimited amount of memory for the stack, which is obviously unavailable, so it can not be started.


> If you run an ordinary Linux program, exactly how much memory should be allocated to the stack? That question is essentially unanswered.

Not at all:

    (initramfs) grep stack /proc/1/limits
    Max stack size            8388608              unlimited            bytes     
    (initramfs) grep stack /proc/1/maps
    7ffca6fb3000-7ffca6fd4000 rw-p 00000000 00:00 0                          [stack]
init can request more than 8 MB if it needs it, but the binary starts with 128 kB allocated (and more faultable if you hit the guard page) and an 8 MB limit. It can change its own rlimits, sure, but it needs to do that within its existing 8 MB stack.

It would be perfectly straightforward to permit setrlimit to fail if you request too much of a stack, and disable overcommit entirely. So init can raise its stack by some amount, but not more than the available physical memory on the machine.

See also public header file <linux/resource.h>:

    /*
     * Limit the stack by to some sane default: root can always
     * increase this limit if needed..  8MB seems reasonable.
     */
    #define _STK_LIM        (8*1024*1024)
> At the ABI level, considering just the ELF binary itself, there is no limit in place. (no fixed size) This is a promise to provide unlimited memory.

The ABI also provides no limit on the size of bss, or on the size of the program itself. This is hardly a promise to provide unlimited memory for global variables or program text or anything else. The amount of memory is finite and unspecified by the SysV ABI and left up to implementations.


I am well aware of that 8 MiB rlimit. It's just a default that root can change. It's both too much and not enough.

It's too much because even 8 MiB is excessive without overcommit.

It's not enough because nothing about the ELF binary even bothers to claim that any particular amount of stack is enough.

I really wouldn't call "8 MiB" an answer to how much stack space should be allocated or committed.


> So, do we just prohibit recursion?

This is what MISRA C does

> Do we need to have a privileged compiler (making signed executables) that refuses to generate code unless the stack usage can be proven at compile time?

This is roughly what the Linux kernel does (I believe it's more a set of checks that aren't formally proven)

So yes, if you want to run a process reliably, you shouldn't be doing any potentially unbounded recursion.

> We've just moved it to the developer making promises that he probably can't verify to be safe.

I don't agree: programs get a pretty large amount of stack space by default, and as long as you aren't doing any crazy amount of alloca or recursion (which is in fact the case for most programs), you'll be fine. They may not be formally verified to be safe, but practically they are. OTOH, the OOM killer can take you down regardless of what you're doing, based on what other processes are doing, so developers can have no assurance whatsoever that their programs won't suddenly die. Anecdotally, I've only ever seen stack overflows due to infinite recursion bugs that are caught during development, but I see OOM-related problems not at all infrequently.


You can set ulimits for memory (IIRC max vss, but check the bash manpage). Then sbrk and mmap will honor the limits, which will cause malloc to fail gracefully.
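
For example, a sketch capping the address space at 1 GB from inside the process (equivalent to ulimit -v 1048576 in the shell before launching it):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit lim = { .rlim_cur = 1UL << 30, .rlim_max = 1UL << 30 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* from here on, allocations that push the address space past ~1 GB
         * fail with ENOMEM and malloc() returns NULL */
        return 0;
    }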


Note that this is a limit on virtual memory - the OOM killer triggers on running out of physical memory (or swap). It certainly helps more than not using it, but it's hard to tune exactly, because it ignores the specific thing that the OOM killer is designed to support: applications allocating large amounts of virtual memory and not using it so that no backing physical memory/swap needs to be allocated.


many applications just straight use mmap


Being able to write NUMA-aware applications like the one described in the article is a luxury that Go users don't have. The current Go runtime doesn't have any NUMA awareness.

As of today, you can get a two-NUMA-node processor (AMD Threadripper 1900X) for as little as $449.


Memory overcommit is the most hostile, idiotic misfeature to ever ship in any mainstream operating system. It's such a great example of why one should pay absolutely no attention to Linus.



