Mmap(): Now you're coding with portals

lilyball · on Feb 25, 2015

Mike Ash covered the idea of using virtual memory tricks to handle ring buffers 3 years ago, written up as https://mikeash.com/pyblog/friday-qa-2012-02-03-ring-buffers....

Unlike the OP's article, Mike Ash's version uses vm_remap() to remap memory around instead of hitting the filesystem and relying on tempfs to keep the data in-memory. vm_remap() is an OS X API, and I don't know offhand if there is any equivalent on Linux (though I would be surprised if there isn't some way to do the same thing).

devbug · on Feb 25, 2015

ryg too wrote about "magic ring buffers" 3 years ago: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff...

adrusi · on Feb 25, 2015

mmap doesn't necessarily have to hit the filesystem, as there are several ways to create file descriptors that are backed by memory (POSIX shared memory, typed memory objects, for instance).

dividuum · on Feb 25, 2015

Is this the general idea?

    int fd = shm_open("test", O_RDWR|O_CREAT, 0600);
    ftruncate(fd, 4096*2);

    char *part1 = mmap(NULL, 4096, PROT_WRITE|PROT_WRITE, MAP_SHARED, fd, 0);
    char *part2 = mmap(part1 + 4096, 4096, PROT_WRITE|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

    part1[0] = 'X';

    assert(part2 - part1 == 4096);
    assert(part1[0] == part2[0]);

    shm_unlink("test");

agwa · on Feb 25, 2015

I've programmed with POSIX shared memory, and yes, that's the general idea.

A very important caveat is that the POSIX shared memory namespace is shared among all processes, so you need to wrap shm_open with a mkstemp()-style algorithm that generates a random name, opens with O_EXCL, and tries again if it fails. Unfortunately it's very easy to mess that up and introduce a security vulnerability.

hawski · on Feb 25, 2015

It would be also prefferable to unlink just after /open/ call.

Wouldn't mkstemp("/dev/shm/tmp-XXXXXX") be enough. As strace tells /shm_open/ is just a wrapper for /open/ with /dev/shm/ prefix.

agwa · on Feb 25, 2015

On Linux, shm_open is implemented with /dev/shm, but that's not the case on other platforms.

lilyball · on Feb 25, 2015

Presumably you have to worry about file descriptor limits even if your fd doesn't hit the filesystem. vm_remap in OS X is nice because it doesn't need file descriptors at all.

pstrateman · on Feb 25, 2015

POSIX shared memory is implemented on linux systems as a mmap'd tmpfs file...

hawski · on Feb 25, 2015

Best way is to use mmap MAP_ANONYMOUS. Or with kernels 3.17+ memfd.

datenwolf · on Feb 25, 2015

The funny thing is, I was asking about a function like memfd over 10 years ago:

https://groups.google.com/forum/#!topic/comp.os.linux.develo...

Animats · on Feb 25, 2015

At first I thought this was for interprocess communication. But it's not. It doesn't even have the locking for multithread use. This is a micro-optimization for a single-thread, single process program.

The advantage gained with all this memory mapping is that you get to avoid an extra copy coming out of the buffer, because the "poll" function returns a pointer into the buffer, not the data itself. Avoiding that copy creates a potential race condition. When "poll" is called, and returns a pointer into the buffer, it advances "head", indicating the data has been consumed. That space is now both available for writing and being used by the caller as if immutable. The code that calls "poll" must be done with the data before anyone calls "offer". You've now created an undocumented constraint on the callers to "poll" and "offer". If someone doesn't know that constraint and modifies the code, it will randomly break.

Is this micro-optimization really worth it? Modern CPUs are good at copying recently touched data.

yew · on Feb 25, 2015

This strikes me as being interesting primarily for the potential impact it has on interface design. There are applications of (this sort of) virtual memory manipulation beyond circular buffers. Performance is one consideration, but not the only one.

Also, this is a demonstration - of course it lacks synchronization mechanisms for multithreading! Any particular application of the principle would be adapted for the context in which it occurred (and hopefully be justified thereby).

(As an aside, that caveat only applies if you begin with a poll function that performs the copy itself. That implementation isn't the only obvious one, especially given a large buffer - though I suppose there's room for disagreement on that.)

jwatte · on Feb 25, 2015

I agree: Magic ring buffers are cool! (We used then in BeOS in 1999!)

Separately, watch the number of mmap segments. Linux kernel uses a tree to manage then, and the O(log N) operations really start to hurt at larger numbers.

yifanlu · on Feb 25, 2015

Does this play nice with caches? I know some systems like ARM allow aliases in the MMU which will ensure cache coherency, but it is system dependent and a lazy implementation would just disable caching and slow down the code.

Tuna-Fish · on Feb 25, 2015

On physically addressed cache architectures, like all x86 implementations, this has no ill effects. With virtually addressed caches like on ARM, this is generally a bad idea.

danbruc · on Feb 25, 2015

It is a neat idea, indeed, but from a design perspective it is pretty bad. A general purpose implementation of a collection should always copy read results to a separate buffer. Otherwise a malicious client could use the pointer to modify the content of the collection or at least, if the memory is read-only, read the content directly potentially bypassing necessary checks.

Further without external synchronization writers may at any point overwrite data still not completely processed by a reader. This could be solved by first just peeking at the data and only after the data has been completely processed performing a read to indicate that this data can now be overwritten. But this obscures the semantics of the operations and breaks multiple reader scenarios because all readers will see the same data until the first reader finished processing it.

There may be and there probably also is a use for this trick but 99 % of the time you should probably not consider doing something like that.

pjc50 · on Feb 25, 2015

malicious client

Inside the same process? Is this really a risk one can sensibly defend against? A malicious client can take your copy of the data, scan the entire memory space of the process for the other copy, and overwrite that.

danbruc · on Feb 25, 2015

Yeah, you are probably right. I mostly used managed languages for the last ten years or so and slowly start forgetting what unrestricted access to the entire address space even means. In a managed context not everything is lost if you have malicious code in your process, but then again it would probably be quite hard to make use of manual address space mappings there. So I retreat my position to making life a bit harder for malicious code if you avoid handing out pointers into your private data.

rdtsc · on Feb 25, 2015

Very cool stuff. I like the trick.

Mmap areas can be tricky sometimes if you directly cast their areas to a struct, depending on compiler optimization you might have make some things "volatile". I remember hitting a bug along those lines.

You'll also get SIGBUS errors on Linux. Was kind of suprised first time by those as well.

asveikau · on Feb 25, 2015

mmap seems very awesome when you first get to know it. You enter one of those "I just found a new programming technique" phases where you naively want to do all your I/O that way because you have just seen the light.

Then hopefully you start to understand the SIGBUS problem. I/O failure becomes indistinguishable from a bad pointer dereference. Oh wait, maybe I/O and memory really should be separate...

At least that's how I felt about it. From what I see many people do not reach that last phase.

mtanski · on Feb 25, 2015

With great power comes great responsibility. mmap is one of those tools.

Keep in mind that your whole linux system essentially mmaps your binaries / shared libraries when you run an application. And with caveats our world still keeps going around.

Error handling with mmap is a PITA but there's a few ways you can work around the general cases:

Use used mapped region for reading data and then use write() for writing it. That's what LMDB does. That's an assumption that's betting errors occurring in the write path.

If you're doing IO in a tight loop you can catch the SIGBUS sent to your thread (SIGBUS/SIGSEGV are always deliver to the thread that caused it). You can deal with the fault via sigsetjmp/siglongjmp. This has all sorts of fun draw backs (like if you're using C++ RAI after sigsetjmp).

asveikau · on Feb 26, 2015

> Keep in mind that your whole linux system essentially mmaps your binaries / shared libraries when you run an application. And with caveats our world still keeps going around.

Yes, and it does very admirable things there, brilliant things I would say. If you aren't going to touch the whole thing it doesn't have to load it from disk. If there is memory pressure it can just evict the in-memory copies of pages. All great stuff for that usage.

That said I have seen it cause issues. Most commonly I'd see it on Windows (it's not called mmap there, but whatever, same issue) if you run an app from a network share. Suddenly network timeouts make the whole app blow up. Not cool. There is actually a flag in the EXE file format that says "if you run this from a network, copy the contents to the pagefile first" - meant for exactly this scenario.

joosters · on Feb 25, 2015

A nice trick!

An alternative API, if you are using a circular buffer that is just being read() into or write() out of, is to make the I/O parts of your code use readv() and writev() instead. The circular buffer call then returns either one or two memory ranges depending on whether the range crosses the end of the buffer or not. Then you achieve the same thing as the mmap trick, full reads and writes with one syscal.

amelius · on Feb 25, 2015

Why not just use the % operator to make the memory wrap around? It seems so much simpler and less prone to errors. Ok, you'll need an extra ALU operation, but these are cheap nowadays, especially if % is implemented by bit-masking.

Also, mmap may confuse the compiler, and to counter that you will have to add "volatile" everywhere, which very likely implies a performance hit anyway.