I think he's probably right that there's no reason not to alias memcpy to memmove and make the latter smart enough to do the fastest possible thing given its input. Every implementation of memcpy and memmove that I've seen has been sufficiently complicated in order to optimize for big copies that an added comparison or two for bounds checking would seem a drop in the bucket.
But I'm open to practical examples where the performance difference would become significant.
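Roughly the kind of dispatch I have in mind (a minimal sketch with hypothetical names; real implementations add all the big-copy fast paths on top of this):

    #include <stddef.h>

    void *my_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        /* The raw pointer comparison is fine inside a libc, even if
           pedantically undefined in portable C. */
        if (d < s || d >= s + n) {
            while (n--) *d++ = *s++;        /* no harmful overlap: copy forward */
        } else {
            d += n; s += n;
            while (n--) *--d = *--s;        /* dst overlaps the tail of src: copy backward */
        }
        return dst;
    }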
> Replace "hot" bcopy() calls in ether_output() with memcpy(). This tells the compiler that source and destination are not overlapping, allowing for more aggressive optimization, leading to a significant performance improvement on busy firewalls.
Replying for masklinn, perhaps what they had in mind is the kind of optimization that memcpy's no-overlap guarantee permits but that strict aliasing rules alone do not. Consider:
    int *p, *q, x;
    ...
    memmove(p, q, 50 * sizeof(int));
    p[2] = 3;
    q[3] = 4;
    x = p[2];
In the example above, it is illegal (with only the information available) to optimize the last assignment to x = 3. Strict aliasing does not apply because everything is int.
If the call to memmove were a call to memcpy instead, a sufficiently creative compiler author could argue that optimizing the last statement to x = 3 is legal, because the call to memcpy asserts (among other things) that the int pointed to by p+2 does not share any storage with the int pointed to by q+3.
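For concreteness, here is the same fragment as a complete function (hypothetical name; callers are assumed to pass non-overlapping arrays of at least 50 ints, which is exactly what memcpy requires):

    #include <string.h>

    int demo(int *p, int *q)
    {
        memcpy(p, q, 50 * sizeof(int));
        p[2] = 3;
        q[3] = 4;
        return p[2];   /* may legally be folded to "return 3", since the memcpy
                          contract says p[2] and q[3] cannot be the same object */
    }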
That's exactly the kind of optimisation that I had in mind. What I'm saying is that the static analysis would work the same even if memcpy was implemented by jumping to memmove or some linker trick to make it resolve to memmove.
The issue that caused Linus to argue memcpy should just act like memmove involved old binaries breaking with new kernels, so the compiler's behaviour wasn't an issue there.
It was a bit more than that. He implied that the performance differences were trivial, so it wasn't worth the trouble.
I think, unfortunately, Linus' argument amounts to demanding that APIs conform to how they are used rather than to how they are specified, which leads to O_PONIES-type problems.
Perhaps in debug builds, memcpy could wrap memmove but assert() that the regions are non-overlapping? Would avoid code unintentionally relying on memcpy supporting overlap.
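Something like this, say (a minimal sketch, with a hypothetical name; the overlap test itself is the whole point):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    void *debug_memcpy(void *dst, const void *src, size_t n)
    {
        uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
        /* [s, s+n) and [d, d+n) must not overlap for memcpy to be legal. */
        assert(d + n <= s || s + n <= d);
        return memmove(dst, src, n);
    }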
My best guess is to properly align instructions along a certain boundary, but I'm pulling that out of my hat.
You can put bcopy as an entry point right before memmove, and then you don't need one function to call the other, which would cost a call and a stack push. Maybe instructions need to be at addresses ≡ 0 mod 16, so that's the "closest" you can get it. And falling through ~12 NOPs might be faster than a jump that advances the PC by ~9 bytes.
Yes, there's been a recommendation floating around for a long time that branch/call targets should be 16-byte aligned. However, my experience is that modern x86 is almost completely insensitive to alignment in general (there are some small exceptions, like avoiding crossing cacheline boundaries in a loop). It wouldn't surprise me if the NOPs only helped on earlier models, had no effect or even a negative effect on more recent CPUs, and were just left in out of tradition.
Depends on what you consider "modern". Core2 timings changed wildly based on code alignment (experienced myself and documented by others here: http://x264dev.multimedia.cx/archives/51).
Yeah, looks like it's for function alignment padding. It's a pretty common thing at the end of a function to have the next function start on a specific boundary (even when the first function doesn't fall through into the next one).
I haven't tested, but I'd bet good money that 12 NOPs would be faster than a jmp.
You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 nops per cycle. Therefore I would say the jump is probably marginally faster than 12 nops.
Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.
Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)
I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uops cache in recent Intel chips, but they do so by terminating the current uops cache line---which is generally desirable.
I believe that on several modern compilers, memcpy and memset are practically (if not literally) treated as intrinsics. As in, the compiler has been granted semantic understanding of those specific functions and can generate very well optimized assembly based on that knowledge. I haven't heard about the same being done for memmove.
They have got so good at recognising hand-written memcpy implementations and replacing them with builtins that it is almost becoming impossible to actually write a libc any more, as the libc's own memcpy implementation gets replaced with a call to memcpy [1].
What compilers are you thinking about? I've never seen GCC inlining and specializing a call to memcpy, it just calls the generic (but optimized) implementation in the standard library.
Yeah, you were right. I've checked, and it seems the GCC version I'm using inlines memcpy calls if it can statically determine the size argument and the size is less than or equal to 64 bytes.
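In case anyone wants to check their own toolchain, a small experiment along these lines shows the effect (hypothetical function names; the 64-byte threshold and the exact code generated vary with GCC version, target, and optimization level):

    #include <string.h>

    void copy16(void *dst, const void *src)
    {
        memcpy(dst, src, 16);   /* constant size: often expanded inline to a couple of moves */
    }

    void copyn(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);    /* runtime size: typically stays a call into libc */
    }

    /* Compare the output of "gcc -O2 -S" for the two functions and look for
       an actual call to memcpy. */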
I take it that you are a long-time C programmer who hasn't realized that the language is no longer the one you grew up loving; that casting a float to an int to do some bit twiddling is now undefined behavior of the sort that technically entitles the compiler to have its way with your root filesystem. Although usually it will just let you off with a warning and a gaping hole where your null-checking code used to be.
Believe it or not, memcpy() is now the only portable and properly defined way to cast one type to another in C. You are expected to use it with the explicit intent that no bytes actually be copied, simply as a (completely voluntary) offering to the type gods. If they accept your offering, the call will just disappear from your code, and never be made. If you fail to make this offering, they feel entitled to excise an equal quantity of other code to punish you.
Sure, you the programmer happen to know that bits are bits, and that on your system they are already in the register you want to use, but the compiler stopped playing that game years ago. If you wanted to work close to the metal, you should have chosen a language more appropriate for the task. There's a great discussion of the issues in this epic comp.arch thread: http://compgroups.net/comp.arch/if-it-were-easy/2993157
Search for the first occurrence of 'memcpy', where you'll find a polite but beleaguered Terje Mathisen asking for the best way to portably cast a float to an integer in C. Then keep searching forward for further occurrences of memcpy as the situation becomes surreal, with GCC maintainer Mike Stump eventually clearing things up:
>>> So what is the blessed method?
>>
>> Just memcpy it, simple sweet, fast, standard.
>
>So what you are saying is that memcpy() isn't just magic but high magic:
No. What I think I'm saying is that it is standard. See the quoted
text above. The word magic [ checking ] doesn't occur in c99-tc3.
What is defined in that standard is memcpy, it is as standard as if,
which is also defined.
>It looks like a function call but can be whatever the compiler wants it
>to be, as long as the results behave as if the data was actually copied.
>:-)
No, you fail to grasp the totality of the standard. The implementation
is free to do _anything_ it wants. The only constraint is that the user
can't figure that it deviated from the required semantics by using a standards
defined mechanism for figuring it out. We call this the as-if rule, and its
power is awesome; we could destroy the universe, and repiece it back together
one subatomic particle at a time over a billion years, and still be compliant,
if we wanted.
So there you have it: if you want to reuse the exact contents of a register as a different type in a way that you know will work, the correct approach is to use memcpy() to copy the data from one variable to another, and then hope without confirmation that GCC has optimized out the call to make it equivalent to a simple cast.
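Concretely, the blessed float-to-int bit cast the thread is talking about looks like this (a minimal sketch; it assumes float and uint32_t are the same size on your target):

    #include <stdint.h>
    #include <string.h>

    uint32_t float_bits(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);   /* GCC/Clang at -O1 and above typically compile
                                       this to a single register move, no call */
        return u;
    }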
If you want a better understanding of the GCC mindset, the rest of Mike's posts are excellent: cutting, extremely technically accurate, and (in my opinion) completely missing the point of why some people are unhappy with the direction that C is evolving.
Yes, it's now in the standard (someone in the thread I linked pointed to the TC3 draft, §6.5.2.3, footnote 82), but apparently there are issues with the way GCC implements it in its default dialect that make it unsafe. More Mike Stump:
>>> So what about using a union?
>>
>> Most people screw it up, the rules for it working are slightly odd.
>> Those rules are not standard[1], rather they are gcc specific, so in very
>
>Ouch, so what you are saying is that this is one of those (according to
>Nick M) barely defined areas of the language?
What I am saying is that it is defined to not work in the language
standard gcc implements by default. There is no barely, there is no
defined. As an extension to the language standard, gcc implements
(defines) a few things so that users can make some non-standard things
always work. I say slightly odd, as the rules are just a tad harder
than trivial.
I prefer the union approach to memcpy() because I can more easily reason about its behaviour, but veiled warnings from GCC maintainers scare me away from it. But perhaps I misunderstand the warning.
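Just so we're talking about the same thing, the union approach I mean is roughly this (GCC documents punning through a union member as working in its dialect; names are mine):

    #include <stdint.h>

    uint32_t float_bits_union(float f)
    {
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;   /* read the stored bytes of f back as a uint32_t */
    }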
For that matter, I prefer simple casts to unions, and while I agree the spec makes it undefined, I don't yet see why all common implementations couldn't simply make it work in all reasonable cases. I'm currently trying to get used to using memcpy() for type punning, but it still feels unnatural to write a function call I don't want executed (yes, I have trouble with setters/getters too).
What I'd like is to have a way to specify to the compiler that I want it to compile the code I give it as written as best as it can, rather than optimizing it out as undefined. As it is, I often resort to inline assembly if I actually want an operation to happen. It seems like there should be an intermediate approach, a hypothetical '#pragma "dwim"' that could avoid this.
I guess somebody downvoted you for the shortness of your comment. You probably know this, but repne movsb is actually a fairly slow approach to copying data compared to the various SSE assembly implementations that libc and friends have. The only thing repne movsb has going for it is how short its encoding is.
It is a bit bizarre that modern CPUs don't have an instruction for something as basic as "copy N bytes as fast as you can", and instead we're in a situation where library and compiler writers have to tune their assembly for different micro-architectures. (I'm not saying that it's a wrong decision, there are probably good reasons for doing things this way, but it's not something you'd expect.)
"rep movsb" (note: not "repne") actually is one of the fastest ways to copy memory on Ivybridge and Haswell (about as fast as vectorized copy implementations when the data is resident in L1/L2, and significantly faster than vector copy loops when the buffers are too large to be cache resident). On micro-architectures preceding Ivybridge, rep movsb is indeed slow.
It's still not quite "as fast as you can"; it's possible to beat rep movsb when buffers are small due to edging effects, but it's about as close as it's possible to come to that today.
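For anyone who wants to try this themselves, a minimal "rep movsb" copy routine (GNU C inline asm, x86-64 only; my own sketch, not code from any libc) looks something like:

    #include <stddef.h>

    void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        /* rep movsb takes destination in rdi, source in rsi, count in rcx;
           all three are updated by the instruction. */
        __asm__ volatile ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(n)
                          :
                          : "memory");
    }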
> On micro-architectures preceding Ivybridge, rep movsb is indeed slow.
Actually, rep movsb/movsw/movsd has been at the top at least since Nehalem (confirmed with benchmarks), and "fast string mode" which does cacheline-sized copies has been around since the P6; you may be able to squeeze out a few % more with SSE (or MMX), but the much larger code of the SSE-based copy functions (especially for alignment) is often not a win overall. Intel only really started advertising that they made it even faster with Ivybridge.
It was the fastest way to do memory copies on the 8086/8088, and might've been the fastest until around the time of the 486 and Pentium when it could be beaten by other techniques; but now it seems that it's coming back in favour.
It was still nowhere near as fast as a good software sequence on Nehalem or Sandybridge. The startup cost to the microcode engine was simply too high, and its handling of misalignment was very inefficient. For "small" (< 1K) unaligned buffers (which are actually a large portion of memcpy usage on a live system), good software sequences were 3-4x faster than REP MOVS.
It did perform competitively for large all-aligned copies, which are what most people tend to look at when they post "memcpy benchmarks", but those turn out to be a relatively small portion of actual usage in most workloads.
Wow, thanks for pointing this out. I just tried it out for myself, and indeed "rep movsb" is consistently (and sometimes significantly) faster than the standard C memcpy for aligned copies larger than 16KB or so on my Intel Core i5 (for unaligned copies, it seems to be on par). It is slightly slower or on par for smaller sizes. There is no noticeable difference between rep movsb and rep movsq.
Apparently, libc hasn't caught up to those micro-architecture changes yet :/
glibc is generally pretty good, but does lag commercial libc implementations somewhat when it comes to microarchitectural optimization. It's also not unheard of for Linux distros to include rather old versions of glibc. I have no idea if that's the problem in your case, but it's worth checking that you have the latest.
> You probably know this, but repne movsb is actually a fairly slow approach to copying data compared to the various SSE assembly implementations that libc and friends have.
That used to be the case. As mentioned elsewhere in the thread, modern processors have fixed the 'rep' prefix to be much faster. Under Linux, if your /proc/cpuinfo shows 'rep_good' in the flags section, then your CPU is of this newer class.
Personally, I think those two are great. One of my tasks recently was accelerating VByte encoding, and PDEP/PEXT are practically made for the task. I don't think either of them is micro-coded. Both are 3-cycle latency, but use only a single Port 1 µop. I really like the BMI/BMI2 instruction sets, and find them to be of equal or greater immediate benefit than AVX2 (which I like too).
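As a taste of why PEXT fits VByte-style formats so well, here is the (hypothetical, simplified) exactly-two-byte case: the 7 payload bits of each encoded byte are gathered with a single instruction. Needs a BMI2-capable CPU and -mbmi2; not my production code.

    #include <stdint.h>
    #include <x86intrin.h>

    uint32_t decode_two_byte_vbyte(const uint8_t *p)
    {
        uint32_t raw = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
        /* mask 0x7f7f keeps the low 7 bits of each byte and packs them contiguously */
        return _pext_u32(raw, 0x7f7fu);
    }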
I keep dreaming of a world where memcpy is an instruction to the memory controller only and doesn't hold up the CPU at all. It feels kind of silly to have this complicated piece of silicon with all its amazing processing capabilities just sit there reading RAM into registers and then writing them back, doing precisely nothing.
This is done, mostly for peripheral IO (like DACs/ADCs, disk, video or audio codecs, and even between non-shared memory for processors), but it can be (and has been) used in the main memory hierarchy too. http://en.wikipedia.org/wiki/Direct_memory_access
Some game systems did this too. For instance the Gameboy Advance had DMA, and it was used all the time because if you were eating up your cycles on copying memory, you would run out of time to process your game logic before your vertical blank ended and the system started drawing onscreen.
I also saw it used to save memory. Onscreen sprites had to be placed into a small memory-mapped address range, so instead of putting every character animation frame into this space and using it all up, I saw just one block designated for the character, and it was animated by DMAing the frames sequentially into that space.
Yeah, sure you could implement that in the memory controller, but it seems like a waste of silicon to do something you really shouldn't need to do very often anyway. And even if you did, it would saturate the memory bandwidth that the CPU would otherwise be waiting on, so you're most likely to end up with an idle CPU anyway; might as well use the silicon we already have.
Edit: Ohh, I forgot to mention that you will have to update/invalidate any resident cache lines on the CPU also, which requires communicating with the CPU.
I believe the Amiga was the first computer system to do just this, although with the massive CPU caches that we have today, I don't think it would be as useful.
The memory controller is not a great place for this in general - for instance, imagine memcpying to a 8kB region whose 2nd page has been flushed to disk. Now the memory controller needs to be able to resolve page faults...
Note that, if memory serves correctly, many vendors just alias memcpy to memmove anyways, because the performance is pretty similar and the bugs prevented are super annoying.
It's fun to benchmark memmove and memcpy on a box to see whether memcpy has more optimizations or not. On Linux x86_64 with gcc, memcpy is usually twice as fast when you're not bound by cache misses, while both are roughly the same on FreeBSD x86_64 with gcc.
It's one I wrote myself to test this very thing after reading the memmove manpage. I was wondering why you would ever use memcpy unless it had a big performance advantage, so I wrote a micro-benchmark to see. If you are really interested, I can put my benchmark program up somewhere.
Additional findings: unsurprisingly, using clang instead of gcc does not affect performance on FreeBSD or Linux.
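In the meantime, here's a rough sketch of the shape of it (not the actual program; the buffer size and iteration count are placeholders, and the empty asm statement is just there to keep the compiler from eliding the copies):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t n = 64 * 1024;                  /* small enough to stay cache-resident */
        char *src = malloc(n), *dst = malloc(n);
        struct timespec t0, t1;
        memset(src, 0xab, n);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100000; i++) {
            memcpy(dst, src, n);
            __asm__ volatile ("" : : "r"(dst) : "memory");   /* keep the copy alive */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memcpy:  %.3fs\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100000; i++) {
            memmove(dst, src, n);
            __asm__ volatile ("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memmove: %.3fs\n", elapsed(t0, t1));

        free(src);
        free(dst);
        return 0;
    }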
https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132