I think he's probably right that there's no reason not to alias memcpy to memmove and make the latter smart enough to do the fastest possible thing given its input. Every implementation of memcpy and memmove that I've seen has been sufficiently complicated in order to optimize for big copies that an added comparison or two for bounds checking would seem a drop in the bucket.
But I'm open to practical examples where the performance difference would become significant.
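Roughly the kind of dispatch I have in mind (a minimal sketch with hypothetical names; real implementations add all the big-copy fast paths on top of this):

    #include <stddef.h>

    void *my_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        /* The raw pointer comparison is fine inside a libc, even if
           pedantically undefined in portable C. */
        if (d < s || d >= s + n) {
            while (n--) *d++ = *s++;        /* no harmful overlap: copy forward */
        } else {
            d += n; s += n;
            while (n--) *--d = *--s;        /* dst overlaps the tail of src: copy backward */
        }
        return dst;
    }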
> Replace "hot" bcopy() calls in ether_output() with memcpy(). This tells the compiler that source and destination are not overlapping, allowing for more aggressive optimization, leading to a significant performance improvement on busy firewalls.
Replying for masklinn, perhaps what they had in mind is the kind of optimization that memcpy's no-overlap guarantee permits but that strict aliasing rules alone do not. Consider:
    int *p, *q, x;
    ...
    memmove(p, q, 50 * sizeof(int));
    p[2] = 3;
    q[3] = 4;
    x = p[2];
In the example above, it is illegal (with only the information available) to optimize the last assignment to x = 3. Strict aliasing does not apply because everything is int.
If the call to memmove were a call to memcpy instead, a sufficiently creative compiler author could argue that optimizing the last statement to x = 3 is legal, because the call to memcpy asserts (among other things) that the int pointed to by p+2 does not share any storage with the int pointed to by q+3.
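For concreteness, here is the same fragment as a complete function (hypothetical name; callers are assumed to pass non-overlapping arrays of at least 50 ints, which is exactly what memcpy requires):

    #include <string.h>

    int demo(int *p, int *q)
    {
        memcpy(p, q, 50 * sizeof(int));
        p[2] = 3;
        q[3] = 4;
        return p[2];   /* may legally be folded to "return 3", since the memcpy
                          contract says p[2] and q[3] cannot be the same object */
    }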
That's exactly the kind of optimisation that I had in mind. What I'm saying is that the static analysis would work the same even if memcpy was implemented by jumping to memmove or some linker trick to make it resolve to memmove.
The issue that caused Linus to argue memcpy should just act like memmove involved old binaries breaking with new kernels, so the compiler's behaviour wasn't an issue there.
It was a bit more than that. He implied that the performance differences were trivial, so it wasn't worth the trouble.
I think, unfortunately, Linus' argument amounts to demanding that APIs conform to how they are used rather than to how they are specified, which leads to O_PONIES-type problems.
Perhaps in debug builds, memcpy could wrap memmove but assert() that the regions are non-overlapping? Would avoid code unintentionally relying on memcpy supporting overlap.
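Something like this, say (a minimal sketch, with a hypothetical name; the overlap test itself is the whole point):

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    void *debug_memcpy(void *dst, const void *src, size_t n)
    {
        uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
        /* [s, s+n) and [d, d+n) must not overlap for memcpy to be legal. */
        assert(d + n <= s || s + n <= d);
        return memmove(dst, src, n);
    }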
My best guess is to properly align instructions along a certain boundary, but I'm pulling that out of my hat.
You can put bcopy as an entry point right before memmove, and then you don't need one function to call the other, which would cost a call and a stack push. Maybe instructions need to be at addresses ≡ 0 mod 16, so that's the "closest" you can get it. And falling through ~12 NOPs might be faster than a jump that advances the PC by ~9 bytes.
Yes, there's been a recommendation floating around for a long time that branch/call targets should be 16-byte aligned. However, my experience is that modern x86 is almost completely insensitive to alignment in general (there are some small exceptions, like avoiding crossing cacheline boundaries in a loop). It wouldn't surprise me if the NOPs only helped on earlier models, had no effect or even a negative effect on more recent CPUs, and were just left in out of tradition.
Depends on what you consider "modern". Core2 timings changed wildly based on code alignment (experienced myself and documented by others here: http://x264dev.multimedia.cx/archives/51).
Yeah, looks like it's for function alignment padding. It's a pretty common thing at the end of a function to have the next function start on a specific boundary (even when the first function doesn't fall through into the next one).
I haven't tested, but I'd bet good money that 12 NOPs would be faster than a jmp.
You can do an unconditional jump every 1 or 2 cycles, depending on the chip, whereas no chip I know of can execute more than 4 nops per cycle. Therefore I would say the jump is probably marginally faster than 12 nops.
Smart toolchains will turn those 12 bytes into 2 multi-byte nops, e.g., a 9-byte one and a 3-byte one.
Loops can be implemented with JMPs, and it would be Very Bad if every iteration of a loop invalidated caches. (In fact, it would be Very Bad if just about anything common invalidated caches, given how important they are to modern CPU performance.)
I don't know what exactly you mean by that, but I'm going with no. Unconditional jumps do interact with the uops cache in recent Intel chips, but they do so by terminating the current uops cache line---which is generally desirable.
I believe that on several modern compilers, memcpy and memset are practically (if not literally) treated as intrinsics. As in, the compiler has been granted semantic understanding of those specific functions and can generate very well optimized assembly based on that knowledge. I haven't heard about the same being done for memmove.
They have got so good at recognising hand-written memcpy implementations and replacing them with builtins that it is almost becoming impossible to actually write a libc any more, as the libc's own memcpy implementation gets replaced with a call to memcpy [1].
What compilers are you thinking about? I've never seen GCC inlining and specializing a call to memcpy, it just calls the generic (but optimized) implementation in the standard library.
Yeah, you were right. I've checked, and it seems the GCC version I'm using inlines memcpy calls if it can statically determine the size argument and the size is less than or equal to 64 bytes.
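In case anyone wants to check their own toolchain, a small experiment along these lines shows the effect (hypothetical function names; the 64-byte threshold and the exact code generated vary with GCC version, target, and optimization level):

    #include <string.h>

    void copy16(void *dst, const void *src)
    {
        memcpy(dst, src, 16);   /* constant size: often expanded inline to a couple of moves */
    }

    void copyn(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);    /* runtime size: typically stays a call into libc */
    }

    /* Compare the output of "gcc -O2 -S" for the two functions and look for
       an actual call to memcpy. */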
I take it that you are a long-time C programmer who hasn't realized that the language is no longer the one you grew up loving; that casting a float to an int to do some bit twiddling is now undefined behavior of the sort that technically entitles the compiler to have its way with your root filesystem. Although usually it will just let you off with a warning and a gaping hole where your null-checking code used to be.
Believe it or not, memcpy() is now the only portable and properly defined way to cast one type to another in C. You are expected to use it with the explicit intent that no bytes actually be copied, simply as a (completely voluntary) offering to the type gods. If they accept your offering, the call will just disappear from your code, and never be made. If you fail to make this offering, they feel entitled to excise an equal quantity of other code to punish you.
Sure, you the programmer happen to know that bits are bits, and that on your system they are already in the register you want to use, but the compiler stopped playing that game years ago. If you wanted to work close to the metal, you should have chosen a language more appropriate for the task. There's a great discussion of the issues in this epic comp.arch thread: http://compgroups.net/comp.arch/if-it-were-easy/2993157
Search for the first occurrence of 'memcpy', where you'll find a polite but beleaguered Terje Mathisen asking for the best way to portably cast a float to an integer in C. Then keep searching forward for further occurrences of memcpy as the situation becomes surreal, with GCC maintainer Mike Stump eventually clearing things up:
>>> So what is the blessed method?
>>
>> Just memcpy it, simple sweet, fast, standard.
>
>So what you are saying is that memcpy() isn't just magic but high magic:
No. What I think I'm saying is that it is standard. See the quoted
text above. The word magic [ checking ] doesn't occur in c99-tc3.
What is defined in that standard is memcpy, it is as standard as if,
which is also defined.
>It looks like a function call but can be whatever the compiler wants it
>to be, as long as the results behave as if the data was actually copied.
>:-)
No, you fail to grasp the totality of the standard. The implementation
is free to do _anything_ it wants. The only constraint is that the user
can't figure that it deviated from the required semantics by using a standards
defined mechanism for figuring it out. We call this the as-if rule, and its
power is awesome; we could destroy the universe, and repiece it back together
one subatomic particle at a time over a billion years, and still be compliant,
if we wanted.
So there you have it: if you want to reuse the exact contents of a register as a different type in a way that you know will work, the correct approach is to use memcpy() to copy the data from one variable to another, and then hope without confirmation that GCC has optimized out the call to make it equivalent to a simple cast.
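Concretely, the blessed float-to-int bit cast the thread is talking about looks like this (a minimal sketch; it assumes float and uint32_t are the same size on your target):

    #include <stdint.h>
    #include <string.h>

    uint32_t float_bits(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);   /* GCC/Clang at -O1 and above typically compile
                                       this to a single register move, no call */
        return u;
    }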
If you want a better understanding of the GCC mindset, the rest of Mike's posts are excellent: cutting, extremely technically accurate, and (in my opinion) completely missing the point of why some people are unhappy with the direction that C is evolving.
Yes, it's now in the standard (someone in the thread I linked pointed to the TC3 draft, §6.5.2.3, footnote 82), but apparently there are issues with the way GCC implements it in its default dialect that make it unsafe. More Mike Stump:
>>> So what about using a union?
>>
>> Most people screw it up, the rules for it working are slightly odd.
>> Those rules are not standard[1], rather they are gcc specific, so in very
>
>Ouch, so what you are saying is that this is one of those (according to
>Nick M) barely defined areas of the language?
What I am saying is that it is defined to not work in the language
standard gcc implements by default. There is no barely, there is no
defined. As an extension to the language standard, gcc implements
(defines) a few things so that users can make some non-standard things
always work. I say slightly odd, as the rules are just a tad harder
than trivial.
I prefer the union approach to memcpy() because I can more easily reason about its behaviour, but veiled warnings from GCC maintainers scare me away from it. But perhaps I misunderstand the warning.
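Just so we're talking about the same thing, the union approach I mean is roughly this (GCC documents punning through a union member as working in its dialect; names are mine):

    #include <stdint.h>

    uint32_t float_bits_union(float f)
    {
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;   /* read the stored bytes of f back as a uint32_t */
    }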
For that matter, I prefer simple casts to unions, and while I agree the spec makes it undefined, I don't yet see why all common implementations couldn't simply make it work in all reasonable cases. I'm currently trying to get used to using memcpy() for type punning, but it still feels unnatural to write a function call I don't want executed (yes, I have trouble with setters/getters too).
What I'd like is to have a way to specify to the compiler that I want it to compile the code I give it as written as best as it can, rather than optimizing it out as undefined. As it is, I often resort to inline assembly if I actually want an operation to happen. It seems like there should be an intermediate approach, a hypothetical '#pragma "dwim"' that could avoid this.
I guess somebody downvoted you for the shortness of your comment. You probably know this, but repne movsb is actually a fairly slow approach to copying data compared to the various SSE assembly implementations that libc and friends have. The only thing repne movsb has going for it is how short its encoding is.
It is a bit bizarre that modern CPUs don't have an instruction for something as basic as "copy N bytes as fast as you can", and instead we're in a situation where library and compiler writers have to tune their assembly for different micro-architectures. (I'm not saying that it's a wrong decision, there are probably good reasons for doing things this way, but it's not something you'd expect.)
"rep movsb" (note: not "repne") actually is one of the fastest ways to copy memory on Ivybridge and Haswell (about as fast as vectorized copy implementations when the data is resident in L1/L2, and significantly faster than vector copy loops when the buffers are too large to be cache resident). On micro-architectures preceding Ivybridge, rep movsb is indeed slow.
It's still not quite "as fast as you can"; it's possible to beat rep movsb when buffers are small due to edging effects, but it's about as close as it's possible to come to that today.
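For anyone who wants to try this themselves, a minimal "rep movsb" copy routine (GNU C inline asm, x86-64 only; my own sketch, not code from any libc) looks something like:

    #include <stddef.h>

    void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        /* rep movsb takes destination in rdi, source in rsi, count in rcx;
           all three are updated by the instruction. */
        __asm__ volatile ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(n)
                          :
                          : "memory");
    }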
> On micro-architectures preceding Ivybridge, rep movsb is indeed slow.
Actually, rep movsb/movsw/movsd has been at the top at least since Nehalem (confirmed with benchmarks), and "fast string mode" which does cacheline-sized copies has been around since the P6; you may be able to squeeze out a few % more with SSE (or MMX), but the much larger code of the SSE-based copy functions (especially for alignment) is often not a win overall. Intel only really started advertising that they made it even faster with Ivybridge.
It was the fastest way to do memory copies on the 8086/8088, and might've been the fastest until around the time of the 486 and Pentium when it could be beaten by other techniques; but now it seems that it's coming back in favour.
It was still nowhere near as fast as a good software sequence on Nehalem or Sandybridge. The startup cost to the microcode engine was simply too high, and its handling of misalignment was very inefficient. For "small" (< 1K) unaligned buffers (which are actually a large portion of memcpy usage on a live system), good software sequences were 3-4x faster than REP MOVS.
It did perform competitively for large all-aligned copies, which are what most people tend to look at when they post "memcpy benchmarks", but those turn out to be a relatively small portion of actual usage in most workloads.
Wow, thanks for pointing this out. I just tried it out for myself, and indeed "rep movsb" is consistently (and sometimes significantly) faster than the standard C memcpy for aligned copies larger than 16KB or so on my Intel Core i5 (for unaligned copies, it seems to be on par). It is slightly slower or on par for smaller sizes. There is no noticeable difference between rep movsb and rep movsq.
Apparently, libc hasn't caught up to those micro-architecture changes yet :/
glibc is generally pretty good, but does lag commercial libc implementations somewhat when it comes to microarchitectural optimization. It's also not unheard of for Linux distros to include rather old versions of glibc. I have no idea if that's the problem in your case, but it's worth checking that you have the latest.
> You probably know this, but repne movsb is actually a fairly slow approach to copying data compared to the various SSE assembly implementations that libc and friends have.
That used to be the case. As mentioned elsewhere in the thread, modern processors have fixed the 'rep' prefix to be much faster. Under Linux, if your /proc/cpuinfo shows 'rep_good' in the flags section, then your CPU is of this newer class.
Personally, I think those two are great. One of my tasks recently was accelerating VByte encoding, and PDEP/PEXT are practically made for the task. I don't think either of them is micro-coded. Both are 3-cycle latency, but use only a single Port 1 µop. I really like the BMI/BMI2 instruction sets, and find them to be of equal or greater immediate benefit than AVX2 (which I like too).
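As a taste of why PEXT fits VByte-style formats so well, here is the (hypothetical, simplified) exactly-two-byte case: the 7 payload bits of each encoded byte are gathered with a single instruction. Needs a BMI2-capable CPU and -mbmi2; not my production code.

    #include <stdint.h>
    #include <x86intrin.h>

    uint32_t decode_two_byte_vbyte(const uint8_t *p)
    {
        uint32_t raw = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
        /* mask 0x7f7f keeps the low 7 bits of each byte and packs them contiguously */
        return _pext_u32(raw, 0x7f7fu);
    }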
I keep dreaming of a world where memcpy is an instruction to the memory controller only and doesn't hold up the CPU at all. It feels kind of silly to have this complicated piece of silicon with all its amazing processing capabilities just sit there reading RAM into registers and then writing them back, doing precisely nothing.
This is done, mostly for peripheral IO (like DACs/ADCs, disk, video or audio codecs, and even between non-shared memory for processors), but it can be (and has been) used in the main memory hierarchy too. http://en.wikipedia.org/wiki/Direct_memory_access
Some game systems did this too. For instance the Gameboy Advance had DMA, and it was used all the time because if you were eating up your cycles on copying memory, you would run out of time to process your game logic before your vertical blank ended and the system started drawing onscreen.
I also saw it used to save memory. Onscreen sprites had to be placed into a small memory-mapped address range, so instead of putting every character animation frame into this space and using it all up, I saw just one block designated for the character, and it was animated by DMAing the frames sequentially into that space.
Yeah, sure you could implement that in the memory controller, but it seems like a waste of silicon to do something you really shouldn't need to do very often anyway. And even if you did, it would saturate the memory bandwidth that the CPU would otherwise be waiting on, so you're most likely to end up with an idle CPU anyway; might as well use the silicon we already have.
Edit: Ohh, I forgot to mention that you will have to update/invalidate any resident cache lines on the CPU also, which requires communicating with the CPU.
I believe the Amiga was the first computer system to do just this, although with the massive CPU caches that we have today, I don't think it would be as useful.
The memory controller is not a great place for this in general - for instance, imagine memcpying to a 8kB region whose 2nd page has been flushed to disk. Now the memory controller needs to be able to resolve page faults...
Note that, if memory serves correctly, many vendors just alias memcpy to memmove anyways, because the performance is pretty similar and the bugs prevented are super annoying.
It's fun to benchmark memmove and memcpy on a box to see whether memcpy has more optimizations or not. On Linux x86_64 with gcc, memcpy is usually twice as fast when you're not bound by cache misses, while both are roughly the same on FreeBSD x86_64 with gcc.
It's one I wrote myself to test this very thing after reading the memmove manpage. I was wondering why you would ever use memcpy unless it had a big performance advantage, so I wrote a micro-benchmark to see. If you are really interested, I can put my benchmark program up somewhere.
Additional findings: unsurprisingly, using clang instead of gcc does not affect performance on FreeBSD or Linux.
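In the meantime, here's a rough sketch of the shape of it (not the actual program; the buffer size and iteration count are placeholders, and the empty asm statement is just there to keep the compiler from eliding the copies):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t n = 64 * 1024;                  /* small enough to stay cache-resident */
        char *src = malloc(n), *dst = malloc(n);
        struct timespec t0, t1;
        memset(src, 0xab, n);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100000; i++) {
            memcpy(dst, src, n);
            __asm__ volatile ("" : : "r"(dst) : "memory");   /* keep the copy alive */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memcpy:  %.3fs\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100000; i++) {
            memmove(dst, src, n);
            __asm__ volatile ("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("memmove: %.3fs\n", elapsed(t0, t1));

        free(src);
        free(dst);
        return 0;
    }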
https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132