The fact that some subproblems people want to solve are NP complete is mostly irrelevant, and has been for many many years. Nobody cares about optimal copy coalescing, only "good enough". There are studies about optimal vs non-optimal, and the answer is basically "you could eliminate 20% more copies". That was against the best optimistic coalescers of the time, which have already been superseded by PBQP and other approaches.
Also remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime.
As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.
Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.
For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.
At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.
remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime
It sounds like you may be referring to MOV elimination by register renaming, which should make the extra moves no more costly than NOP's. I read that Intel post-Ivy Bridge does this, but haven't been able to find any real documentation. Do you know if this is something one can now rely on, or what the limits of this are (number per cycle, size differences, latency)?
Answering myself: Yes, this is documented and can be depended on for Ivy Bridge onward.
3.5.1.13 Zero-Latency MOV Instructions
In processors based on Intel microarchitecture code named Ivy Bridge, a subset of
register-to-register move operations are executed in the front end (similar to zeroidioms, see Section 3.5.1.8). This conserves scheduling/execution resources in the
out-of-order engine. Most forms of register-to-register MOV instructions can benefit
from zero-latency MOV. Example 3-23 list the details of those forms that qualify and
a small set that do not.
As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.
Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.
For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.
At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.