Hacker News new | past | comments | ask | show | jobs | submit login

The fact that some subproblems people want to solve are NP complete is mostly irrelevant, and has been for many many years. Nobody cares about optimal copy coalescing, only "good enough". There are studies about optimal vs non-optimal, and the answer is basically "you could eliminate 20% more copies". That was against the best optimistic coalescers of the time, which have already been superseded by PBQP and other approaches. Also remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime.

As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.

Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.

For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.

At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.




remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime

It sounds like you may be referring to MOV elimination by register renaming, which should make the extra moves no more costly than NOP's. I read that Intel post-Ivy Bridge does this, but haven't been able to find any real documentation. Do you know if this is something one can now rely on, or what the limits of this are (number per cycle, size differences, latency)?


Answering myself: Yes, this is documented and can be depended on for Ivy Bridge onward.

3.5.1.13 Zero-Latency MOV Instructions

In processors based on Intel microarchitecture code named Ivy Bridge, a subset of register-to-register move operations are executed in the front end (similar to zeroidioms, see Section 3.5.1.8). This conserves scheduling/execution resources in the out-of-order engine. Most forms of register-to-register MOV instructions can benefit from zero-latency MOV. Example 3-23 list the details of those forms that qualify and a small set that do not.

Example 3-23. Zero-Latency MOV Instructions

MOV instructions latency that can be eliminated

  MOV reg32, reg32
  MOV reg64, reg64
  MOVUPD/MOVAPD xmm, xmm
  MOVUPD/MOVAPD ymm, ymm
  MOVUPS?MOVAPS xmm, xmm
  MOVUPS/MOVAPS ymm, ymm
  MOVDQA/MOVDQU xmm, xmm
  MOVDQA/MOVDQU ymm, ymm
  MOVZX reg32, reg8 (if not AH/BH/CH/DH)
  MOVZX reg64, reg8 (if not AH/BH/CH/DH)
MOV instructions latency that cannot be eliminated

  MOV reg8, reg8
  MOV reg16, reg16
  MOVZX reg32, reg8 (if AH/BH/CH/DH)
  MOVZX reg64, reg8 (if AH/BH/CH/DH)
  MOVSX

http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: