I am currently taking this class, and am happily surprised this made it here. Th...

userbinator · on Aug 31, 2018

Note that when the lower 32-bit eax portion of the 64-bit rax register is set, the upper 32-bits are unaffected.

That sounds like wishful thinking, and indeed would be expected by someone familiar with the 16-32 transition of the 386 (modifying AX doesn't change the upper 16 bits of EAX.) Instead of getting even more 32-bit registers, or even 64-bit registers accessible in 32-bit mode, AMD64 gave us a weird not-quite-fully-64-bit extension.

I've heard the "partial register stall" excuse multiple times, ostensibly valid but only if you insist on thinking in "partial registers" instead of simply more 32-bit ones as input. For example, some variants of the divide instruction use EDX:EAX (or RDX:RAX) for its input.

Tuna-Fish · on Aug 31, 2018

> I've heard the "partial register stall" excuse multiple times, ostensibly valid but only if you insist on thinking in "partial registers" instead of simply more 32-bit ones as input. For example, some variants of the divide instruction use EDX:EAX (or RDX:RAX) for its input.

That would mean you have to double the amount of state you track. The hardware cost of doing this is ~= the cost of doubling the amount of 64-bit registers. The amount of transistors used for storing register data is negligible compared to the cost of "metadata" and handling around them. Why not just have more registers then?

"Just allow us to partially update the upper halves of the registers" is the sort of thing someone who understands software but not hardware would ask. It's 99% of the cost of just having twice the registers, but not nearly as useful, and it would introduce a lot of potential performance pitfalls. (Any instruction that might update only partially now has to wait for all the previous results on the register.)

userbinator · on Aug 31, 2018

It's 99% of the cost of just having twice the registers, but not nearly as useful, and it would introduce a lot of potential performance pitfalls.

Of the times I've used Asm, there's been far more situations where an extra 32-bit register would be more useful than a 64-bit one, and having them combine automatically into high and low halves is more useful than you think. Tight loops with non-parallelisable bit/byte manipulations of this sort occur quite often in things like data compression and emulation.

Any instruction that might update only partially now has to wait for all the previous results on the register

Does it? Once again, you seem to be thinking in "partial registers" rather than "just another one" --- and I argue that this conceptual difference is very important. E.g. you can work with both AL and AH independently, then use them together as AX --- at which point, yes, the processor will need to wait for the results from both, but then it can combine them implicitly without having to waste time and space decoding and executing the instructions to do it.

simias · on Aug 31, 2018

>Of the times I've used Asm, there's been far more situations where an extra 32-bit register would be more useful than a 64-bit one

You need a 64 bit register any time you want to store a pointer though unless you want to use some kind of a segmented memory model. I don't think anybody wants to go back to that although I'm not one to criticize weird fetishes.

Clearly when you look at the fine details of AMD64 it looks like a weird frankenstein monster of an instruction set. REX prefix holding the MSBs of registers since 32bit opcodes only allowed three bits to encode 8 registers. Same prefix used to set the width of memory target operands, except that some instructions default to 32bits while others default to 64. R12 has weird encoding quircks because it matches RSP in the "low" register set and that register has specific semantics when used as a base register...

I wonder how much die area is used on a modern CPU just to deal with all this cruft and translate it into a saner RISC instruction set internally.

BeeOnRope · on Aug 31, 2018

Well the 16-bit and 8-bit registers work this way, and actual hardware shows this is far from free. We've had partial register stalls, merging uops, and other weird behavior (for example, ah today works very differently from al).

It's easy to say "think of them as separate registers" just as it's easy to say "think of them as a partial registers" - but the ISA definition is such that they have to appear as partial registers in the scenarios where it would be visible.

So sure, you could make hardware that would rename and treat them as different registers (at it has been done on some x86 versions), but then when you read a wider portion you need to combine them, which won't be free (yes it happens "implicitly" but that doesn't somehow make it much easier for the hardware).

There are also plenty of cases where you want zero-extension for functional reasons, especially in compiled code where things like casts to a larger size become free. Cleverly using both halves of a register and using the implicit combination into a full 64-bit value is much rarer that just wanting to store a 32-bit value and sometimes wanting to use it as a zero-extended 64-bit value.

bogomipz · on Sept 1, 2018

>"We've had partial register stalls, merging uops, and other weird behavior (for example, ah today works very differently from al)."

Can you elaborate on the last behavior, that " ah today works very differently from al"? How do they differ? I hadn't heard this before.

BeeOnRope · on Sept 1, 2018

You can find all the gory details at:

https://stackoverflow.com/a/45660140/149138

ant6n · on Aug 31, 2018

I think ARM64 has instructions for inserting ranges of bits from one register into another.

haberman · on Aug 31, 2018

Interesting. It seems then that we can understand the difference between the 16->32 design of IA-32 (1985) and the 32->64 design of x86-64 (2003) as being heavily influenced by the fact that out-of-order execution had become an important consideration in 2003?

moefh · on Aug 31, 2018

Another thing that could be fixed: the table in page 9 says long is 32 bits (with a footnote saying that depends on the compiler, the value shown is for gcc and g++).

That's only true when compiling for x86 (32 bits). When targeting x86-64 (the subject of the book), long is 64 bits in gcc and g++.

dsamarin · on Sept 1, 2018

Thanks, "fixing now!"

jmsmistral · on Aug 31, 2018

Will these be making their way back to the author so the text is updated? Or documented somewhere?

Cheers!

dsamarin · on Aug 31, 2018

Yes, I plan to let him know soon!

EDIT: I found out he is providing extra credit for textbook errors.

AdmiralAsshat · on Aug 31, 2018

Ask him to put the text onto GH/GL or something so that we can submit PR's with corrections. I haven't founded any code errors yet, but quite a few proofreading things I could point out to him. Parallelism with some of the section headings was the first thing that jumped out at me. e.g:

  1.3 Why Learn Assembly Language
  1.3.1 Gain a Better Understanding of Architecture Issues
  1.3.1 Understanding the Tool Chain
  1.3.1 Improves Understanding of Functions/Procedures

Should all be in the same tense.

jcranmer · on Aug 31, 2018

> For example, on page 11 it says "Note that when the lower 32-bit eax portion of the 64-bit rax register is set, the upper 32-bits are unaffected." In reality, the high order bits are zeroed to avoid a data dependency.

Well, there is one case where the upper 32-bits are not zeroed. It turns out that xor eax, eax is assigned to opcode 0x90, which is better known to most people as NOP.

If you want real fun, read up on what happens with AVX registers. Whether or not you leave untouched or zero the upper bits are dependent on if you use VEX encoding or not.

dsamarin · on Aug 31, 2018

No, in section 3.4.1.1 General-Purpose Registers in 64-Bit Mode of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, it says, "32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register."

`xor eax, eax` actually generates 0x31 0xc0, and `xor rax, rax` generates 0x48 0x31 0xc0. 0x90 decodes to xchg eax, eax in all modes except long mode, which has no effect. In long mode, the opcode 0x90 has no effect still but is no longer equal to xchg eax, eax.

CalChris · on Aug 31, 2018

Indeed xchg eax,eax is a nop idiom; there are many. Recent microarchitectures simply ignore nops without executing anything. From the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

16.2.2.6 NOP Idioms

NOP instruction is often used for padding or alignment purposes. The Goldmont and later microarchitecture has hardware support for NOP handling by marking the NOP as completed without allocating it into the reservation station. This saves execution resources and bandwidth. Retirement resource is still needed for the eliminated NOP.

BeeOnRope · on Aug 31, 2018

This nop idiom is very special however, since it isn't just about efficiency: if it wasn't a nop idiom, xchg eax, eax would not be a nop at all, it would clear the upper 32 bits, as xchg ebx, ebx does (or any other register other than eax).

msla · on Aug 31, 2018

> It turns out that xor eax, eax is assigned to opcode 0x90, which is better known to most people as NOP.

This can't be true, since xoring a register with itself zeroes that register, and zeroing a register can't possibly be a general NOP instruction.

XOR also sets flags, another thing NOPs can't do.

simcop2387 · on Aug 31, 2018

That's because it's actually xchg eax, eax not xor eax, eax.

CalChris · on Aug 31, 2018

xor eax,eax is a zero (or dependency breaking) idiom. Zero idioms are detected and removed by the renamer. So they have no execution latency.