
Zen is already light on vector units, and microcodes 256-bit operations.

It's certainly possible to build a more lightweight core, but most of that work is reducing the complexity of the out-of-order machinery. The FPU+ALU is under a quarter of each Zen core. https://en.wikichip.org/w/images/c/cb/amd_zen_core_%28annota...



It is definitely not "microcoded" - 256-bit operations are just sent in halves to the 128-bit ALUs and combined for the final answer.

Don't get mixed up between "microcoding" and "micro-op" - the latter is something different, slower, and usually requires some kind of transition in the decoders and uop caches to start reading microcoded ops. The latter is the "normal" or "fast" mode for the CPU, and just because one instruction turns into two uops (or macro-ops or whatever AMD calls them) doesn't mean it is microcoded.


Sorry, I meant "the former is something different..."


You do want it to execute out of order, predict branches, and speculate enough to issue speculative RAM reads as soon as a possibly needed address is available, to hide RAM latency. That is important for performance, as long as rollbacks of speculatively executed operations also hide the loaded values in the cache, so the speculation leaves no side effects.

In that picture (thanks!) I see the FPU is big, the decoder is big, the branch predictor is big, and the rest is probably needed. Maybe an emulated FPU is good enough for some workloads, and maybe the ability to program in microinstructions instead of x86-64 would be useful too. But maybe silicon area is not the expensive thing (dark silicon, etc.).


Do you have a link about microcoded avx256? I would think it would be way too slow.


> 256-bit vector instructions (AVX instructions) are split into two micro-ops handling 128 bits each.

https://www.agner.org/optimize/blog/read.php?i=838#838

This is responsible for Ryzen losing to Intel in SIMD-heavy benchmarks. The upside is that it avoids the reduced turbo boost Intel applies for some 256-bit AVX instructions (and the even worse downclocks caused by AVX-512), so for workloads mixing AVX and normal instructions it shouldn't do too badly.


I see, yep - but this is still hardwired stuff happening in instruction decode, as Agner writes, not trapping to microcode sequencing.


Is there a good resource that explains the difference between those ("decode" vs. "trapping") on a modern CPU? When I see "trap" I imagine the kernel catching illegal instruction exceptions and emulating them in software, but it doesn't seem like that's what you mean?


Modern CPUs execute micro-ops: RISC-like instructions (e.g. load from a memory address to a register, add two registers, store from a register to a memory address). The CPU's "decode" stage translates x86 instructions into micro-ops. The translation is often 1-to-1, but an x86 compare followed by an x86 jump can be translated into a single micro-op, while some x86 instructions are translated into multiple micro-ops.
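To make the decode step a bit more concrete, here is a toy sketch of template-based decoding. Everything in it is made up for illustration: the instruction names, the micro-op names and the `TEMPLATES` table do not correspond to any real CPU's internal encoding. It only shows the three cases mentioned above (1-to-1, fused compare+jump, one instruction expanding to several micro-ops).

```python
# Toy model: instruction and micro-op names are hypothetical.
TEMPLATES = {
    "mov_reg_mem": ["load"],                  # 1-to-1
    "add_reg_reg": ["add"],                   # 1-to-1
    "add_mem_reg": ["load", "add", "store"],  # one instruction, three micro-ops
    "cmp": ["cmp"],
    "jne": ["jne"],
}

def decode(instructions):
    """Translate a list of instruction names into a flat list of micro-ops."""
    uops = []
    i = 0
    while i < len(instructions):
        insn = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if insn == "cmp" and nxt == "jne":
            # Fusion: a compare directly followed by a jump becomes one micro-op.
            uops.append("cmp_and_branch")
            i += 2
            continue
        # Everything else is a fixed lookup in the template table.
        uops.extend(TEMPLATES[insn])
        i += 1
    return uops

print(decode(["mov_reg_mem", "add_mem_reg", "cmp", "jne"]))
# ['load', 'load', 'add', 'store', 'cmp_and_branch']
```

The point is that this is a pure table lookup: the number of micro-ops per instruction is known up front and does not depend on the data.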

On one CPU model an x86 operation like "256-bit add" might translate into a single "256-bit add" micro-op, and on another model the same x86 operation might be translated into a series of micro-ops like "128-bit add, wait a cycle for the 1st add to finish, pass the carry bit into a 2nd 128-bit add", because that model doesn't have a real 256-bit adder. So the latency of the operation is 2 cycles, but nothing else is changed.
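A small sketch of that split, taking the description literally: one 256-bit integer add built from two 128-bit adds with a carry passed between them. The function names are hypothetical, and inputs are assumed to be non-negative integers below 2**256. Note that packed AVX adds work on independent lanes, so in that case the two halves would not need a carry at all; the carry here just follows the example as written.

```python
MASK128 = (1 << 128) - 1

def add128(a, b, carry_in=0):
    """Model of one 128-bit adder: returns (sum mod 2**128, carry_out)."""
    total = a + b + carry_in
    return total & MASK128, total >> 128

def add256_split(a, b):
    """256-bit add performed as two 128-bit micro-ops, low half first."""
    lo, carry = add128(a & MASK128, b & MASK128)   # 1st micro-op
    hi, _ = add128(a >> 128, b >> 128, carry)      # 2nd micro-op, consumes the carry
    return (hi << 128) | lo

a = (1 << 128) - 1   # low half all ones, so adding 1 carries into the high half
print(add256_split(a, 1) == (a + 1) % (1 << 256))
```

The result is bit-for-bit the same as a full-width add; only the number of micro-ops (and so the timing) differs, which is why software can't tell the two implementations apart except by measuring.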

Some x86 instructions are too complicated to be translated into a fixed-length series of micro-ops using a template. For example, the integer division, square root, or string compare machine instructions might be loops with conditionals in them and don't run the same number of micro-ops every time. Intel can implement them as a program written in micro-ops. This program is stored in a ROM on the CPU, and the decoder knows to run it when it encounters the instruction. The OS doesn't need to help here; this is not emulation or software floating point, it's just that the single instruction takes 200 clock cycles. What this does to the out-of-order engine is another story. These "programs", called microcode, can have bugs, and microcode updates, sent to the CPU at boot by the BIOS/UEFI and/or by the OS, patch them.
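As a rough illustration of the difference, here is a toy model where a simple instruction expands to a fixed template while a string compare is handed to a small "microcode routine" that emits micro-ops in a loop. The routine, the micro-op names and the `issue` function are all invented for this sketch; real microcode is not public and certainly doesn't look like this. What it does show is that the micro-op count depends on the operands.

```python
def microcode_string_compare(a, b):
    """Hypothetical microcode routine: emits micro-ops until the strings differ."""
    for x, y in zip(a, b):
        yield "load_pair"
        yield "compare"
        yield "branch_if_not_equal"
        if x != y:
            return  # data-dependent early exit

def issue(instruction, *operands):
    """Return the micro-ops issued for one instruction in this toy model."""
    if instruction == "add":
        return ["add"]  # fixed template, always one micro-op
    if instruction == "string_compare":
        # Too irregular for a template: hand over to the microcode routine.
        return list(microcode_string_compare(*operands))
    raise ValueError(f"unknown instruction: {instruction}")

print(len(issue("add")))                             # 1
print(len(issue("string_compare", "abcd", "xbcd")))  # 3
print(len(issue("string_compare", "abcd", "abxd")))  # 9
print(len(issue("string_compare", "abcd", "abcd")))  # 12
```

Same instruction, three different micro-op counts, and all of it happens inside the CPU without the OS ever seeing a trap.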

https://en.wikichip.org/wiki/macro-operation

https://en.wikichip.org/wiki/micro-operation

https://en.wikipedia.org/wiki/Microcode



