Not every workload is memory bandwidth bound like his "make -j16" compile. Some workloads need low memory latency or fast inter-core (and inter-socket) communication (e.g. RDBMS OLTP), some need CPU throughput (e.g. HPC), and some need the best possible single-thread CPU performance (e.g. some gaming).
As he wrote, CPUs are most efficient (compute per Watt) at a specific frequency, and if his CPU mostly waits for RAM, this can be done at low power.
It's probably possible to create x86-64 CPUs with narrower backends (fewer execution units) and microcode-emulated 128- and 256-bit registers/operations (maybe even an emulated FPU) and get a cheaper and faster build server, if it were economical to fab such narrow-use-case chips (those would be good for redis/memcached too, I imagine).
> Not every workload is memory bandwidth bound like his "make -j16" compile.
He actually did `make -j32`, not 16. Which is going to absolutely devastate the cache.
`make -j<number of cores x 2>` was a good rule of thumb back when you had 1/2/4 physical CPUs with their own sockets on a motherboard and spinning rust hard disks. A lot of "compilation" time was reading the source code off the disk. But it doesn't make any sense anymore with so many cores, hyperthreading, and SSDs that serve you the file in milliseconds.
If he's bandwidth limited, he would gain a significant performance improvement by reducing the number of processes.
I'd reserve judgement until I saw measurements. Maybe all 32 jobs are using the same cpp/gcc/asm/ld binaries (or whatever the stack is these days) which never get evicted. And I presume this is running under DragonFly BSD whose design goals include aggressive SMP support, so things might be different there. I don't know.
It is definitely not "microcoded" - 256-bit operations are just sent in halves to the 128-bit ALUs and combined for the final answer.
Don't get mixed up between "microcoding" and "micro-ops" - the former is something different, slower, and usually requires some kind of transition in the decoders and uop caches to start reading microcoded ops. Micro-ops are the "normal" or "fast" mode for the CPU, and just because one instruction turns into two uops (or macro-ops or whatever AMD calls them) doesn't mean it's microcoded.
You do want it to execute out of order, branch predict, and speculate enough to issue speculative RAM reads as soon as a possibly-needed address is available, to hide RAM latency (as long as rollback of speculatively executed operations also hides the loaded values in the cache, so the speculation leaves no side effects); that is important for performance.
In that picture (thanks!) I see the FPU is big, the decoder is big, the branch predictor is big, and the rest is probably needed. Maybe an emulated FPU is good for some workloads; maybe the ability to program in microinstructions instead of x86-64 is useful too. But maybe silicon area is not the expensive thing (dark silicon, etc.).
This is responsible for Ryzen losing to Intel in SIMD-heavy benchmarks. The upside is that it avoids the reduced turbo boost Intel does for some 256-bit AVX instructions (and the even worse downclocks caused by AVX-512), so for workloads mixing AVX and normal instructions it shouldn't do too badly.
Is there a good resource that explains the difference between those ("decode" vs. "trapping") on a modern CPU? When I see "trap" I imagine the kernel catching illegal instruction exceptions and emulating them in software, but it doesn't seem like that's what you mean?
Modern CPUs execute micro ops, RISC-like instructions (e.g. load from memory address to register, add two registers, store from register to memory address). The CPU's "decode" stage translates x86 instructions into micro-ops (often 1-to-1, but x86 compare followed by x86 jump are translated into a single micro-op, while some x86 instructions are translated into multiple micro-ops).
On one CPU model an x86 operation like a "256-bit add" might translate into a single "256-bit add" micro-op, while on another model the same x86 operation might be translated into a series of micro-ops like "128-bit add, wait a cycle for the 1st add to finish, pass the carry bit into a 2nd 128-bit add", because that model doesn't have a real 256-bit adder. So the latency of the operation is 2 cycles, but nothing else is changed.
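To make that concrete, here's a minimal sketch of my own (assuming GCC or Clang with AVX2 enabled): the compiler emits a single VADDPS for the 256-bit add, and whether the core runs it as one 256-bit micro-op or as two 128-bit halves (as Zen 1 does) only shows up in latency/throughput, never in the result.

```c
/* Minimal sketch (assumed build: gcc -O2 -mavx2 vadd.c).  The compiler emits
 * one VADDPS instruction for the 256-bit add; whether the core executes it as
 * a single 256-bit micro-op or as two 128-bit micro-ops (as Zen 1 does) is
 * invisible to software except through latency and throughput. */
#include <immintrin.h>

void add8(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);      /* load 8 floats (256 bits) */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vs = _mm256_add_ps(va, vb);   /* one VADDPS instruction */
    _mm256_storeu_ps(out, vs);           /* store 8 floats */
}
```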
Some x86 instructions are very complicated and cannot be translated into a fixed-length series of micro-ops using a template. For example, integer division, square root or the string-compare machine instructions may be loops with conditionals in them and don't run the same number of micro-ops every time. They can be implemented by Intel as a program written in micro-ops. This program is stored in a ROM on the CPU (plus a small patchable area), and the decoder knows to run it when it encounters the instruction. The OS doesn't need to help here; this is not emulation or software floating point, it's just that the single instruction takes 200 clock cycles. What this does to the out-of-order engine is another story. These "programs" are called microcode, they can have bugs, and newer microcode updates, sent to the CPU at boot by the BIOS/UEFI and/or by the OS, patch them.
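If you want to see the "one instruction, many cycles, no OS involved" effect yourself, here's a rough sketch I'd try (mine, not from this thread): a dependent chain of 64-bit ADDs versus a dependent chain of 64-bit DIVs. Whether a particular core implements DIV through the microcode sequencer or in dedicated hardware varies by model, but either way it's a single instruction that just takes a lot longer.

```c
/* Rough sketch (assumed build: gcc -O2 div_latency.c).  Both loops run one
 * machine instruction per iteration in their dependent chain: an ADD in the
 * first, a 64-bit DIV in the second.  The DIV simply costs far more cycles,
 * with no OS help or software emulation involved.  __rdtsc() counts TSC
 * reference cycles, so treat the numbers as approximate. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    const uint64_t seed = 0x9e3779b97f4a7c15ULL;
    uint64_t x = seed, y = seed, d = (seed & 0xff) | 3;   /* small runtime divisor */
    enum { N = 1000000 };

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; i++) {
        x += d;                              /* one dependent ADD per iteration */
        __asm__ volatile("" : "+r"(x));      /* keep the compiler from folding the loop */
    }
    uint64_t t1 = __rdtsc();
    for (int i = 0; i < N; i++) {
        y = y / d + seed;                    /* one dependent 64-bit DIV per iteration */
        __asm__ volatile("" : "+r"(y));
    }
    uint64_t t2 = __rdtsc();

    printf("add: %.1f cycles/iter  div: %.1f cycles/iter  (%llu %llu)\n",
           (double)(t1 - t0) / N, (double)(t2 - t1) / N,
           (unsigned long long)x, (unsigned long long)y);
    return 0;
}
```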
I'm interested in performance optimization (especially under linux) and its intersection with computer architecture. Would you mind recommending me any resources to get started there?
Stilt is a magician in getting peak performance per watt out of everything. Down to tweaking individual straps for memory timing on binary firmwares for amd graphics cards.
3.6 GHz @ 65 W is impressive, almost stock speed at nearly half the TDP.
Basically a 10% performance penalty for a 45% power savings. And you might not even see that 10% performance penalty if your machine is bottlenecked on memory/storage.
Calling this "Unexpected" seems like a bit of a stretch. In particular this part:
> Of course, in the server space, we've known for a long time that maximum
> efficiency occurs with a high number of cores running at lower frequencies,
> and that efficiency trumps performance on machines with high core counts.
> But I never considered that the consumer Ryzen CPUs could also benefit from
> the same thing until now.
makes no sense. This principle applies to all CPUs, from the smallest SoCs to the largest server CPUs, so why on Earth would you not expect it to apply to desktop CPUs?
You could do the same thing with a 6 core i7 and 2133 memory. Intel CPUs have long supported an adjustable power limit to constrain operating frequency based on power consumption just like he describes for Ryzen.
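For reference, on Linux that Intel knob is exposed through the RAPL/powercap sysfs tree, so you can do the same experiment without touching the BIOS. A hedged sketch in C, assuming the usual intel-rapl:0 / constraint_0 layout on a single-socket machine (the exact paths can differ):

```c
/* Hedged sketch (assumed paths; intel-rapl:0 / constraint_0 is what a typical
 * single-socket Linux box with the intel_rapl driver exposes, but it can
 * differ).  Reads the long-term package power limit and, if a wattage is
 * given on the command line, writes a new one (needs root).
 * Build: gcc -O2 rapl_limit.c */
#include <stdio.h>
#include <stdlib.h>

#define LIMIT_PATH "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw"

int main(int argc, char **argv)
{
    FILE *f = fopen(LIMIT_PATH, "r");
    if (!f) { perror(LIMIT_PATH); return 1; }
    long long uw = 0;
    if (fscanf(f, "%lld", &uw) == 1)
        printf("current long-term package limit: %.1f W\n", uw / 1e6);
    fclose(f);

    if (argc > 1) {                          /* e.g. ./rapl_limit 65 */
        long long new_uw = (long long)(atof(argv[1]) * 1e6);
        f = fopen(LIMIT_PATH, "w");
        if (!f) { perror("open for write (root?)"); return 1; }
        fprintf(f, "%lld\n", new_uw);
        fclose(f);
        printf("requested limit: %.1f W\n", new_uw / 1e6);
    }
    return 0;
}
```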
You are confusing principle with implementation. That reducing clock speed reduces power usage, and that you can compensate with more cores, is indeed true. However, finding that option in consumer hardware has been relatively difficult. That is the surprise indicated.
Ryzen is known to be memory constrained even with much faster memory than he used. It is completely predictable that he found his CPU to be severely starved for memory bandwidth, thus enabling him to reduce operating frequency without penalty.
This is like putting an LS engine in an otherwise stock Miata and acting surprised that you can run the engine at lower RPM and still put in good lap times.
Are you kidding me? That's only true if the constraint can be removed by a hardware upgrade. Apparently latter-day Xeons are not much better at hiding memory latencies than Zen, and they no longer outrun it on other operations the way they outran Bulldozer, which is what made the latencies irrelevant.
In other words, he's reached peak CPU. As in, a faster unit will not speed it up, and more cores can only do so to a point. Amdahl's law (power-efficiency variant) and the memory controllers say hello.
Note that the author is not claiming that he compensated with more cores. He is claiming that the performance is roughly the same, at the same core count (8), regardless of frequency.
Well I suppose the part that might not apply to desktop parts is "efficiency trumps performance". Many desktop uses don't care about efficiency, but care about performance.
For servers, the purchasing decisions are probably much more quantitative, and if you are buying a high core count machine it probably means you have a parallelizable workload, so efficiency comes into play since you have a lot of choice on the frequency/core-count spectrum, traded off against power, money, space, etc.
More specifically, the interprocessor interconnect (Infinity Fabric) Ryzen uses is tied to the RAM clock. Ryzen clumps its cores in groups of 4 (a CCX) and uses Infinity Fabric as the interconnect between those, so I am not sure you will see an effect larger than Intel's on a quad-core Ryzen.
I think the claim that parallel compilation with gcc is memory bandwidth bound is unlikely. gcc is known to be a very pointer-chasy, branch-mispredicty load that is highly sensitive to memory latency - far from a streaming load that is sensitive to raw bandwidth.
Still, the conclusion holds: if most of the time is spent waiting for values to come back from memory, a higher core frequency has strongly diminishing returns.
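A quick way to see the latency-vs-bandwidth distinction is a toy benchmark like the sketch below (mine, not the article's): a random pointer chase over a buffer much larger than the last-level cache is latency bound because each load depends on the previous one, while a linear sum over the same buffer is bandwidth bound because the prefetchers and the out-of-order engine can keep many loads in flight.

```c
/* Rough sketch (assumed build: gcc -O2 memlat.c).  The pointer chase issues one
 * dependent, cache-missing load at a time, so it is bounded by memory latency;
 * the linear sum lets the prefetchers and out-of-order engine keep many loads
 * in flight, so it is bounded by memory bandwidth. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 25)   /* 32M entries * 8 bytes = 256 MiB, far larger than the LLC */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    uint64_t *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;

    /* Sattolo's algorithm: a random permutation that is a single N-long cycle,
     * so the chase visits every entry and never falls into a short, cached loop. */
    for (uint64_t i = 0; i < N; i++) a[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t j = (uint64_t)rand() % i;
        uint64_t t = a[i]; a[i] = a[j]; a[j] = t;
    }

    double t0 = now();
    uint64_t p = 0;
    for (uint64_t i = 0; i < N; i++) p = a[p];       /* latency-bound: serial loads */
    double t1 = now();
    uint64_t sum = 0;
    for (uint64_t i = 0; i < N; i++) sum += a[i];    /* bandwidth-bound: streaming */
    double t2 = now();

    printf("chase: %.1f ns/load   scan: %.2f GB/s   (%llu %llu)\n",
           (t1 - t0) / N * 1e9,
           (double)N * sizeof *a / (t2 - t1) / 1e9,
           (unsigned long long)p, (unsigned long long)sum);
    free(a);
    return 0;
}
```

The first number is dominated by DRAM latency and the second by what the memory controller can stream, and gcc's access pattern looks far more like the first.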
That's only true if you only compile a single file at once which is an exceedingly rare use case for a build server. As soon as you compile files in parallel the CPU can simply switch to the next hardware thread during a memory load from main memory. Then there is the fact that dual channel DDR4 just doesn't provide a lot of memory bandwidth in the first place. A 16 core/32 thread desktop CPU is probably not going to happen on the AM4/Ryzen platform even if everything suddenly supports multi-threading on 16 cores simply because the memory bandwidth isn't enough to translate into meaningful performance increases. GPUs have horrendous memory latencies but they perform well precisely because they can just switch to the next thread and execute that one while waiting.
Well you are mixing the effect of "more cores" and SMT together here. Sure, SMT helps hide some latency effects, but it doesn't significantly increase the demand for bandwidth. The increased bandwidth requirements when introducing SMT are probably approximately modeled by the increase in performance: so a 30% uplift from running two hardware threads per core means that bandwidth requirement increases by about 30%.
That's not enough to turn gcc from a largely latency bound load to a memory bandwidth hog!
Ryzen only has two threads per core, so one would be able to see at most a 2x gain. That's not insignificant, but still far from what one needs to start seeing bandwidth problems.
Unless the AMD design is unusual it is not very close to zero return: a significant part of the "path to memory" involves things run at the core clock, in particular everything from the core to the L2 and probably some part of the coordination logic which communicates with the "uncore". I'm not sure about AMD chips, but on some chips there is a relationship between the uncore speed and the core speed: e.g., the uncore speed might often be the same as the maximum core speed for any core on the socket.
Adding to that, there are other effects that allow core frequency to leak into the performance of memory-bound programs, such as a higher frequency allowing the core to run ahead more quickly and get more memory requests in flight, recover more quickly after a branch misprediction, etc. Try it sometime: find something which is really memory bound and crank the frequency way down; there will probably be a significant effect, but not nearly in proportion to the frequency difference.
Short form and generalized: when one subsystem of a larger system is not your bottleneck, it's often possible to lower the resources for that part of the system without impacting overall performance.
Here's the same author from pretty recently comparing the 2990WX to some Xeons (he says E5-2620 but doesn't mention which version — could be anything from Sandybridge to Broadwell):