also bear in mind that AMD's standards for these "challenges" have always involved some "funny math". Take their previous 25x20 goal: they considered a 5.02x average performance gain (10x in CB R15, 2.5x in 3DMark 11) at iso-power (same TDP) to be a "32x efficiency gain", because they divided it by idle power or some shit like that.
But a 5x average performance gain at the same TDP doesn't mean you're doing 32x as much computation for the same amount of power. Except in AMD marketing world. But it sounds good!
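If you back out their methodology (my reconstruction, so treat the ~6.3x "typical energy" drop as a back-solved assumption, not an AMD-published figure), the arithmetic looks roughly like:

    import math

    # Benchmark gains cited above; the geometric mean is how you get ~5x overall.
    perf_gain = math.sqrt(10.0 * 2.5)   # CB R15 x 3DMark 11 -> ~5x (AMD quoted 5.02x)

    # AMD then divides by an idle-heavy "typical use" energy figure instead of
    # the iso TDP. The ~6.3x is back-solved from their headline, not published.
    typical_energy_drop = 6.3
    print(f"{perf_gain:.1f}x perf -> {perf_gain * typical_energy_drop:.1f}x 'efficiency'")
    # 5.0x perf -> 31.5x 'efficiency', i.e. the "32x"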
Like, even bearing in mind that that's coming from a Bulldozer derivative on GF 32nm (which is probably closer to Intel 40nm), a 5x gain in actual computational efficiency is still a lot, and the gain is even bigger in pure CPU workloads, but AMD marketing can't help but stretch the truth with these "challenges".
https://www.anandtech.com/show/15881/amd-succeeds-in-its-25x...
To be fair, idle power is really important for a lot of use cases.
In a compute-focused cloud environment you might be able to keep most of your hardware pegged most of the time, but outside of that, CPUs spend most of their time either far below 100% utilization or totally idle.
To actually calculate real efficiency gains, though, you'd probably have to measure power usage under various scenarios, not just whatever weird math they did here.
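Something like this, say (every number below is invented, just to show the shape of the measurement):

    # Compare daily energy over a representative duty cycle instead of using a
    # single synthetic ratio. Tuples: (hours/day, old_chip_watts, new_chip_watts)
    scenarios = [
        (18, 30.0, 5.0),     # idle
        (4,  45.0, 15.0),    # light load
        (2,  95.0, 95.0),    # full load at iso-TDP
    ]
    old_wh = sum(h * old for h, old, new in scenarios)   # 910 Wh/day
    new_wh = sum(h * new for h, old, new in scenarios)   # 340 Wh/day
    perf_gain = 5.0   # new chip does 5x the work during its loaded hours
    print(f"{perf_gain * old_wh / new_wh:.1f}x work per joule")   # ~13.4x

And note the answer moves with whatever duty cycle you pick, which is exactly why a single headline multiplier is suspect.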
That's not really being fair, because the metric is presented to look like traditional perf/watt. And idle power is not so important in supercomputers and cloud compute nodes, which get optimized to keep them busy at all costs. But even in cases where it is important, averaging between the two might be reasonable; multiplying the loaded efficiency by the idle efficiency increase is ludicrous. A meaningless unit.
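To put a number on how meaningless (toy model, all figures assumed): the multiplied figure only falls out in the degenerate limit where essentially all of the machine's energy goes to idling.

    # Fix the loaded perf gain at 5x (iso-TDP) and the idle power drop at 6.3x,
    # then sweep what fraction of the old chip's energy was spent idle.
    perf_gain, idle_drop = 5.0, 6.3
    for idle_frac in (0.0, 0.5, 0.9, 0.99):
        new_energy = (1.0 - idle_frac) + idle_frac / idle_drop  # old energy = 1
        print(f"{idle_frac:.0%} idle energy -> {perf_gain / new_energy:.1f}x")
    # 0% -> 5.0x, 50% -> 8.6x, 90% -> 20.6x, 99% -> 29.9x; the full
    # 5 * 6.3 ~= 31.5x needs a box that effectively never does any work.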
I can't see any possible charitable explanation for this stupidity. MBAs and marketing department run amok.
Yep, 100% agree with you - see my last sentence. Just trying to clarify that the issue here isn't that idle power consumption is unimportant; it's the nonsense math.
Wow, that's stupid; I didn't look that closely. So it's really a 5x perf/watt improvement. I assume it will be the same deal here: around a 5-6x perf/watt improvement, which does make more sense. FP16 should already be pretty well optimized on GPUs today, so 30x would be a huge stretch, or else require specific fixed-function units.
It's an odd coincidence (there's no reason the numbers would be related; there's no idle-power factor here or anything), but 5x also happens to be about the expected gain from NVIDIA's tensor core implementation in real-world code, AFAIK. Sure, they advertise a much higher number, but that's a microbenchmark looking at just that specific bit of the code, not the program as a whole.
It's possible that the implication here is similar: AMD ships a tensor accelerator or something, hits "30x" in the microbenchmark, but you end up with speedups similar to NVIDIA's tensor accelerator implementation in practice.
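A quick Amdahl's-law sanity check shows how an advertised microbenchmark number deflates in a whole program (the runtime fractions and unit speedups here are my assumptions, not NVIDIA's figures):

    # Speed up the fraction f of runtime that is matmul by s, and the program
    # as a whole only gains 1 / ((1 - f) + f / s).
    def overall_speedup(f: float, s: float) -> float:
        return 1.0 / ((1.0 - f) + f / s)

    print(f"{overall_speedup(0.85, 8.0):.1f}x")    # "8x" unit, 85% of runtime -> ~3.9x
    print(f"{overall_speedup(0.90, 16.0):.1f}x")   # "16x" unit, 90% of runtime -> ~6.4x

Either way you land in the same ~4-6x ballpark for the end-to-end run.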
I've seen tensor cores really shine in... tensor operations. If your workload can be expressed as convolutions, and it matches the dimension and batching requirements of the tensor cores, there's a world of wild performance out there...
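For a concrete taste of the dimension caveat (the multiple-of-8 rule is the commonly cited requirement for FP16 tensor-core GEMM paths, e.g. in cuBLAS; the helper itself is hypothetical):

    # Hypothetical helper: round a layer dimension up so an FP16 GEMM can hit
    # the tensor-core path (M, N, K as multiples of 8 is the usual rule of thumb).
    def pad_to_multiple(dim: int, multiple: int = 8) -> int:
        return ((dim + multiple - 1) // multiple) * multiple

    print(pad_to_multiple(1024))   # already aligned -> 1024
    print(pad_to_multiple(1003))   # 1003 -> 1008; unpadded, it falls off the fast path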