The original post has been updated as more facts came along. It verified that the bug was reproducible on other machines. And then it said:
Update 3/16/2017:
As much as I had least expected this to be the case, this appears to have been confirmed as an errata in the AMD Zen processor.
And then it goes on to say:
Fortunately, it's one that is fixable with a microcode update and will not result in something catastrophic like a recall or the disabling of features.
Basically, this is an awesome bug report. HN should link to it rather than the techpowerup article.
ah! the author of that hwbot.org post is Alex Yee; I remember his name as he wrote that insanely upvoted (and awesome) stackoverflow answer on branch misprediction: http://stackoverflow.com/a/11227902
Can concur - multiple comments in the linked article show confusion as to why this is even a problem, "because the benchmark program in question has CPU-series-specific binaries, and it didn't release one for Ryzen yet" (paraphrasing).
Btw everybody, if you notice something like this or see a comment by someone who has, and you have a minute to drop us a note at hn@ycombinator.com, it's a real community service if you do so. Then we can change the URL (or title or what have you) sooner than 19 hours in—or rather, sooner than never, since we only found out about this one by a user telling us. We can't come close to reading all the comments but we do see all the emails and (usually) act on the time-sensitive ones quickly even if we can't reply till a bit later.
It's crazy how today we can just release microcode updates to patch the CPU, compared to the original Pentium bug, which required the operating system to be patched.
As you can see in the erratum workarounds, most of those bugs were not fixed by microcode updates, but by BIOS updates (mostly changing voltages or timings), and some require OS and compiler updates.
I'm presuming it's just a BIOS update to disable certain optimizations within the CPU... meaning they were already in place to be A/B tested, just flipping a bit on or off.
I don't think this is true. I'm not an expert on application-class CPUs, but I think they actually have ROMs that control the state machines that execute the instructions. This is similar to how regular instructions are stored in memory.
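For a rough sense of what that looks like, here's a toy microcode definition in a made-up assembler-style syntax (purely illustrative, not any real vendor's format):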
// Minimum Microcode file.
cond OP:4;
cond uaddr:3;
signal /MEMIO = ....1;
signal MREAD = ...1.;
signal MWRITE = ..1..;
field REGMOVE = XX___;
signal MAR <- PC = 01...;
signal IR <- MEM = 10...;
start OP=XXXX; // A tiny fetch instruction
/MEMIO, MREAD, MAR <- PC; // Start a memory read; latch PC into MAR
hold; // Same as /MEMIO, MREAD, MAR <- PC
hold, -MREAD, IR <- MEM; // Same as /MEMIO, MAR <- PC, IR <- MEM
// End of file.
My understanding is that there are several forms of microcode updates.
Sometimes the microcode can do simple patches, like inserting a nop between problematic instruction combinations. This usually doesn't have much (if any) negative performance impact. The instruction decoder (again, AFAIK) is fairly programmable.
However if there is a more serious flaw in the actual silicon the microcode must rewrite the problem instructions to emulate the correct behavior and that can make it extremely slow - think emulating floating point but not quite as bad.
"If your system does not use SERIRQ and BIOS puts SERIRQ in Quiet-Mode, then the weak external pull up resistor is not required. All other cases must implement an external pull-up resistor, 8.2k to 10k, tied to 3.3V"
Microcode is software running on your CPU. And many problems can be fixed with a microcode update, but not all - see the workarounds listed in the PDF I linked - note especially how many say "None identified".
Hard to say if something like this would be patchable via a microcode update. But regardless, back then your CPU didn’t run software to run your software, so a new hardware revision was in order.
They do tremendous amounts of validation. I believe random generation of input data is part of that.
Here's an old heavily cited paper from Intel on the topic; I'm sure their state of the art has advanced considerably in the intervening 17 years since its publication:
"crashme" is one venerable program that does this kind of fuzzing -- I managed to find a bug in a cpu with it once, which does not speak well of their QA department.
A former coworker was a QA manager at Intel. He said it was an explicit decision to cut back on validation and QA, which is why he wasn't at Intel anymore. The general feeling was that they had "overreacted" to the Pentium FDIV bug and needed to move faster.
Broadcom Sibyte 1250. I had pre-release silicon, and there was a known bug that prefetch would occasionally hang the chip. I wanted to have a little fun goofing around, so I modified crashme to replace prefetches in the randomly generated code with a noop. A few minutes later, I hung the chip.
If I identified the right erratum later, it was: if there is a branch in the delay slot of a branch in a delay slot which is mispredicted, the cpu hangs. It was fixed by making this sequence throw an illegal instruction exception. (It was undefined behavior already, I think.)
Interesting. And... the nesting you describe is messing with my head! :)
How did the erratum get fixed? Hardware re-re-release?
Definitely filed crashme away, sounds like a useful tool.
The Sibyte 1250 sounds cute. There's a "650000000 Hz" in https://lists.freebsd.org/pipermail/freebsd-mips/2009-Septem... although I'm not sure if MIPS from circa 2002 (?) was that fast (completely ignorant; no field knowledge - definitely want to learn more though).
I also noted the existence of the BCM91250E via http://www.prnewswire.com/news-releases/broadcom-announces-n... sibytetm-bcm1250-mipsr-processor-to-accelerate-customers-software-and-hardware-development-75841437.html, which was kind of cool. I like how the chip is a supported target for QNX :)
Now I'm wondering what you were using it for. Some kind of network device? (I think I saw networking as one of its target markets.)
--
As an aside, I'm also very curious what HN notifier you use, if you use one. (I use http://hnreplies.com/ myself, but it's sometimes slow. I saw your message after 4 minutes in this case (fast for HN Replies); typing/research + IRL stuff = delay :) )
Our company was PathScale, and we were hoping that the Sibyte 1250 would make a great supercomputing chip. Opteron wasn't out yet, Intel was pricing 64-bit Itanium very high, and this dual-core Sibyte thing was projected to have a reasonable clock and it could do 2 64-bit fp mul-adds/cycle. We had Fred Chow and the GPLed SGI Compiler on our side. And the next revision of the chip had these great 10 gig networking features that I thought I could make work well with MPI, the scientific computing message-passing standard.
You can guess how it worked out: Sibyte was late, slow, and buggy. Even simple code sequences like hand-paired multiply-adds would start running at 1/2 speed after an interrupt. Our experienced compiler team was unable to get good perf on several SPECcpu benchmarks despite the code looking good. (Fred didn't have much hair left to pull out!)
Soon after we raised our A round we pivoted to using Opteron for compute, building an ASIC for a specialized MPI network, and Fred's team did an Opteron backend for the compiler.
The descendant of the network is now called Intel Omni-Path, and is on-package for Xeon Phi cpus.
I see. Always poignant to hear these kinds of stories.
I'm curious as to whether interrupt handling was done underneath the multiprocessing layer or whether interrupts were just hammering the pipeline design. (I assume by "an interrupt" you're referring to slowdown within the fraction of a second after a given interrupt occurred, within the context of floods of interrupts interspersed between instructions?)
Very cool to hear that Intel snapped up what you eventually managed to ship, FWIW - and that you were able to pivot in the way you did. Also interesting to hear about Opteron use in the field, my experience is only with tracking the consumer sector.
Unfortunately, fuzzing ultimately has a random component.
An instruction which accepts 2 registers and returns 1 register has a 192-bit problem space to validate. This complexity is present in an instruction as simple as `add`.
An AVX2 instruction which accepts 3 registers and outputs 1 has a 1024-bit problem space to validate.
This occurred in FMA3, with a ~512-bit problem space.
Repeat for _every_ instruction (HUNDREDS). You can see how a few bugs slip through the cracks. The problem space is as large as some cryptographic functions!! I'm honestly surprised we don't see more of them.
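(For the arithmetic: two 64-bit source registers plus one 64-bit destination is 3 x 64 = 192 bits of state; an AVX2 instruction with three 256-bit sources and one 256-bit destination is 4 x 256 = 1024 bits, already past the 128- or 256-bit key spaces of common ciphers.)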
The specific result produced by the data path is probably not very relevant in the case of a lockup. The control path involved with instruction decoding, register renaming, out-of-order execution, SMT, etc. is generally the cause of issues like this. With interactions between different blocks of the CPU and the size of some of the data structures involved, the full verification space is much, much larger.
I don't know about that. As I understand it, interesting things can happen if a large number of data lines toggle at the same time, and this obviously depends on both the data and the control path. Huge space of possible states.
If you read the source (TechPowerUp is terrible; why it isn't banned is beyond me), it is actually an _illegal instruction_ error, which just crashes the current application.
Also, if SMT is disabled, the error doesn't occur. And if the chip is overclocked, the error doesn't occur either.
Intel's AVX2 instructions don't crash the machine, but lead to an extreme increase in voltage and, thus, temperature. Sustained use of these instructions (say, for example, in signal processing and iterative parameter estimation algorithms implemented using routines in Intel's MKL) requires cooling way beyond what the usual boxed cooler that comes with an i7 CPU can deliver in order to prevent throttling.
Sort of? If you can actually increase your performance architecturally by doubling your vector width, then you double the power dissipated by the relevant bits of your execution and memory systems, true. But doubling performance by doubling power is really awesome. Things like your out-of-order window will tend to increase power as the square of their size. And in your typical desktop CPU regime, power scales with the cube of frequency.
So if you're lucky enough to have code that can benefit from 256-bit vectors, then AVX will double your performance and power; the thermal management system will then throttle you back down to regular power and 80% of your regular clock speed, for a net gain of merely 60% (2 x 0.8 = 1.6). Which is really nice if you happen to be doing all 256-bit-wide vector math. If you're only infrequently using vectors, your vector speedup will be smaller, but so will the increase in power that needs to be overcome, so it's still a net win.
This is the main reason why, for example, the i7-7700K comes with a "base clock" of 4.2 GHz even though for most workloads the CPU will operate at 4.4 or 4.5 GHz. That 4.2 GHz represents the maximum speed the CPU can maintain when running the worst case AVX2 workload without exceeding the specified 91W TDP.
Interesting... do you mind me asking: how does overclocking work these days, then? We have P-states, i.e. the CPU will go from 1200 MHz... 2400 MHz to 4 GHz, and then there's the "turbo" frequency range. So does modern overclocking apply some kind of multiplier to all these steps? Thanks.
There's a base clock that's usually something like 100MHz and can usually be tweaked by a few percent. The base clock affects all power states, and usually the memory and I/O (which is usually the limiting factor).
On top of that is a CPU frequency multiplier that is variable and the upper limit is unlocked on certain processor and motherboard combinations. If you have an unlocked multiplier you can set the maximum multiplier for each of the Turbo states (1-core, 2-core, through all-cores). You could configure them to all be the same multiplier, or just scale up the default behavior of running at a higher clock speed when fewer cores are active.
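For example, a 100 MHz base clock with a 45x all-core multiplier gives 4.5 GHz; nudging the base clock to 102 MHz with the same multiplier gives roughly 4.59 GHz.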
You're not wrong - the Vcore boost when both AVX pipes are powered up causes increased dissipation, but a large part of that is also caused by ... well, AVX processing huge amounts of data.
For similar reasons server processors (Xeon E5, E7 series) throttle the core clock in AVX "mode".
Probably part of why (the other obvious reason being die space) Ryzen lacks 256-bit SIMD hardware, and instead runs such instructions with a 128-bit unit over two cycles.
Well doing that tends to halve throughput for ALU-dense code, but obviously saves a bit of die space and more importantly reduces peak power draw, which has to be supported by the metallization on the chip.
AVX2 might have also been just a bit too late to be integrated outside the microcode.
Desktop parts have no dedicated AVX clock (however, their base vs. all-core turbo pretty much does the same thing), while server SKUs (Socket R3 and later) basically have two distinct clocking ladders, one for non-AVX code and one for AVX code (the "mode switch" quantum is rather coarse, on the order of one millisecond). So both base clock and maximum turbo clocks are lower in AVX code compared to non-AVX code.
As usual, actual achieved clock speeds depend on thermal performance and are not fixed. If cooling is completely insufficient, the clock frequency will drop below base clocks, i.e. thermal throttling.
To clarify: This isn't a secret, but part of the specs and even in some Intel slides. (However, platform and processor specs, even at a basic (product D/S) level, tend to be a rather lengthy read, so most people forgo reading all of that.)
Err, no, actually. It really is clocked down, and that's what Intel says.
There is the normal base frequency (which is, again, the one people see on the box. Intel calls it "marked TDP frequency")
There is the avx base frequency, which is less.
If you execute all avx workloads, it will clock down to the avx frequency, which is below the marked tdp frequency.
They are very explicit about this: "Workloads using Intel AVX instructions may reduce processor frequency as far down as the AVX base frequency to stay within TDP limits."
The security implications of this, as commented on in the article, seem pretty bad. I wonder whether any microcode fix can work or whether it would need a silicon-level fix. The second, I presume, would be very bad?
AMD already has a fix; it requires a BIOS update for the motherboard. They have not given details, but some experts think the fix is merely an increase in the power delivered to that part of the chip. Their conclusion stems from the fact that the FMA3 bug disappeared when the chip was over-clocked, because the over-clocking required increasing the voltage to the chip. So the BIOS update probably adjusts the voltages up a bit.
Since they are pushing how efficiently the chip regulates power to its various parts to keep power draw to a minimum, it sounds like a likely cause.
Assuming most of the testing budget is spent on the most common and likely instruction sequence patterns is probably not unrealistic. Still a bit odd that such a short sequence of repeats of the same instruction failed.
Which got me thinking.
With more than 2000 x86-64 and legacy opcodes, testing all 3-sequences with one input each is more than 8G sequences; covering that with a (guesstimated average) 64 bits of input per opcode is a staggering ~10^29 instructions, or on the order of 2T CPU*years at 4 Gops/s.
Not easy to test all that, even if I'm off by a couple of orders of magnitude in the right direction!
Might be that valid 2-sequences are the most one can hope to test even close to exhaustively, while at the same time covering some significant fraction of the operand space?
The fact that modern CPUs work at all, given their complexity in so many different areas, looks pretty darn close to magic, even from a pretty close distance.
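For what it's worth, a quick back-of-the-envelope program (same guesstimated figures as above, nothing measured) reproduces those orders of magnitude:

#include <stdio.h>
#include <math.h>

int main(void) {
    double opcodes      = 2000.0;             /* assumed x86-64 + legacy opcode count */
    double sequences    = pow(opcodes, 3.0);  /* all 3-instruction sequences: 8e9, i.e. "8G" */
    double inputs       = pow(2.0, 64.0);     /* guesstimated 64 bits of input per opcode */
    double instructions = sequences * inputs; /* ~1.5e29 */
    double rate         = 4e9;                /* 4 Gops/s */
    double cpu_years    = instructions / rate / (3600.0 * 24.0 * 365.0);
    printf("%.1e sequences, %.1e instructions, %.1e CPU*years\n",
           sequences, instructions, cpu_years);   /* ~1.2e12, roughly the 2T ballpark */
    return 0;
}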
A power issue would also plausibly explain why disabling SMT avoids the problem, since disabling SMT powers off a bunch of stuff (a whole thread context does need a bit of power), and, in a multi-threaded scenario, tends to reduce core usage thereby reducing average (! so likely not relevant) power draw as well.
And (probably more importantly) reduces overall IPC by around 30% generally. The whole point of SMT these days is to have continuous work for the ALUs despite recent cache misses.
If it's not fixable in microcode, then yeah, I'd say it's pretty bad. Makes it useless as a cloud machine, if anyone can cause a DOS and take out every other VM on a host.
For a personal machine, it's probably not terrible if you are using linux or something else where you can compile everything yourself. But running binaries built by someone else would be a crapshoot. I wonder how many games (of the Windows, AAA variety) use these multiply instructions?
There's also not a lot of detail in the article—like if the data has to be specific or just the combination of instructions is enough to crash it.
For many cloud vendors, including EC2, "dedicated hardware" just means yours will be the only virtual machine on that computer. But they're still running your image under a hypervisor.
It would still be a pain for cloud providers if the tenant can unwittingly DoS the hypervisor. Sure, they have watchdogs in place, but it gets more complicated if the DoS can be non-malicious.
I suspect Amazon (for instance) has far more shared machines than dedicated machines. If something isn't viable for shared machines then a huge section of the market is ripped away. Still seems catastrophic to me.
> Flops is only affected when the SMT is enabled, so disabling the SMT can be used as a temporary work-around (until the actual fix arrives).
Does this mean disabling SMT will also fix the bug, or is that specific to this app?
I've heard that some things perform better on Ryzen with SMT off, supposedly because OS-level task schedulers still need to be optimized for it. But I still wonder if Ryzen's SMT implementation is on par with Intel's first implementations.
Haswell and early Broadwell had a bug in their transactional memory extensions so serious that the microcode "fix" simply permanently disabled the instructions.
While those are the most remembered bugs, almost every CPU has some kind of errata that can lock up the system in really weird and specific ways. For example, my Intel Ivy Bridge CPU running Linux:
$ dmesg | grep -i microcode
[ 0.000000] microcode: microcode updated early to revision 0x1c, date = 2015-02-26
The fact that most users do not hit those bugs is because modern OSes already patch the microcode before execution (as in the case above; a typical update flow is sketched below).
Of course, if AMD can't fix those bugs without performance regressions (remember the infamous TLB bug from the early Phenoms?) it can be pretty bad. However, I don't think the majority of users need to be too cautious about it.
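For reference, on a typical Debian-flavored Linux box (package names vary by distro; this is only an illustration), picking up such microcode updates is just a package install plus an initramfs rebuild:

$ sudo apt install amd64-microcode    # or intel-microcode, depending on the CPU
$ sudo update-initramfs -u            # the blob gets loaded early during boot
$ dmesg | grep -i microcode           # after a reboot, check the loaded revision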
Yeah, that was kind of unfortunate as it made the Intel Quark incompatible with all normal Linux distros. I don't think it even had any kind of microcode update facility, being effectively a slightly updated 486.
Even with a runtime microcode update facility, no BIOS update equivalent would still make it pretty fatal for packaging except as an entirely distinct build target, because otherwise you'd need to do something kinky like always first booting a custom kernel+initrd target built to involve no locks, load the microcode update, then kexec into the real kernel.
I don't know if this bug affects Phenom II chips, but I've been running a Phenom II for 8 straight years in my primary desktop with nary a hiccup. Rock solid reliable. Guess I haven't been running the "correct conditions" for a crash?
You could also have a BIOS update with the workaround already baked in, or a chip that has the later silicon without the bug...
(I also had a Phenom II for many years without any mysterious crashing problems, even though the chip was the correct revision and there was no BIOS erratum workaround enabled.)
(While it doesn't apply for this specific example, it's still the reason why one generally will see microcode-fixed errata go away rather quickly, independently of BIOS updates)
FMA3 (fused multiply add) instructions are very common for any kind of audio processing (e.g. filters), not only for their speed but also their precision properties. Depending on how a game was compiled, these instructions would be all over the place.
But whether this specific sequence is ever run is another question, of course.
FMA is used in literally everything that deals with floats and has AVX2/FMA3 support, not just audio. AVX usage in game code isn't actually all that common, though, due to the need to support lowest common denominators.
And no, this specific sequence isn't ever run. If it were, they'd have found it while designing the chip. I checked the original source and it's hand written FMA3 intrinsics that don't correspond to a real computation.
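For anyone wondering what "hand-written FMA3 intrinsics" look like in practice, here's a generic sketch (my own illustration of ordinary FMA3/AVX2 usage, not the sequence that triggers the hang):

#include <immintrin.h>
#include <stddef.h>

/* Dot product of two float arrays, 8 lanes per iteration.
   Tail elements (n not a multiple of 8) are ignored for brevity. */
float dot(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc = va * vb + acc, one fused op */
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int j = 0; j < 8; j++)               /* horizontal sum of the 8 lanes */
        sum += lanes[j];
    return sum;
}

Compile with -mfma -mavx2 (or -march=native) and the _mm256_fmadd_ps calls come out as vfmadd instructions.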
WebAssembly does not currently have an fma instruction, and implementations are not permitted to fold plain multiply+add sequences into fma because that produces different rounding.
Some game developers that had optimized with AVX were asked to take it out since it would crash overclocked CPUs that were riding the limits. Hit a patch of AVX and bam, they'd fall over.
Intel has mostly fixed this in more recent chips so that now they look ahead for upcoming AVX instructions and just slow down.
FMA (fused multiply-add) is a basic computing operation that is a basic building block in tasks such as evaluating polynomials, vector-vector and matrix-vector multiplication, convolutions, and also algorithms to solve nonlinear equations.
You'd be hard pressed to find any computing application that doesn't take advantage of FMA instructions.
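As a concrete (and generic, not from the article) example of FMA as a building block, Horner's-rule polynomial evaluation is one fused multiply-add per coefficient, each with a single rounding:

#include <math.h>

/* Evaluates c[0] + c[1]*x + ... + c[n-1]*x^(n-1) using C99 fma(),
   which compilers can lower to a single FMA3 instruction when
   targeting CPUs that support it. */
double poly_eval(const double *c, int n, double x) {
    double acc = c[n - 1];
    for (int i = n - 2; i >= 0; i--)
        acc = fma(acc, x, c[i]);   /* acc = acc * x + c[i], one rounding */
    return acc;
}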
Can someone please fix the title? Crashes -> Freezes.
Big difference.
And summarized: Intel has about 80 known similar problems outlined in its errata for which no known fixes exist. For this AMD bug a BIOS update fix exists (more voltage).
http://forum.hwbot.org/showthread.php?t=167605