I worked on this problem for the past year at Google. It's a fascinating problem. In my subarea I focused on accelerators (like GPUs) running machine learning training.
Many users report problems like "NaN" during training: at some point, the gradients blow up and the job crashes. Sometimes these are caused by specific examples or by numerical errors on the part of the model developer, but sometimes they are the result of errors from bad cores (during matrix multiplication, embedding lookup, vector ops, whatever).
ML is usually pretty tolerant of small amounts of added noise (especially if it has nice statistical properties), and some training jobs will ride through a ton of uncorrected and undetected errors with few problems. It's a very challenging field to work in because it's hard to know whether your NaN came from your model or your chip.
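For concreteness, here is a minimal sketch of the kind of guard a training loop can use, assuming NumPy gradients and placeholder compute_grads/apply_update callables (neither is from the original comment). It doesn't tell you whether the model or the chip is at fault, but a NaN that reproduces on retry, or only on one machine, is a useful signal.

```python
# Hedged sketch: guard a training step against non-finite gradients instead of
# letting the whole job crash. compute_grads/apply_update are placeholders.
import numpy as np

def guarded_step(params, batch, compute_grads, apply_update, retries=1):
    for attempt in range(retries + 1):
        grads = compute_grads(params, batch)          # list of np.ndarray
        if all(np.isfinite(g).all() for g in grads):
            return apply_update(params, grads)
        # Could be a bad example, a numerics bug, or a bad core; a NaN that
        # reproduces on retry (or on another machine) helps narrow it down.
        print(f"non-finite gradients on attempt {attempt}; retrying")
    return params  # give up on this batch rather than blow up the whole job
```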
Are you able to say anything about the distribution across hardware? E.g., is there any correlation of faulty parts with serial numbers/production dates, or is it too random/infrequent to say?
- "Mercurial cores are extremely rare" but "we observe on the order of a few mercurial cores per several thousand machines". On average one core per 1000 machines is faulty? That's quite a high rate.
- Vendors surely must know about this? If not by testing then through experiencing the failures in their company servers.
- I've read the whole paper and I see no mention of them even reaching out to vendors about this issue. There are strong incentives on both sides to solve or mitigate this issue, so why aren't they working together?
I strongly assume that they must have reached out and reported these issues to, presumably, Intel as the biggest player here. Likely they're just not disclosing these numbers, and generally not many people are talking about this and potentially other CPU issues in public due to NDAs. Either way, it's quite amazing to dig down so deep in the production stack that you must conclude that it's the CPU at fault here. I presume academic research might have a hard time with this given the scale needed to run into these issues, but hopefully we'll see more research on this in the future.
> Either way, it's quite amazing to dig down so deep in the production stack that you must conclude that it's the CPU at fault here.
Google's internal production stack is much more amenable to that kind of digging than public cloud products:
* You can easily find out what machine a given borg task was running on. In fact, not just your own borg job but anyone's. You can query live state, or you can use Dremel to look up history.
* Similarly, even as a client of Bigtable or Spanner, you can find out the specific tabletservers/spanservers operating on a portion of your database and what machines they're running on. (Not as easy to cross this layer and get to the relevant D servers actually storing the data but I think it's all checksummed here anyway.) If your team has your own partition, you can see tabletserver/spanserver debug logs yourself also.
* There's a convenient frontend for looking up a bunch of diagnostic info for the machine, including failures of borg tasks (were other people's tasks crashing at the same time mine did? what was their crash message?), syslog-level stuff, other machine diagnostics like ECC / MCE errors, and repair history (swapped this DIMM, next attempt will swap this CPU).
It's not unusual for application teams to suspect a machine and basically vote it off the island (I don't want my jobs running here anymore, I cast a vote for it to be repaired / Office Spaced). It's rarer for them to take the time to really understand the problem in detail, like "core 34 sometimes returns incorrect results on this computation", although there's nothing in particular stopping them from doing so (other than lack of expertise and a long list of other things to do). The platforms team gets involved sometimes and really digs in; iirc in one bug they mentioned sending a CPU back to the vendor to examine with an electron microscope.
I'm not sure what lessons that offers for a public cloud where that kind of transparency isn't realistic...
I'm not disclosing anything new. Google's official SRE book, research publications, and conference presentations describe the systems I mentioned in more detail.
That rate sounds too high. Typically scan test gives 99.9% logic coverage. This means random defects must hit exactly this uncovered subset of logic to cause a fault undetectable by production test. Given that defect rates are low, 1 in 1000 having a fault that got past these tests seems too high. Unless of course Intel does not use scan test and has a more functional type of test method, though even then I'd imagine they must have a high coverage rate.
The article references Dixit et al. for an example of a root cause investigation of a CEE which is an interesting read: https://arxiv.org/pdf/2102.11245.pdf
> After a few iterations, it became obvious that the computation of Int(1.153) = 0 as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set Int(1.152) = 142 the result was accurate.
I'd love to see more details on the defective parts, particularly counts by CPU model (anonymized if need be) and counts of which part of the architecture exhibited faults.
From working in HPC I've handled reports of things like FMA units producing incorrect results or the random appearance of NaNs. Were it not for the fact that we knew these things could happen, and customers' intimate knowledge of their codes, I dread to think how "normal" operations would track these issues down. Bad parts went back to the CPU manufacturer and further testing typically confirmed the fault. But that end of the process was pretty much a black box to anyone but the CPU manufacturer. I'd be keen to know more about this too.
Fault tolerance seems to be the fundamental issue looming in the background of both traditional and quantum computing at the moment. Silicon is already at the point where there are only a dozen or so dopant atoms per gate, so a fluctuation of one or two atoms can be enough to impact behavior. It's amazing to me that with billions of transistors things work as well as they do. At some point it might be good to try to re-approach computation from some kind of error-prone analogue of the Turing machine.
> Silicon is already at the point where there are only a dozen or so dopant atoms per gate, so a fluctuation of one or two atoms can be enough to impact behavior.
How in the world do they get such a precise number of atoms to land on billions of transistors? It seems so hard for even one transistor.
FWIW, the active channel volume is much more than a dozen atoms in a modern FinFET process.
For a ~5nm process, there might be only a few dozen atoms across the width of the fin, but the other dimensions are much larger, for a total of probably somewhere around hundreds of thousands of atoms per channel.
But regardless, modern semiconductor manufacturing processes are incredible. The much-too-brief summary is that they shine very precise patterns of light on the silicon, using very high frequency light to activate photoresist, and then etch away the silicon that isn't protected by the photoresist. This doesn't actually produce features as small as the fins, so there are a lot of tricky techniques, like doing this patterning + etching once, then growing a layer of some other material on top, then etching again to leave only the very narrow sidewalls that grew around the original feature, then etching again using the sidewalls as the pattern.
Yes, materials science, lithography and chip fabrication are the real wizardry of the information age. A 20nm chip is badass on its own yet old school now.
You could probably do something like using idle cores (or idle hyperthreads) to duplicate instructions on an opportunistic basis to verify outcomes on a less than complete basis. There would be thermal and power consequences but some situations care about that less than others.
Unless the idle hyperthreads are on different cores, you'd most likely have the same execution results. Using idle cores could be interesting, but your thermal and power budget would be shared so your overall performance would still decrease.
This is probably difficult to do at a fine grained level, but I imagine that coarser synchronization and checks (both in software) could provide the necessary assurances that code executing on a single core is consistent with that of other cores.
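As a rough sketch of that kind of coarse software check (Linux-only, the core IDs and workload are placeholders, and the computation has to be deterministic for the comparison to mean anything):

```python
# Run the same pure computation pinned to two different cores and compare.
import os

def run_pinned(core_id, fn, *args):
    old = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core_id})      # pin the calling thread to one core
    try:
        return fn(*args)
    finally:
        os.sched_setaffinity(0, old)        # restore the original affinity

def cross_checked(fn, *args, cores=(0, 1)):
    a = run_pinned(cores[0], fn, *args)
    b = run_pinned(cores[1], fn, *args)
    if a != b:
        raise RuntimeError(f"cores {cores} disagree: {a!r} != {b!r}")
    return a

# e.g. cross-check a deterministic numeric kernel on cores 0 and 1
total = cross_checked(lambda: sum(i * i for i in range(10**6)))
```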
You can have a log of N register “transactions”, where N is large enough to hide core-to-core communication. If any of N transactions fail due to mismatch between cores, you roll back and throw exceptions.
A lot of this logic is already in out of order execution (e.g., Tomasulo algorithm). Memory has ECC and is probably a different problem.
I think it honestly all depends on what the dominant causal factors are and how this scales with node size. Effectively, if unreliability increases at the same rate as or faster than the performance gain as node size decreases, and 'high reliability' compute can be easily and generally segregated from other compute, then it would probably be easier just to not decrease node size rather than parallelize at the chip/core level. Certainly, the software side would be much easier.
The economics will never favor this approach. Customers will not choose to pay double to avoid the 1-in-a-million chance of occasionally getting a slightly wrong answer.
Does it have to be double? I know it's not a direct analogue, but parity schemes like RAID 6 or ECC RAM don't double the cost.
So the question is, how do you check these results without actually doing them twice? Is there a role here for frameworks or OS to impose sanity checks? Obviously we already have assertions, but something gentler than a panic, where it says "this is suspect, let's go back and confirm."
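One concrete (if narrow) answer, not from the parent comment: for a matrix product, Freivalds' randomized check verifies the result in O(n^2) work per trial rather than the O(n^3) the product cost, and it can just flag the result as suspect instead of panicking. A minimal NumPy sketch:

```python
# Freivalds-style check: is C plausibly equal to A @ B? Each trial is only a
# few matrix-vector products; a mismatch means "suspect, go back and confirm".
import numpy as np

def product_is_suspect(A, B, C, trials=3, tol=1e-6):
    n = B.shape[1]
    for _ in range(trials):
        r = np.random.randint(0, 2, size=(n, 1)).astype(A.dtype)
        if np.max(np.abs(A @ (B @ r) - C @ r)) > tol:
            return True       # disagreement: recompute or reroute this work
    return False              # consistent with A @ B (probabilistically)

A, B = np.random.rand(512, 512), np.random.rand(512, 512)
C = A @ B
assert not product_is_suspect(A, B, C)
```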
> Customers will not choose to pay double to avoid the 1-in-a-million chance of occasionally getting a slightly wrong answer.
With today's high-speed multi-core processors, a 1-in-a-million chance of a computation error would mean tens to hundreds of thousands of errors per second (a chip retiring on the order of 10^10 to 10^11 operations per second times a 10^-6 per-operation error rate).
I can imagine most consumers that do any sort of work with their computer would appreciate close to 100% stability when they need to get work done.
That's usually why no one who depends on their computer working day in and day out overclocks their components. The marginal performance gains aren't worth the added unreliability and added power/heat/noise footprint.
The lack of adoption of/demand for just ECC RAM by consumers would seem to be an argument in the opposing direction. (Yes, it's not widely available currently, but I think it's safe to say that availability is driven by predictions about adoption given past market behavior.)
Who decided there is a lack of consumer demand? There is a lack of OEM demand for sure, which is driven by the fact that most companies are willing to sell crap if it can save 1 cent per product. The average consumer does not even know this problem exists. Add to that the artificial market segmentation by Intel, which is absolutely stupid, and the consumer actually cannot buy a consumer CPU that supports ECC. The situation is then locked into a vicious circle where all the components have ridiculous premiums and lower volumes.
Consumers are not the customer. System integrators are the customer. They are motivated to minimize the number of distinct manufacturing targets. Consumers have no choice but to take what is offered.
I think there would be a pretty large hardware cost to ensure the input signals come to both processors at the same clock every time on the many high-speed interfaces a modern CPU is using.
And you'd need to eliminate use of non-deterministic CPU-local data, like RDRAND and on-die temperature, power, etc. sensors. Most likely, you'd want to run the CPUs at a fixed clock speed to avoid any differences in settling time when switching speeds.
This could probably effectively find broken CPUs (although you wouldn't know which of the pair was broken), but you could still have other broken devices resulting in bad computations. It might be better to run calculations on two separate nodes and compare; possibly only for important calculations.
It’s not necessary to serialize the full execution to detect errors. On an out of order processor, there is already buffering of results happening that is eventually serialized to visible state. To check errors, you could just have one buffer per processor and compare results before serialization, raising an error on mismatch between processors. Serialization is merely indicating that both _visible_ executions have agreed up to that instruction but it still allows for some local/internal disagreements. For example, two instructions can finish in opposite orders across cores and that is fine as long as they are serialized in order.
As for settling times, those are random anyway. Processors are binned according to how good the settling times ended up being. It’s unlikely to have two homogeneous chips.
No, it is not. You can always trade off performance for reliability by repeating your computations several times, preferably with some variation in the distribution/timing of work to avoid the same potential hardware failure pattern.
If data integrity issues become a problem, it might be cheaper to mark certain cores as fault-tolerant and provide the OS with mechanisms to denote certain threads as requiring fault-tolerance.
This is a good idea. In a sense this is somewhat available now when choosing whether to run certain things on GPU vs CPU. My understanding is GPUs tend to play faster and looser, since people don't tend to notice a few occasional weird pixels in a single frame. What if it could be made finer grained, by the instruction?
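At thread granularity (rather than per instruction), a user-space approximation of the OS mechanism suggested above might look like the sketch below; Linux-only, and the vetted core list is hypothetical, presumably produced by some screening process.

```python
# Restrict "must be correct" work to cores that have passed screening.
import os
import threading

VETTED_CORES = {0, 2, 4, 6}          # hypothetical list from a screening pass

def run_on_vetted_cores(fn, *args):
    out = {}
    def target():
        os.sched_setaffinity(0, VETTED_CORES)   # 0 = the calling thread (Linux)
        out["value"] = fn(*args)
    t = threading.Thread(target=target)
    t.start()
    t.join()
    return out["value"]

total = run_on_vetted_cores(sum, range(10**6))
```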
This is one of the reasons people using GPUs for Important Calculations (CAD) spend big bucks on Quadro cards rather than the ostensibly equivalent GeForce. Quadro cards have ECC memory and allow higher precision floats to be used for certain calculations.
> A deterministic AES mis-computation, which was "self-inverting": encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
Depends on the mode. AES-CTR encrypts an increasing counter to make a keystream, then xors that with the plaintext to make the ciphertext, or with the ciphertext to make the plaintext. Any consistent error in encryption will lead to a consistently wrong keystream, which will round-trip successfully.
It's possible other modes have this property because of the structure of the cipher itself, but that's way out of my league.
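A toy illustration of the CTR-mode point (not real AES; the "buggy core" is simulated by deterministically flipping one bit of each keystream block): any consistent mis-computation of the keystream still round-trips on the same "core", but not elsewhere.

```python
import hashlib

def keystream_block(key, counter, buggy=False):
    """Stand-in for AES(key, counter); 'buggy' deterministically flips one bit."""
    block = hashlib.sha256(key + counter.to_bytes(16, "big")).digest()[:16]
    if buggy:
        block = bytes([block[0] ^ 0x40]) + block[1:]    # consistent wrong answer
    return block

def ctr_xor(key, data, buggy):
    out = bytearray()
    for i in range(0, len(data), 16):
        ks = keystream_block(key, i // 16, buggy)
        out += bytes(a ^ b for a, b in zip(data[i:i + 16], ks))
    return bytes(out)

key, msg = b"k" * 16, b"attack at dawn!!"
ct = ctr_xor(key, msg, buggy=True)                 # "encrypted" on the bad core
assert ctr_xor(key, ct, buggy=True) == msg         # same core: round-trips fine
assert ctr_xor(key, ct, buggy=False) != msg        # healthy core: gibberish
```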
I'm not sure about that. I think it's just that the AES hardware was busted somehow and didn't actually perform the AES algorithm correctly. The Intel AES hardware just deterministically performs the algorithm, so Intel can't just weaken the algorithm somehow, at least if you're not worrying about local side channels.
I don't completely understand the perception that standard non-hardened high-perf CPUs, especially in an industry and more specifically in a segment that has been reported as consistently cutting a few corners in recent years (maybe somewhat less than client CPUs, but still), should somehow be exempt from silent defects, because... magic?
If you want extremely high reliability, for critical applications, you use other CPUs. Of course, they are slower.
So the only interesting info that remains is that the defect rate seems way too high, and maybe the quality has been decreasing in recent years. In which case, when you are Google, you probably could and should complain (strongly) to your CPU vendors, because likely their testing is lacking and their engineering margins too low... (at least if it's really the silicon that is at fault, and not, say, the motherboard)
Now of course it's a little late for the existing ones, but the sudden realization that "OMG, CPUs do sometimes fail, with a variety of modes, and for a variety of reasons" (including, surprise(?!), aging) seems, if not concentrating on the defect rate, naïve. And the potential risk of sometimes having high error rates was already very well known, especially in the presence of software changes and/or heterogeneous software and/or heterogeneous hardware, due to the existence of logical CPU bugs, which sometimes also result in silent data corruption, and sometimes with non-deterministic-like behaviors (so a computation can work on one core but not another because of "random" memory controller pressure and delays, and the next time with the two cores reversed).
I think the main point is: we have reached a time in which there are no guarantees anymore that your HW works (until recently we only had no guarantee that the SW works).
ECC memory was deprecated in consumer machines 15 years ago or more. This was a conscious industry choice that reliability in hardware could be sacrificed to other concerns. That's just an example.
People use ECC to protect against misbehavior that is random both spatially and temporally. That is, it's not meant to protect against the same transistors producing incorrect outputs consistently/systematically. Put another way, we felt we had a rather safe guarantee that consistent/systematic misbehavior of the same portion of hardware would be either testable (like with memory diagnostics) or nonexistent. This paper is tearing apart that assumption.
Because benchmarks sell and are easy to perform, whereas characterising spurious failures is hard, and unreliable CPUs are not in the interest of users of server products (and arguably not in the interest of users of other products as well, but I can see gamers willing to trade stability for a marginal perf improvement).
These periodic self tests are required for some safety critical applications. My reaction to this paper is that this approach might have to be used in data centers as well. Unfortunately, it can't be done without help from the CPU designers because the test sensitivity relies on knowledge of the exact implementation details of the underlying hardware units.
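A crude sketch of what a software-only version could look like (Linux-only, with a placeholder workload; as the parent notes, real coverage would need the designer's knowledge of the underlying execution units):

```python
# Pin the calling thread to each core in turn, run a known-answer kernel, and
# report cores that disagree. The kernel here is a placeholder with far less
# coverage than a vendor-designed self-test would have.
import os

def kernel():
    acc = 0.0
    for i in range(1, 100_000):
        acc += (i * 1.000001) ** 0.5      # exercise FP multiply and sqrt a bit
    return acc

def self_test_all_cores():
    original = os.sched_getaffinity(0)
    reference, suspects = None, []
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})
            value = kernel()
            if reference is None:
                reference = value         # first core is only a reference,
            elif value != reference:      # not ground truth; voting across
                suspects.append(core)     # cores would be more robust
    finally:
        os.sched_setaffinity(0, original)
    return suspects
```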
This is fascinating. I feel like the most straightforward (but hardly efficient) solution is to provide a way for kernels to ask CPUs to "mirror" pairs of cores, and have the CPUs internally check that the behaviors are identical? Seems like a good way to avoid large scale data corruption until we develop better techniques...
Yeah I didn't know! And I just realized this is mentioned in the paper just a little further below where I paused. It seems like it would significantly affect anything shared (like L3 cache)... would Intel and AMD have appetite for adding this kind of thing to x86?
The pair in lockstep is "close", in that it only includes the core and deterministic private resources like core private caches. Shared resources like a L3 cache are outside of the whole pair, and can be seen as accessed by the pair. All output is from the pair and checked for consistency (same for both cores in lockstep) before going out.
Not directly related but some platforms supporting lockstep are flexible: you can use a pair as either 2 cores (perf) or a single logical one (lockstep).
This might not be a constructive observation, but I can just see the IBM mainframe designers sitting back with a refreshing beverage while we talk about identifying and handling hardware faults during runtime.
I wonder about the larger feedback loops between hardware error checking in software and the optimizations hardware manufacturers are making at the fab. Presumably more robust software would result in buggier cores being shipped, but would this actually result in more net computation per dollar spent on processors?