
Error correction in CPUs is generally limited to the cache, and its incidence is recorded: if something had failed permanently such that the error-correction path was being taken constantly, you would be able to observe it.
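You can see that recorded incidence for yourself: on Linux, corrected-error counts live in the machine-check status banks. A minimal sketch in C, assuming x86, root, and the msr kernel module loaded (which banks exist and which fields are implemented vary by CPU model):

    /* Read IA32_MC0_STATUS (MSR 0x401) and print its corrected-error
     * count.  Requires `modprobe msr` and root; bits 52:38 hold the
     * count on CPUs that implement it, per the Intel SDM. */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t status;
        if (pread(fd, &status, sizeof status, 0x401) != sizeof status) {
            perror("pread IA32_MC0_STATUS");
            return 1;
        }
        close(fd);

        int valid = (int)(status >> 63) & 1;       /* VAL: log entry valid */
        uint64_t count = (status >> 38) & 0x7FFF;  /* corrected error count */
        printf("MC0_STATUS=%#" PRIx64 " valid=%d corrected=%" PRIu64 "\n",
               status, valid, count);
        return 0;
    }

A machine constantly leaning on its error-correction path would show up here (or in mcelog/rasdaemon, which read the same banks).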

Absent a mechanism that reduces the clock speed of the CPU when it becomes unstable, there's no reasonable way for failures in the CPU to result in it running slower. Such a mechanism doesn't generally exist: modern CPUs regulate their clock, but only in response to a fixed power and temperature envelope. The recent iPhone throttling is the only notable case where anything was done automatically in response to an unstable CPU, and that consisted of applying a tighter envelope if the system reset.
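That envelope regulation is directly observable. A quick sketch against Linux's standard cpufreq sysfs files:

    /* Print cpu0's current and maximum frequency from the cpufreq
     * sysfs interface (standard paths on Linux; values are in kHz). */
    #include <stdio.h>

    static long read_khz(const char *path) {
        FILE *f = fopen(path, "r");
        long khz = -1;
        if (f) {
            if (fscanf(f, "%ld", &khz) != 1) khz = -1;
            fclose(f);
        }
        return khz;
    }

    int main(void) {
        long cur = read_khz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        long max = read_khz("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq");
        printf("cpu0: %ld kHz (max %ld kHz)\n", cur, max);
        return 0;
    }

Run it while the machine is under heavy load and you can watch the reported clock drop as the thermal envelope bites.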

This is reflected in the experiences of those who run older hardware with contemporary software: it generally still works just fine at the speed that it used to.




From the article:

> “For example, microprocessor degradation may lead to lower performance, necessitating a slowdown, but not necessary failures”


It may be necessary for the micro to run slower in order to stay stable, but to my knowledge no system for making that adjustment automatically exists in the vast majority of systems. The main problem is detection: how do you tell if the CPU is on the margin of failing without a huge amount of extra circuitry? It can be hard enough to detect that it has already had a fault. It's not for lack of interest: such sensing approaches have been patented before, but they don't seem to have made it out of the R&D lab.


"to my knowledge"

CPU technology is quite arcane and very high-level; there are so many patents, so much IP money, and so much secrecy involved, since CPU tech is strategic for geopolitical power. Do you work as an engineer at Intel, ARM, or AMD? On chip design?

> How do you tell if the CPU is on the margin of failing

It's not about failing, it's about error detection. Redundancy is a form of error detection: if several gates disagree on a result, they have to redo the work. That's one simple form of error detection.
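A toy sketch in C of that voting idea, what high-reliability designs call triple modular redundancy (the names here are made up, this is not any real CPU's interface):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical computation carried out by three redundant units. */
    static uint32_t compute(uint32_t x) { return x * 2654435761u; }

    int main(void) {
        uint32_t a = compute(42), b = compute(42), c = compute(42);

        /* Classic per-bit majority voter: an output bit is set iff at
         * least two of the three inputs agree on it. */
        uint32_t voted = (a & b) | (a & c) | (b & c);
        int disagreement = !(a == b && b == c);

        printf("voted=%u disagreement=%d\n", voted, disagreement);
        return 0;
    }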

CPUs never really fail; they just slow down because the gates generate more and more errors, requiring recalculation until the detected error is finally corrected. An aged chip will just have more and more errors, and they slow it down. That is why old chips are slower, independently of software.

A CPU that is very old will be very slow, or will just crash the computer again and again, at which point hardware people will toss the whole thing, since they're not really trained to diagnose whether it's the CPU, the RAM, the capacitors, the GPU, the motherboard, etc. In general they will tell their customers "it's not compatible with new software anymore". In the end, most CPUs get tossed out anyway.

It's also a matter of planned obsolescence. Maintaining sales is vital, so having a product with a limited lifespan is important if manufacturers want to hold the market.


> CPU technology is quite arcane and very high-level; there are so many patents, so much IP money, and so much secrecy involved, since CPU tech is strategic for geopolitical power. Do you work as an engineer at Intel, ARM, or AMD? On chip design?

If such a mechanism existed, it would be documented at least at a high level, and its effects would be observable under controlled tests. Neither is the case, in contrast to the power and temperature envelopes I mentioned. There is no actual evidence that aged chips operating at the same clock rate perform computation more slowly; your subjective experience that hardware 'slows down' does not count.

> It's not about failing, it's about error detection. Redundancy is a form of error detection: if several gates disagree on a result, they have to redo the work. That's one simple form of error detection.

> CPUs never really fail; they just slow down because the gates generate more and more errors, requiring recalculation until the detected error is finally corrected. An aged chip will just have more and more errors, and they slow it down. That is why old chips are slower, independently of software.

This is not how consumer CPUs work. It's not even how high-reliability CPUs necessarily work (some work through a high level of redundancy, but they don't generally retry operations automatically when a failure happens: that's a great way of getting stuck). Such redundancy is so incredibly expensive from a power and chip area point of view that no CPU vendor would be competitive in the market with a CPU that worked the way you describe. If a single gate fails in a CPU, the effects can range from unnoticeable to halt-and-catch-fire.

The only error correction present is memory-based, where errors are more common and ECC can be implemented relatively cheaply compared to error-checking computations.
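To give a sense of why memory ECC is cheap: it's a few check bits per word plus some XOR trees. Below is a toy Hamming(7,4) single-error corrector; real DRAM ECC is typically a (72,64) SECDED cousin of the same construction:

    #include <stdint.h>
    #include <stdio.h>

    static int bit(uint8_t w, int pos) { return (w >> (pos - 1)) & 1; }

    /* Encode 4 data bits into a 7-bit codeword; parity bits sit at
     * positions 1, 2 and 4. */
    static uint8_t encode(uint8_t d) {
        int d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        int p1 = d1 ^ d2 ^ d4;   /* covers positions 3, 5, 7 */
        int p2 = d1 ^ d3 ^ d4;   /* covers positions 3, 6, 7 */
        int p3 = d2 ^ d3 ^ d4;   /* covers positions 5, 6, 7 */
        return p1 | (p2 << 1) | (d1 << 2) | (p3 << 3)
                  | (d2 << 4) | (d3 << 5) | (d4 << 6);
    }

    /* Decode, correcting at most one flipped bit: the syndrome is the
     * position of the bad bit (0 means no error detected). */
    static uint8_t decode(uint8_t c) {
        int s1 = bit(c, 1) ^ bit(c, 3) ^ bit(c, 5) ^ bit(c, 7);
        int s2 = bit(c, 2) ^ bit(c, 3) ^ bit(c, 6) ^ bit(c, 7);
        int s3 = bit(c, 4) ^ bit(c, 5) ^ bit(c, 6) ^ bit(c, 7);
        int syndrome = s1 | (s2 << 1) | (s3 << 2);
        if (syndrome) c ^= 1 << (syndrome - 1);
        return bit(c, 3) | (bit(c, 5) << 1) | (bit(c, 6) << 2) | (bit(c, 7) << 3);
    }

    int main(void) {
        uint8_t code = encode(0xB);   /* data bits 1011 */
        code ^= 1 << 4;               /* inject a single-bit fault */
        printf("decoded %#x (expected 0xb)\n", decode(code));
        return 0;
    }

Three check bits protect four data bits here; at 64-bit granularity the overhead drops to 8 bits in 72, which is why it's affordable for memory while checking whole computations is not.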


> If such a mechanism existed, it would be documented

Why would it be? It's internal functionality, and CPUs usually have a one-year warranty or so, and I'm not sure they really come with guaranteed FLOPS, only a frequency I guess. If it's tightly coupled to trade secrets, I would not expect it to be documented. I also doubt you could find everything you want to know in a CPU's documentation.

> There is no actual evidence

The Wikipedia article I mentioned; the physics is evidence enough.

> If a single gate fails in a CPU

I did not say "fail", I meant "miscalculated". There is a very low probability of it happening, but it can still happen because of the sheer quantity of transistors, hence the error correction.

> Such redundancy is so incredibly expensive from a power and chip area point of view

Sure it is, so what? At some point all CPUs need it and it becomes necessary. There are billions (I think?) of transistors on a CPU.


Documentation is light on details, but both major CPU vendors give extensive documentation on the performance attributes of their processors, such as how many cycles an instruction may take to complete, and neither sees fit to mention that an instruction 'may take an arbitrary amount longer as the CPU ages'. Not to mention, these performance attributes are frequently measured by researchers and engineers, and an effect such as instructions taking more cycles on one sample than on another from the same batch has yet to be observed (it is notable, and noted, when cycle counts do differ, e.g. across steppings or microcode versions). At least one of the many, many people who investigate this in great detail would have commented on it.
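Anyone can reproduce that kind of measurement. A rough sketch that times a dependent chain using the timestamp counter (x86-only; assumes a constant-rate TSC, so the figure is in reference cycles, and the loop body is just an arbitrary serially-dependent operation):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void) {
        const int n = 100000000;
        volatile uint64_t sink;
        uint64_t x = 1;

        uint64_t t0 = __rdtsc();
        for (int i = 0; i < n; i++)
            x += x >> 1;             /* each iteration depends on the last */
        uint64_t t1 = __rdtsc();

        sink = x;                    /* keep the loop from being elided */
        (void)sink;
        printf("%.2f reference cycles per iteration\n",
               (double)(t1 - t0) / n);
        return 0;
    }

The figure is stable across runs on a given model at a given clock; nobody has shown it creeping upward with chip age.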

The Wikipedia article you linked makes zero mention of redundant gates as a workaround for reliability issues. The closest it comes is that designers must consider reliability, but that is design at the level of the chip's geometry, not its logic. It doesn't even make sense as a strategy: the extra cost of redundant logic to work around reliability issues on a smaller node would outweigh the advantages of that node.

One of the greatest things about modern CPUs is how reliably they do work given that you need such a high yield on individual transistors.


Thanks for convincing me!





