It's weird that I've read so many people say "your computer isn't slow because o...

thebruce87m · on June 15, 2020

As you as you mean “slower than it was” then that statement holds mostly true. Your CPU, RAM and GPU should perform the same on day 1 and day 10000 as long as they are still functional. Any “degradation” won’t make them slower, just non functional.

The complication in this comes from the SSD, where the flash cells have a feedback loop for operations such as erasing that can take longer as these cells degrade.

jokoon · on June 15, 2020

CPU have error correction, which will mitigate transistor aging and make the CPU work slowly instead of not at all.

It will not "perform the same". At some point there is a noticeable slowdown, and even though Wirth's law is at work, it's not the entire story. Heat will also make any chip age faster.

This article talks about aging under 5nm, but aging is already an issue above 5nm. Read the article.

https://en.wikipedia.org/wiki/Electromigration#Practical_imp...

> The complication in this comes from the SSD

I always experienced slowdowns on computer that did not have SSD. Software is not always the only problem.

thebruce87m · on June 15, 2020

Someone else has addressed your other points, but for ECC:

This will not detect an error in computation, only a bit flip in data. The current way to mitigate computation errors is to have two processors to detect an error in computation, or more to do a voting system if mere detection is not good enough. Since you cannot detect a computation error with a single CPU (without overhead somewhere, and thus lower performance), you can’t slow down to fix it.

The ECC systems I have worked with can fix a 1 bit error and detect two or more bit errors. They do this by using an algorithm to convert the original data in to an output that is bigger than the original (e.g. 64 bytes is now 72 bytes). This output data does not make sense until passed through the reversing algorithm. So basically the overhead is zero since the memory controller is running the algorithm in hardware every time anyway, so no slow down.

rcxdude · on June 15, 2020

Error correction in CPUs is generally limited to the cache, and its incidence is recorded: if something had failed permanently such that the error correction path was being taken constantly, you would be able to record it.

Absent a mechanism which reduces the clock speed of the CPU when it becomes unstable, there's no reasonable way in which failures in the CPU will result in it running slower. Such a mechanism doesn't generally exist: modern CPUs regulate their clock but only in response to a fixed power and temperature envelope. The recent iphone throttling is the only notable case where anything was done automatically in response to an unstable CPU, and that consisted of applying a tighter envelope if the system reset.

This is reflected in the experiences of those who run older hardware with contemporary software: it generally still works just fine at the speed that it used to.

jokoon · on June 15, 2020

From the article:

> “For example, microprocessor degradation may lead to lower performance, necessitating a slowdown, but not necessary failures

rcxdude · on June 15, 2020

It may be necessary for the micro to run slower in order to be stable, but to my knowledge no system for making that adjustment automatically exists in the vast majority of systems. The main problem being it's hard to detect. How do you tell if the CPU is on the margin of failing without a huge amount of extra circuitry? It can be hard enough to detect that it has had a fault. It's not due to lack of interest: such sensing approaches have been patented before, but don't seem to have made it out of the R&D lab.

jokoon · on June 15, 2020

"to my knowledge"

CPU technology is quite arcane, very high level, there are so many patents, IP money and a lot of secrecy involved, since CPU tech is quite a strategic one for geopolitical power. Do you work as an engineer at intel, ARM, AMD? On chip design?

> How do you tell if the CPU is on the margin of failing

It's not about failing, it's about error detection. Redundancy is a form of error detection. If several gates disagree on a result, they have to start again what they worked on. That's one simple form of error detection.

CPU never really fail, they just slow down because gates generate more and more errors, requiring recalculation until they finally correct the detected error. An aged chip will just have more and more errors, that will slow it down. Which is the reason why old chip are slower, independently of software.

Although a CPU that is very old will be very slow, or just crash the computer again and again that hardware-people will just toss the whole thing, since they're not really trained or taught to diagnose if it's the CPU, the RAM, the capacitors, the GPU, the motherboard, etc. In general they will tell their customers "it's not compatible with new software anymore". In the end, most CPUs get tossed out anyway.

It's also a matter of planned obsolescence. Maintaining sales is vital, so having a product that a limited lifespan is important if manufacturers want to hold the market.

rcxdude · on June 15, 2020

> CPU technology is quite arcane, very high level, there are so many patents, IP money and a lot of secrecy involved, since CPU tech is quite a strategic one for geopolitical power. Do you work as an engineer at intel, ARM, AMD? On chip design?

If such a mechanism existing it would be documented at at least a high level and its effects observable under controlled tests. Neither are, in contrast to the power and temperature envelopes I mentioned. There is no actual evidence that aged chips operating with the same clockrate perform computation slower, your subjective experience that hardware 'slows down' does not count.

> It's not about failing, it's about error detection. Redundancy is a form of error detection. If several gates disagree on a result, they have to start again what they worked on. That's one simple form of error detection.

> CPU never really fail, they just slow down because gates generate more and more errors, requiring recalculation until they finally correct the detected error. An aged chip will just have more and more errors, that will slow it down. Which is the reason why old chip are slower, independently of software.

This is not how consumer CPUs work. It's not even how high-reliability CPUs necessarily work (some work through a high level of redundancy but they don't generally automatically retry operations when a failure happens: that's a great way of getting stuck). Such redundancy is so incredibly expensive from a power and chip area point of view that no CPU vendor would be competetive in the market with a CPU which worked like you describe. If a single gate fails in a CPU, the effects can range from unnoticable to halt-and-catch-fire.

The only error correction which is present is memory based, where errors are more common and ECC can be implemented relatively cheaply compared to error checking computations.

jokoon · on June 15, 2020

> If such a mechanism existing it would be documented

Why would it? It's an internal functionality, and CPU usually have a 1 year warranty or so, and I'm not sure they really have guaranteed FLOPS, only frequency I guess. If it's tightly coupled to trade secrets, I would not expect this to be documented. I also doubt that you could find everything you want to know in a CPU documentation.

> There is no actual evidence

The wikipedia article I mentioned, physics is enough evidence.

> If a single gate fails in a CPU

I did not say fail, I meant "miscalculated". There is a very low probability of it happening, but it can still happen because of the high quantity of transistors, hence error correction.

> Such redundancy is so incredibly expensive from a power and chip area point of view

Sure it is, so what? At one point all CPU need it and it becomes necessary. There are billions (I think?) of transistors on a CPU.

rcxdude · on June 15, 2020

Documentation is light on details, but both major CPU vendors give extensive documentation on the performance attributes of their processors, such as how many cycles an instruction may take to complete, and none see fit to mention once 'may take an arbitrary amount longer as the CPU ages'. Not to mention, these performance attributes are frequently measured by reasearchers and engineers, and such an effect as instructions taking more cycles on one sample compared to another from the same batch has yet to be observed (and it's notable and noted when it does differ, e.g. from different steppings or microcode versions). At least one of the many many people who investigate this in great detail would have commented on it.

The wikipedia article you linked makes zero mention of redundant gates as a workaround for reliability issues. The only thing close is that designers must consider it, but this is design at the level of the geometry of the chip, not its logic. It doesn't even make good sense as a strategy: the extra cost of redundant logic to work around reliability issues on a smaller node will outweigh the advantages of that node.

One of the greatest things about modern CPUs is how reliably they do work given that you need such a high yield on individual transistors.

jokoon · on June 15, 2020

Thanks for convincing me!