This is fascinating. I feel like the most straightforward (but hardly efficient)...

jacques_chester · on June 3, 2021

Tandem used to do this. By descent the technology wound up with HPE.

Their Tech Reports are worth a sample and fortunately they're online: https://www.hpl.hp.com/hplabs/index/Tandem

Probably the best one to start with: https://www.hpl.hp.com/techreports/tandem/TR-90.5.pdf

electricshampo1 · on June 3, 2021

Thanks for the reference.

smallpipe · on June 3, 2021

That’s called dual-core lockstep, and it’s very common in automotive and other applications where reliability is paramount.

dataflow · on June 3, 2021

Yeah I didn't know! And I just realized this is mentioned in the paper just a little further below where I paused. It seems like it would significantly affect anything shared (like L3 cache)... would Intel and AMD have appetite for adding this kind of thing to x86?

yaantc · on June 3, 2021

The pair in lockstep is "close", in that it only includes the cores and deterministic private resources like core-private caches. Shared resources like an L3 cache sit outside the pair and can be seen as being accessed by the pair as a whole. All output comes from the pair and is checked for consistency (it must be the same for both cores in lockstep) before going out.

Not directly related but some platforms supporting lockstep are flexible: you can use a pair as either 2 cores (perf) or a single logical one (lockstep).
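To make the "check the output before it goes out" idea concrete, here is a rough software sketch, not how any real lockstep hardware is built. Every name in it (`Core`, `LockstepPair`, `run`, the `fault` hook) is made up for illustration, and the "cores" are just Python objects with private state. The point is only that both cores see the same inputs, and nothing reaches the shared side unless both produced the same output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree on an output."""


@dataclass
class Core:
    # Private, deterministic state (stands in for registers and private caches).
    acc: int = 0
    # Hypothetical fault hook, used only to corrupt one core's output in a demo.
    fault: Optional[Callable[[int], int]] = None

    def step(self, value: int) -> int:
        # Deterministic work: a 32-bit running sum of the inputs.
        self.acc = (self.acc + value) & 0xFFFFFFFF
        out = self.acc
        return self.fault(out) if self.fault else out


@dataclass
class LockstepPair:
    primary: Core = field(default_factory=Core)
    shadow: Core = field(default_factory=Core)

    def step(self, value: int) -> int:
        # Both cores get the same input and run the same step.
        a = self.primary.step(value)
        b = self.shadow.step(value)
        # The comparison happens before anything leaves the pair.
        if a != b:
            raise LockstepMismatch(f"{a:#x} != {b:#x}")
        return a


def run(pair: LockstepPair, inputs: List[int], shared: List[int]) -> List[int]:
    # `shared` stands in for anything outside the pair (e.g. a shared cache).
    for v in inputs:
        shared.append(pair.step(v))
    return shared
```

With two healthy cores, `run(LockstepPair(), [1, 2, 3], [])` just returns the running sums. If you build the pair with `shadow=Core(fault=lambda x: x ^ 1)`, the first step raises `LockstepMismatch` and nothing gets appended to `shared`.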

meepmorp · on June 3, 2021

Mainframes do this. They'll also disable the failing CPUs and place a service call to IBM to get someone to swap out the part.

dataflow · on June 3, 2021

Wow that's cool. It'd be quite interesting if the conclusion ends up being that we should go back to mainframes...