In the end, I'm thinking most of these are related to branch prediction?
It strikes me that either branch prediction is inherently complex enough that it's always going to be vulnerable to this, and/or it so defies the way most of us intuitively think about code paths / instruction execution that it's hard to conceive of the edge cases until it's too late?
At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
More generally, most of these are related to speculative execution, where branch misprediction is a common way to induce speculative mis-execution.
Speculation is hard; it's akin to introducing multithreading into a program: you're explicitly choosing to tilt at the windmill of pure technical correctness, because in a highly concurrent application every possible error will occur fairly routinely. Speculation is also great: combined with out-of-order execution it's a multithreading-like boon to overall performance, because you can now resolve several chunks of code in parallel instead of one at a time. It's just also a minefield of correctness issues, and the alternative would be losing something like ten years of performance gains (going back to roughly ARM Cortex-A53 performance).
The recent realization is that "observably correct" needs to include timing. If you can guess at what the data might be, and the program runs faster when you're right, that's basically the same thing as reading the data by another means. It's a timing-oracle attack.
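To make the "guess the data and run faster when you're right" part concrete, here's a minimal sketch of the classic bounds-check-bypass shape (Spectre v1). The names and sizes are illustrative, not taken from any particular exploit:

    #include <stdint.h>
    #include <stddef.h>

    uint8_t array1[16];          /* attacker supplies the index x          */
    uint8_t array2[256 * 4096];  /* probe array: one cache line per value  */
    size_t  array1_size = 16;

    /* Architecturally the bounds check is always respected. But if the
     * branch predictor guesses "in bounds" for an out-of-bounds x, the
     * loads below run speculatively, and array2[secret * 4096] gets pulled
     * into the cache before the misprediction is rolled back. Timing
     * accesses to array2 afterwards reveals which line is hot, i.e. the
     * secret byte -- the timing oracle described above. */
    void victim(size_t x) {
        if (x < array1_size) {
            volatile uint8_t tmp = array2[array1[x] * 4096];
            (void)tmp;
        }
    }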
(in this case AMD just fucked up though; there's no timing attack, this is just implemented wrong, and the instruction can speculate against changes that haven't propagated to other parts of the pipeline yet)
The cache is the other problem: modern processors are built with every tenant sharing a single big L3 cache, and it turns out that the cache also needs to be proof against timing attacks on whatever data is present in it.
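The measurement side of that shared-cache channel looks roughly like a Flush+Reload probe. A minimal sketch, assuming x86 with clflush and rdtscp available via GCC/Clang intrinsics; the cycle threshold is a made-up ballpark you'd calibrate per machine:

    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

    /* Evict a (shared) line from the whole cache hierarchy. */
    static void flush_line(const void *addr) {
        _mm_clflush(addr);
        _mm_mfence();
    }

    /* Time a reload of the same line. If another tenant touched it in the
     * meantime it comes back from cache and the reload is fast; otherwise
     * it comes from DRAM and is slow. That difference is the side channel. */
    static uint64_t reload_cycles(const void *addr) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        *(volatile const uint8_t *)addr;
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;   /* e.g. under ~100 cycles suggests a cache hit */
    }

Usage is flush, let the victim run, then check reload_cycles(), repeated over many lines and iterations to separate signal from noise.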
> At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Never for branch prediction. It just gets you too much performance. If it becomes too much of a problem, the solution is greater isolation of workloads.
In certain cases isolation and simplicity overlap; I suspect, for example, that the dangers of SMT implementation complexity are part of why Apple didn't implement it in their CPUs. Likely we'll see this elsewhere too; for example, Amazon may never push to have SMT in their Graviton chips (the early generations use off-the-shelf cores from ARM, where they didn't have a readily available choice).
I could be mistaken, but I don't think Zenbleed has anything to do with SMT, based on my reading of the document. There is a mention of hyperthreads sharing the same physical registers, but you can spy on anything happening on the same physical core, because the register file is shared across the whole core.
It even says so in the document:
> Note that it is not sufficient to disable SMT.
Apple's chips don't have this vulnerability, but it's not because they don't have SMT. They just didn't write this particular defect into their CPU implementation.
Correct, I was responding to parent writing "At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?"
I think we may be seeing an industry-wide shift away from SMT because the performance penalty of dropping it is small and the complexity cost of keeping it is high; if so, that fits the parent's speculation about the trend. In a narrow sense Zenbleed isn't related to SMT, but OP's question seems perfectly relevant to me. I come from a security background, and on average more complicated == less secure, because engineering resources are finite and it's just harder and more work to make complicated things correct.
Not really if that's an attack you're concerned about, because guests can attack the hypervisor via the same mechanisms. You would need to gang schedule to ensure all threads of a core were only either in host or guest.
> At what point does the complexity of CPU architectures become so difficult to reason about that we just accept the performance penalty of keeping it simpler?
Basically never for anything that's at all CPU-bound; that growth in complexity is really the only thing that's been powering single-threaded CPU performance improvements since Dennard scaling stopped around 2006 (and by then they were already plenty complex: by the late '90s and early 2000s, x86 CPUs were firmly superscalar, out-of-order, branch-predicting, speculatively executing devices). If your workload can be made fast without that stuff (i.e. no branches and easily parallelised), you're probably using a GPU instead nowadays.
You can rent one of the Atom Kimsufi boxes (N2800) to experience first-hand a CPU with no speculative execution. The performance is dire, but at least it hasn't gotten worse over the years: they're immune to just about everything.
We demanded more performance and we got what we demanded. I doubt manufacturers are going to walk back on branch prediction no matter how flawed it is. They'll add some more mitigations and features which will be broken-on-arrival.
I didn't demand more performance. My 2008-era AthlonX2 would still be relevant if web browsers hadn't gotten so bloated. I still use it for real desktop applications, i.e. everything that isn't in Electron.
There's VLIW / "predication" / some other technical name I forget for architectures that instead ask you to explicitly schedule instructions, data, and branch prediction. If I remember right, the two biggest examples were IA-64 and Alpha. I want to think HP PA-RISC did the same, but I'm not clear on that one.
For various reasons, all of these architectures eventually lost out to market pressure (and cost/watt/IPC, I guess).
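For flavor, the closest thing most of us still touch is a compiler-visible branch hint. This isn't IA-64's EPIC, just GCC/Clang's __builtin_expect, but it shows the "programmer/compiler states the prediction" idea that those ISAs took much further:

    #include <stddef.h>

    /* The hint tells the compiler (and, on targets that support it, the
     * hardware via code layout / hint bits) which way the branch usually
     * goes, instead of leaving it entirely to a runtime predictor. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int sum_positive(const int *v, size_t n) {
        int acc = 0;
        for (size_t i = 0; i < n; i++) {
            if (likely(v[i] > 0))   /* assumed-common case stays on the hot path */
                acc += v[i];
        }
        return acc;
    }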