I gave a presentation on this last year for QConLondon (before the lockdown) and afterwards virtually for the LJC, if people prefer to listen/watch videos.
I would love a Bartosz Ciechanowski interactive article on microprocessors. It may be outside his domain though, since the visualisations and demos would be less 3D model design and more, perhaps, mini simulations of data channels or state machines that you can play through. Registers that can have initial values set, and then you can step through each clock cycle. Add a new component every few paragraphs and see how it all builds up. I did all this at university, but would love a refresher that is as well made as his other blog posts.
>"One of the most interesting members of the RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either."
PDS: Why do I bet that the Transmeta Crusoe didn't suffer from Spectre -- or any of the other x86 cache-based or microcode-based security vulnerabilities that are so prevalent today?
Observation: Intentional hardware backdoors -- would have been difficult to place in Transmeta VLIW processors -- at least in the software-based x86 translation portions of it... Now, are there intentional hardware backdoors in its lower-level VLIW instructions?
I don't know and can't speculate on that...
Nor do I know if the Transmeta Crusoes contained secret deeply embedded "security" cores/processors -- or not...
But secret deeply embedded "security" cores/processors and backdoored VLIW instructions aside -- it would sure be hard as heck for the usual "powers-that-be" -- to be able to create secret/undocumented x86 instructions with side effects/covert communication to lower/secret levels -- and run that code from the Transmeta Crusoe's x86 software interpreter/translator -- especially if the code for the x86 software interpreter/translator -- is open source and thoroughly reviewed...
In other words, from a pro-security perspective -- there's a lot to be said for architecturally simpler CPUs -- regardless of how slow they might be compared to some of today's super-complex (and, ahem, less secure...) CPUs...
AFAIK Transmeta's core was doing a lot of speculative execution stuff, so if it had evolved until today I wouldn't bet against it also gaining issues like Spectre.
But, even if it is... keep in mind that when running x86 instructions, you still have the x86 translation software proxy layer... that means you could grab any given offending / problem-causing x86 instruction -- when you encountered it -- and recode the VLIW output for it to emit a different set of native VLIW instructions -- that you knew were safe...
In other words, with a Transmeta Crusoe -- if the x86 translation layer is open source and you possess it (and can code / understand things) -- then you'll have some options there.
Which is unlike a regular x86 CPU -- where the way it decodes and executes instructions -- cannot be changed in any way by the user...
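To make that concrete, here's a minimal sketch in C of what such a patchable translation step could look like -- the opcode names and the "recode this instruction" policy are entirely invented, and have nothing to do with Transmeta's actual Code Morphing Software:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical opcode numbers -- not real x86 or Crusoe encodings. */
    enum { X86_LOAD = 0x10, X86_CLFLUSH = 0x20 };
    enum { VLIW_LOAD = 1, VLIW_SERIALIZE = 2, VLIW_FLUSH = 3 };

    /* Append one native op to the translation buffer. */
    static size_t emit(uint8_t *out, size_t n, uint8_t op) {
        out[n] = op;
        return n + 1;
    }

    /* Translate one guest instruction -- the part you could read and patch. */
    size_t translate_one(uint8_t guest_op, uint8_t *out, size_t n) {
        switch (guest_op) {
        case X86_CLFLUSH:
            /* Suppose this instruction turns out to be abusable for a
             * timing attack: recode it as a serializing op before the
             * flush, or drop it entirely -- your call, because the
             * translator is just software you control. */
            n = emit(out, n, VLIW_SERIALIZE);
            return emit(out, n, VLIW_FLUSH);
        case X86_LOAD:
            return emit(out, n, VLIW_LOAD);
        default:
            return n;   /* all other guest instructions elided here */
        }
    }

The point being that the guest-instruction-to-native-ops mapping lives in software you can read and change, which you simply cannot do with the decode logic baked into a conventional x86 core.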
The Transmeta Crusoe is unlikely to be vulnerable, but its Intel contemporaries aren't either. So you'd need to be looking at some hypothetical "today" version.
An open system with such a design would indeed be fascinating (the original wasn't open, and Transmeta was big on their patents on this stuff). More flexibility than microcode patches too.
The Elbrus 2000 is a mass-produced VLIW microprocessor with binary translation of x86. There is a set of models with different numbers of cores and DSP blocks.
The code-morphing software itself is almost certainly a source of new Spectre-like side-channel attacks -- like being able to tell whether something is already in the translation cache via timing.
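To illustrate the timing idea, here is a toy self-measurement in C -- not an actual attack, and it assumes (as such a code-morphing layer plausibly behaves) that the first execution of a region pays the translation cost while later runs hit the translation cache:

    #include <stdio.h>
    #include <time.h>

    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Stand-in for a code region whose translation status we want to infer. */
    static volatile int sink;
    static void region(void) { for (int i = 0; i < 1000; i++) sink += i; }

    int main(void) {
        long long t0 = now_ns();
        region();                 /* slow if it still has to be translated */
        long long t1 = now_ns();
        region();                 /* fast once the translation cache is warm */
        long long t2 = now_ns();
        printf("first: %lld ns, second: %lld ns\n", t1 - t0, t2 - t1);
        /* A large gap suggests the region wasn't translated yet -- that
         * difference is the signal a co-resident attacker would try to
         * measure about someone else's code, not their own. */
        return 0;
    }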
It would be much easier for a vendor to turn off speculative compilation in that case though, i.e. for HPC you want all the performance, but a cloud vendor could still protect vulnerable interfaces.
Well, again, if the x86 code morphing software is available and open source, and if someone understands this software and can modify it -- then that's infinitely better (from a security perspective) -- than having to run x86 code directly on a regular AMD/Intel x86 processor...
In the latter case -- you have absolutely no control whatsoever over how the processor interprets and dispatches its x86 instructions...
Oh totally. I'm more sure than not that something like open source code morphing software could be made more secure against side-channel attacks, with greater flexibility than is afforded by microcode updates.
I'm mainly saying that it's a problem space that has actively shipping implementations (Nvidia Denver), has new levels of cache whose performance depends on previous code flow, and hasn't been fully explored publicly AFAIK. There are probably some dragons in there, at least in the pre-Spectre versions of that software.
If you're talking about Rosetta, it's not as clear that you'd see any successful attacks. It only runs at one CPU privilege level. And even in the browser sandbox escape versions Rosetta heavily uses AOT when it can, so your JS is probably not sharing a translation cache with much if any of the code you'd be attacking.
This is in contrast to Transmeta where the whole system more or less ran out of the one translation cache.
Some consider the E2K (Elbrus 2000) a successor to the Transmeta Crusoe, and it is not affected by the Spectre issue. It does binary translation of x86 code as well, and quite fast (around a 20% performance loss).
I wouldn't say outdated, but these ideas have a long history. Superscalar processors go back to Cray's CDC 6600 in 1964. Cache memory was used in the IBM System/360 Model 85 in 1969. Pipelining was used in the ILLIAC II (1962) and probably earlier. Branch prediction was used in the IBM Stretch (1961). Out-of-order execution and register renaming was implemented in the IBM System/360 Model 91 (1966).
It's interesting to see how many CPU architecture ideas that we consider modern were first developed in the 1960s, and how long they took to move into microprocessors.
That's of course true, but it might be misleading. OoO didn't take off until HPS came up with the reorder buffer (and enabled precise exceptions), with the Pentium Pro being the first (and highly successful) implementation. Also, branch prediction has dramatically improved since Stretch: Yeh and Patt's two-level predictor was a breakthrough, and the current state of the art is Seznec's TAGE-SC.
My point is that there's still a lot of advancement made.
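For anyone who hasn't dug into the predictor literature: the basic building block is still a table of saturating counters. A gshare-style predictor (a heavily simplified relative of the two-level and TAGE-SC designs -- the table size and hashing below are purely illustrative) looks roughly like this in C:

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t  counters[TABLE_SIZE];  /* 2-bit saturating counters, 0..3 */
    static uint32_t ghist;                 /* global branch-history register  */

    /* Predict: hash the branch PC with global history, taken if counter >= 2. */
    bool predict(uint32_t pc) {
        uint32_t idx = (pc ^ ghist) & (TABLE_SIZE - 1);
        return counters[idx] >= 2;
    }

    /* Train with the real outcome once the branch resolves. */
    void update(uint32_t pc, bool taken) {
        uint32_t idx = (pc ^ ghist) & (TABLE_SIZE - 1);
        if (taken  && counters[idx] < 3) counters[idx]++;
        if (!taken && counters[idx] > 0) counters[idx]--;
        ghist = (ghist << 1) | (taken ? 1u : 0u);
    }

TAGE and friends layer multiple such tables with different history lengths on top of this, which is where most of the recent accuracy gains come from.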
You are right, outdated would imply not useful which is obviously not true. But all of this stuff was in my copy of Hennessy & Patterson from those days so it is not exactly new (because I am that old!).
Ahem, what HAS changed since then? Besides new models, updated "MHz" values, and some performance tables, nothing that is of interest to a compressed introduction to the topic. So, personally, what would you have added to the article?
I was curious about the following comment on SMT in the post:
>"From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread – things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on. Luckily, these parts only constitute a tiny fraction of the overall processor's hardware."
Is each "SMT core" then just one additional PC and TLB then? I'm not sure if "SMT core" is the correct term or just "SMT" is but it seems like generally with Hyper Threading there is generally 1 hyper thread available for each core effectively doubling the total core count. It seems like it's been that way for a very long time. Is there not much benefit beyond offering single hyper thread/SMT for each core? Or is just prohibitively expensive?
>"The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken.
Could someone say what the definition of a "backward" vs. a "forward" branch is? Is backward when the loop continues, and forward a jump out of or return from a loop?
Also are there any examples of "static branch prediction" CPU architectures?
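My current reading, for what it's worth (corrections welcome): "backward" means the branch target is at a lower address than the branch instruction itself, which is what a loop's back edge looks like, and "forward" means the target is at a higher address, like skipping over an if body. A rough sketch in C, with invented instruction addresses:

    /* Sketch of the "backward taken, forward not-taken" convention;
     * the addresses are hypothetical, loosely what a compiler might emit. */
    int sum_non_negative(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            /* 0x10: cmp a[i], 0
             * 0x14: jl  0x1c   <- forward branch: the target (0x1c) is at a
             *                     HIGHER address, so the convention predicts
             *                     not-taken (i.e. usually don't skip the add) */
            if (a[i] < 0)
                continue;
            s += a[i];    /* 0x18: add s, a[i] */
            /* 0x1c: inc i; cmp i, n
             * 0x20: jl  0x10   <- backward branch: the target (0x10) is at a
             *                     LOWER address (the loop's back edge), so
             *                     predict taken -- loops usually repeat       */
        }
        return s;
    }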
I would love an update that covers recent developments in SoC integration, for example the on-package RAM and neural processing in the M1 chip.
These kinds of optimizations always make me wonder whether they are worth it. Might it be more efficient to use those transistors for more, simpler cores instead? Perhaps the fact that most problems are so sequential makes timing/clock-rate optimizations inevitable.
I would like to see a superscalar OoO CPU with a RISC-V ISA. Since RISC-V cores tend to be very small, I would expect to see a CPU with hundreds of cores.
https://speakerdeck.com/alblue/understanding-cpu-microarchit...