As far as killing HP PA-RISC, SGI MIPS, and DEC Alpha, and seriously hurting the chances of SPARC and POWER being adopted outside of their respective parent companies (did I miss any)?
Thing is, they could have killed it off by 1998 without ever releasing anything, and that alone would still have killed the other architectures it was trying to compete with. Instead they waited until 2020 to end support.
What the VLIW of Itanium needed and never really got was proper compiler support. Nvidia has this in spades with CUDA. It's easy to port code to Nvidia hardware, where you do get serious speedups. AVX-512 never offered enough of a speedup from what I could tell, even though it was well supported by at least ICC (and by numpy/scipy when properly compiled).
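For a concrete sense of why AVX-512 gains often underwhelm, here is a minimal sketch (mine, not the commenter's) of a saxpy-style kernel, scalar versus AVX-512F intrinsics. On memory-bound loops like this, the 16-wide vectors are limited by bandwidth rather than ALU width, so the measured speedup lands far below 16x. Function names are illustrative; compile with something like `gcc -O2 -mavx512f` on a CPU that supports it.

```c
/* Illustrative sketch: scalar vs AVX-512 saxpy (y = a*x + y). */
#include <immintrin.h>
#include <stddef.h>

void saxpy_scalar(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_avx512(float a, const float *x, float *y, size_t n) {
    __m512 va = _mm512_set1_ps(a);              /* broadcast a into all 16 lanes */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {              /* 16 floats per 512-bit register */
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));  /* a*x + y */
    }
    for (; i < n; i++)                          /* scalar tail */
        y[i] = a * x[i] + y[i];
}
```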
> What the VLIW of Itanium needed and never really got was proper compiler support.
This is kinda under-selling it. The fundamental problem with statically-scheduled VLIW machines like Itanium is that they put all of the complexity in the compiler. Unfortunately it turns out it's just really hard to make a good static scheduler!
In contrast, dynamically-scheduled out-of-order superscalar machines work great but put all the complexity in silicon. The transistor overhead was expensive back in the day, so statically-scheduled VLIWs seemed like a good idea.
What happened was that static scheduling stayed really hard while the transistor overhead of dynamic scheduling became cheap to the point of irrelevance. "Throw more hardware at it" won handily over "make better software".
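As a toy illustration of what static scheduling asks of the compiler (my sketch, with a hypothetical 3-issue machine and made-up assembly, not Itanium's actual encoding):

```c
#include <stddef.h>

/* For a[i] * b[i] + c[i] * 2, a statically scheduled 3-issue VLIW needs the
 * compiler to decide, at compile time, which operations share each
 * cycle-wide bundle, padding empty slots with NOPs (assuming 2-cycle loads
 * that hit in cache):
 *
 *   bundle 0:  load r1 <- a[i]  | load r2 <- b[i]  | load r3 <- c[i]
 *   bundle 1:  nop              | nop              | nop   ; waiting on loads
 *   bundle 2:  mul r4 <- r1,r2  | shl r5 <- r3,#1  | nop
 *   bundle 3:  add r6 <- r4,r5  | nop              | nop
 *
 * An out-of-order core runs the plain serial instruction stream compiled
 * from the C below and discovers the same parallelism at run time, with no
 * NOP slots and no baked-in guess about how long the loads really take. */
long kernel(const long *a, const long *b, const long *c, size_t i) {
    return a[i] * b[i] + c[i] * 2;
}
```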
No, VLIW is even worse than this. Describing it as a compiler problem undersells the issue. VLIW is not tractable for a multitasking / multi-tenant system due to cache residency issues. The compiler cannot efficiently schedule instructions without knowing what is in cache, but it can't know what's going to be in cache if it doesn't know what's occupying the adjacent task time slices. Add virtualization and it's a disaster.
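A minimal sketch of that residency problem (my example, assuming typical latencies of roughly 4 cycles for an L1 hit versus 200+ for a miss to DRAM):

```c
#include <stddef.h>

struct node { struct node *next; long payload; };

/* The load of p->next costs ~4 cycles if the node is still in L1, or 200+
 * if another tenant's time slice evicted it. A VLIW compiler has to bake
 * one latency assumption into the bundle schedule; an out-of-order core
 * just keeps issuing whatever independent work it can find until the load
 * actually returns. */
long walk(const struct node *p) {
    long sum = 0;
    while (p) {
        sum += p->payload;   /* depends on the previous iteration's load */
        p = p->next;         /* latency unknowable at compile time */
    }
    return sum;
}
```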
If it's pure TFLOPs you're after, you do want a more or less statically scheduled GPU. But for CPU workloads, even the low-power efficiency cores in phones these days are out of order, and the size of reorder buffers in high-performance CPU cores keeps growing. If you try to run a CPU workload on GPU-like hardware, you'll just get pitifully low utilization.
So it's clearly true that the transistor overhead of dynamic scheduling is cheap compared to the (as-yet unsurmounted) cost of doing static scheduling for software that doesn't lend itself to that approach. But it's probably also true that dynamic scheduling is expensive compared to ALUs, or else we'd see more GPU-like architectures using dynamic scheduling to broaden the range of workloads they can run with competitive performance. Instead, it appears the most successful GPU company largely just keeps throwing ALUs at the problem.
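To make the workload split concrete, here is an illustrative pair of loops (my example, hypothetical function names): the first is the regular, independent-per-element math that "just add more ALUs" hardware eats up, the second is the branchy, serially dependent code where a big out-of-order core earns its transistors.

```c
#include <stddef.h>

/* Regular, branch-free, independent per element: a GPU (or any "more ALUs"
 * design) keeps thousands of lanes busy on this. */
void elementwise(float *out, const float *x, const float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * y[i] + 1.0f;
}

/* Data-dependent branch plus a serial dependency chain through `state`:
 * wide static hardware mostly idles here, while an out-of-order core with
 * branch prediction and a deep reorder buffer does comparatively well. */
long branchy(const int *x, size_t n) {
    long state = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] & 1)
            state = state * 3 + x[i];
        else
            state -= x[i];
    }
    return state;
}
```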
I think OP meant "transistor count overhead" and that's true. There are bazillions of transistors available now. It does take a lot of power, and returns are diminishing, but there are still returns, even more so than just increasing core count. Overall what matters is performance per watt, and that's still going up.
> As far as killing HP PA-RISC, SGI MIPS, and DEC Alpha, and seriously hurting the chances of SPARC and POWER being adopted outside of their respective parent companies (did I miss any)?
I would argue that it was bound to happen one way or another eventually, and Itanium just happened to be a catalyst for the extinction of nearly all alternatives.
High to very high performance CPU manufacturing (NB: the emphasis is on the manufacturing) is a very expensive business, and back in the 1990s no one was able (or willing) to invest in the manufacturing and commit to the continuous investment needed to keep the CPU fabrication facilities up to date. For HP, SGI, Digital Equipment, Sun, and IBM, a high-performance RISC CPU was the single most significant enabler, yet not their core business. It was a truly odd situation where they all had a critical dependency on CPUs, yet none of them could manufacture them themselves and were all reliant on a third party[0].
Even Motorola, which was in the semiconductor business in a very serious way, could not meet market demand[1].
Look at how much it costs Apple to get what they want out of TSMC – tens of billions of dollars nearly every year, if not every year. We can see very well today how expensive it is to manufacture a bleeding-edge, high-performing CPU – look no further than Samsung, GlobalFoundries, the beloved Intel, and many others. Remember the days when Texas Instruments used to make CPUs? Nope, they don't make them anymore.
[0] Yes, HP and IBM used to produce their own CPUs in-house for a while, but then that ceased as well.
[1] The actual reason why Motorola could not meet the market demand was, of course, an entirely different one – the company's management did not consider CPUs to be their core business, as they primarily focused on other semiconductor products and on defence, which left CPU production underinvested. Motorola could have become a TSMC if they could have seen the future through a silicon dust shroud.