> Most of the transistors in a modern computation environment are just sitting there waiting to be touched by an instruction
True, but it's important to note that if all the transistors in a modern CPU switched on at once, it would quickly overheat. This is the "power wall": we can squeeze more transistors onto one die than we can actually turn on at one time due to electrical and thermal constraints.
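For intuition, dynamic power scales roughly as P ≈ α·C·V²·f, where α is the fraction of the switched capacitance actually toggling each cycle. A quick back-of-the-envelope in Python; every number here is made up for illustration, not a measurement from any real die:

    # Dynamic power: P = alpha * C * V^2 * f (illustrative numbers only)
    alpha = 0.10    # activity factor: fraction of capacitance switched per cycle
    C = 1.0e-8      # assumed aggregate switched capacitance, farads
    V = 1.0         # supply voltage, volts
    f = 4.0e9       # clock frequency, hertz

    print(alpha * C * V**2 * f)  # 4.0 W at 10% activity
    print(1.00 * C * V**2 * f)   # 40.0 W if everything switched at once

With these (assumed) numbers the worst case draws 10x the typical power, which is the power wall in miniature: the cooling budget is sized for typical activity, so most of the die has to stay dark most of the time.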
> far, far more transistors in RAM than in the CPUs, overall
Also true, and this is an active area of research. Many people have tried various approaches to performing computations using DDR and other memory technologies. In the past, people were trying to use DDR to run automata. These days there seems to be a lot of focus on processor-in-memory technologies; it turns out memristors can actually be used for computation, effectively turning the entire memory array into a hugely wide SIMD RISC processor. Here is some recent work presented on this subject:

Real Processing-in-Memory with Memristive Memory Processing Unit (mMPU) - https://ieeexplore.ieee.org/document/8825114

PPAC: A Versatile In-Memory Accelerator for Matrix-Vector-Product-Like Operations - https://ieeexplore.ieee.org/document/8825013

Parallel Stateful Logic in RRAM: Theoretical Analysis and Arithmetic Design - https://ieeexplore.ieee.org/document/8825150
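To make the "hugely wide SIMD" point concrete: in stateful-logic PIM, a single memory command applies the same bitwise operation across every column of the array at once, so the effective vector width is the physical row width. Here's a toy Python simulation of that execution model using a MAGIC-style NOR primitive; the sizes are arbitrary and this models the idea, not any real device's interface:

    import numpy as np

    # Toy model of stateful-logic PIM: each row is a word line, and one
    # "instruction" applies a bitwise op across all columns simultaneously.
    ROWS, COLS = 4, 1024                  # illustrative sizes, not a real part
    array = np.random.randint(0, 2, size=(ROWS, COLS), dtype=np.uint8)

    def in_memory_nor(a, b, out):
        # NOR two rows into a third: COLS bits computed in one step.
        # NOR is functionally complete, so arithmetic can be built from it.
        array[out] = 1 - (array[a] | array[b])

    in_memory_nor(0, 1, 2)                # one step = 1024 parallel NORs
    print(array[2][:8])

Since NOR is functionally complete, composing enough of these row-wide steps yields arbitrary logic and arithmetic, which is roughly how designs like the mMPU build adders out of the memory cells themselves.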
> there's still a lot of room to grow, speed wise.
Yes and no. Single-threaded performance is close to tapping out. Production processes can only shrink so far before physics starts getting in the way. Pipelines and speculation can only get so deep (and going deeper broadens the attack surface for security vulnerabilities). Performance growth for massively parallel workloads is continuing along at a healthy clip, and will probably continue to do so for quite some time. Of course the trouble is that end-user desktop software is generally not massively parallel.
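Amdahl's law puts numbers on that last caveat: if a fraction p of the work parallelizes, speedup on n cores is 1 / ((1 - p) + p/n), capped at 1/(1 - p) no matter how many cores you add. A quick sketch, where the 90% figure is just an example:

    def amdahl_speedup(p: float, n: int) -> float:
        """Speedup of a program with parallel fraction p on n cores."""
        return 1.0 / ((1.0 - p) + p / n)

    # Even a program that is 90% parallel tops out at 10x:
    for n in (4, 16, 64, 1_000_000):
        print(n, round(amdahl_speedup(0.90, n), 2))   # 3.08, 6.4, 8.77, ~10.0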
Actually, we can still significantly boost single-threaded raw speed; it just doesn't help with the memory wall. The approach is based on MOS current-mode logic (MCML).
We can build 20 GHz CPUs now with passive cooling, but they don't beat cutting-edge CMOS cores on memory-hard single-threaded workloads. They do reach 2-cycle add and roughly 3-cycle multiply latency, though. I hope someone just plops down a RISC-V core with that kind of design, paired with explicit preloading into a tiny cache that gets 2- or 3-cycle load latency into registers.
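For what that explicit-preload programming model might look like in practice, here's a sketch of the usual double-buffering pattern against a software-managed scratchpad. Everything here is hypothetical: `preload` stands in for whatever explicit bulk-fill instruction such a core would expose, and CHUNK for its tiny cache size.

    CHUNK = 256                      # elements that fit the hypothetical tiny cache

    def preload(src, start):
        """Stand-in for an explicit bulk-load into the fast local buffer."""
        return list(src[start:start + CHUNK])

    def sum_with_preload(src):
        total = 0
        buf = preload(src, 0)                    # fill the first buffer
        for start in range(0, len(src), CHUNK):
            nxt = preload(src, start + CHUNK)    # issue the next fill up front...
            for x in buf:                        # ...then compute from fast memory
                total += x
            buf = nxt                            # swap buffers
        return total

    print(sum_with_preload(range(10_000)))       # 49995000

The point of issuing the next fill before computing on the current buffer is to hide main-memory latency behind useful work; on a real core the fill would run asynchronously while the loop body executes.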
I'm sure some computations could work well on that sort of very fast, shallow-pipeline core: highly sequential stuff like SAT/SMT solvers and other inherently divide-and-conquer algorithms.
You're changing the effective throughput whether you do it by upping the clock rate or deepening the pipeline. Using half the data per cycle at twice the clock rate will cause the same memory pressure.
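The arithmetic behind that: demanded memory bandwidth is bytes-per-cycle times clock rate, so the two knobs trade off exactly. With illustrative numbers:

    def bandwidth_gb_s(bytes_per_cycle, clock_ghz):
        # Demanded memory bandwidth in GB/s
        return bytes_per_cycle * clock_ghz

    print(bandwidth_gb_s(16, 5))    # 80.0 -- wide datapath at 5 GHz
    print(bandwidth_gb_s(8, 10))    # 80.0 -- half the width at twice the clock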
> which you can use with like explicit preloading into a tiny cache
That will kill it. As soon as you put it on the compiler designers or programmers to do something special to realize performance benefits, you're going to lose to architectures that don't.
Sure, compiler writers and programmers will optimize for your architecture... if it's popular and widely used. So you have a chicken-and-egg problem: you need to get adoption in the first place by running existing workloads faster.
> We can build 20 GHz CPUs now with passive cooling,
Citation? Like for real, that's cool and I'd like to read about it!