Getting bigger caches closer to the CPU seems to be the main issue here. SRAM takes up more die real estate than DRAM, so on-chip DRAM is being considered, but it doesn't look likely. Further out from the CPU, technologies faster than flash but slower than DRAM are coming along.
This article doesn't address the architectural issues of what to do with devices which look like huge, but slow, RAM. The history of multi-speed memory machines is disappointing. Devices faster than flash are too fast to handle via OS read and write, but memory-mapping the entire file system makes it too vulnerable to being clobbered. The Cell processor tried lots of memory-to-memory DMA, and was too hard to program.
Maybe hardware key/value stores, so you can put database indices in them and have instructions for accessing them.
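Purely as a sketch of what that could look like from the software side (the device and its API are hypothetical; nothing here corresponds to real hardware): the point is that an index lookup becomes a single device operation instead of pointer-chasing through slow RAM.

    # Hypothetical driver-level interface to a hardware key/value store.
    # The dict is a stand-in for on-device storage, and put/get stand in
    # for what would be single device commands or instructions.
    class HardwareKVStore:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self._table = {}  # stand-in for on-device storage

        def put(self, key, value):
            # On real hardware this would be one store command,
            # not a software hash insert.
            self._table[key] = value

        def get(self, key):
            # Likewise one lookup command: no pointer-chasing in DRAM.
            return self._table.get(key)

    # A database index kept in the device instead of main memory:
    index = HardwareKVStore(capacity_bytes=64 << 30)
    index.put(b"user:42", b"row@0x7f3a9c00")
    print(index.get(b"user:42"))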
I'm waiting for Zen, but I won't hold my breath on the APU. Even if they integrate HBM, it's going to be a rather large die with low yields if they try to cram in a GPU comparable to a modern mid-to-high-end card (which already has a die size bigger than most modern CPUs). Still, even a moderately powerful APU would work nicely in tandem with a future RX 490 or whatever AMD puts out for their Vega cards.
"And further on, there are larger SRAMs for L3 caches, where they are possible."
I believe L3 is always SRAM, so I'm confused by this. Also, I thought larger L3 caches were only an issue in terms of cost and perhaps power budgeting for the die. Are there other issues with moving to larger L3 caches?
A reduction in price isn't enough on its own; the growth in software memory usage also has to be slower than that price decline, and it has to stay slower for a long time.
Even then, it's rough. Consider that the IBM 704 from 1954 (the first machine to run Lisp) had a memory bandwidth of around 375 kB/s. That means that, in theory, it could need up to 12 TB to run for a year without reclaiming memory, although real-world usage would surely be much less.
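Quick back-of-the-envelope check of that 12 TB figure, assuming the machine writes fresh data at full bandwidth for a whole year and nothing is ever reclaimed:

    # Worst-case memory touched in a year at ~375 kB/s
    # (the 704 moved a 36-bit word roughly every 12 microseconds).
    bandwidth_bytes_per_sec = 375_000
    seconds_per_year = 365 * 24 * 60 * 60   # about 31.5 million seconds

    total_bytes = bandwidth_bytes_per_sec * seconds_per_year
    print(f"{total_bytes / 1e12:.1f} TB")   # -> 11.8 TB, i.e. roughly 12 TB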
So, just the most important ones, then. In a sense, processes are just a form of arena allocation. They don't eliminate the need for memory management; they are memory management (among many other things).
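To make the analogy concrete, here's a minimal arena-allocator sketch (names and sizes are mine, purely illustrative): nothing allocated from the arena is freed individually, everything goes at once when the arena is torn down, which is essentially what happens to a process's memory when it exits.

    # Minimal arena allocator: objects are carved out of one buffer with a
    # bump pointer, and the only "free" is throwing the whole arena away,
    # much like the OS reclaiming a process's address space on exit.
    class Arena:
        def __init__(self, size):
            self.buf = bytearray(size)  # single backing buffer
            self.offset = 0             # bump pointer

        def alloc(self, nbytes):
            if self.offset + nbytes > len(self.buf):
                raise MemoryError("arena exhausted")
            view = memoryview(self.buf)[self.offset:self.offset + nbytes]
            self.offset += nbytes
            return view

        def release(self):
            # No per-object frees: everything is reclaimed together.
            self.buf = bytearray(0)
            self.offset = 0

    arena = Arena(1024)
    a = arena.alloc(64)
    b = arena.alloc(128)
    arena.release()  # "process exit": all allocations disappear at once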
Even if memory is cheap, it still uses power, and therefore produces heat that has to be dissipated. In many high-end server designs, power and heat are the limiting factor, not the cost.
I have always wondered if it is possible to do TSV / stacked SRAM on top of or under the CPU die.
So instead of having 8 MB sitting on the same plane, the same die area of SRAM, more likely 16-32 MB, could sit under or on top of the CPU die.
The hurdle is heat - we can do something like this in mobile SoCs, which commonly place DRAM on top of the CPU (package-on-package). But for a TDP > 10 watts, the memory layer effectively insulates the main die from whatever thermal management is used, making it unworkable. Unfortunately, this problem stands to get worse as transistors shrink, since power density will keep going up.