How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?
My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that CPU power draw and the corresponding cooling requirements have tended to go up as well. I don't really know what power per operation has looked like there. (It has clearly improved, though: the power consumption of a desktop PC has gone up by maybe one order of magnitude, while its computational capacity has gone up by far more than that.)
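To put rough numbers on that hand-waving, here's a back-of-envelope sketch; every figure in it is an assumed, order-of-magnitude illustration rather than a measurement:

```python
# Rough back-of-envelope sketch (illustrative, assumed numbers only):
# if a desktop's compute grew ~10,000x while its power draw grew ~5x,
# energy per operation fell by roughly the ratio of the two.

old_ops_per_sec = 1e8     # assumed: older desktop CPU, order of magnitude
new_ops_per_sec = 1e12    # assumed: modern desktop CPU, order of magnitude
old_power_w = 30          # assumed: older CPU package power
new_power_w = 150         # assumed: modern CPU package power

old_joules_per_op = old_power_w / old_ops_per_sec
new_joules_per_op = new_power_w / new_ops_per_sec

print(f"old: {old_joules_per_op:.2e} J/op")
print(f"new: {new_joules_per_op:.2e} J/op")
print(f"energy per op improved ~{old_joules_per_op / new_joules_per_op:,.0f}x")
```

So even with power draw going up, energy per operation can still fall by three or more orders of magnitude; the question is whether that trend continues.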
A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?
Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing MacBooks with an astounding 8 GB of RAM.
Data center compute is arguable, as it tends to be catered to some niche, which makes comparisons confusing (Cerebras, for example, vs. GPU data centers vs. more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.
Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.
Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) shrinking transistors (following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to further improve transistor performance. These techniques added more manufacturing steps, but they were inexpensive enough that $/transistor still kept falling. In the last few years, we've had to pull off ever more expensive tricks, which has stalled the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.
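For reference, here is a minimal sketch of the idealized Dennard scaling rules mentioned above; it's the textbook simplification (constant electric field scaling), not a model of any real process node:

```python
# Idealized Dennard scaling for a linear shrink factor k > 1:
# dimensions and supply voltage scale as 1/k, clock frequency as k.
# Dynamic power per transistor ~ C * V^2 * f scales as 1/k^2, exactly
# cancelling the k^2 increase in transistor density, so power density
# stays constant while cost per transistor falls.

def dennard_scaling(k: float) -> dict:
    capacitance = 1 / k          # C shrinks with device dimensions
    voltage = 1 / k              # supply voltage scaled down with dimensions
    frequency = k                # shorter gate delay allows faster clocks
    power_per_transistor = capacitance * voltage**2 * frequency   # = 1/k^2
    density = k**2               # transistors per unit area
    return {
        "power_per_transistor": power_per_transistor,
        "transistor_density": density,
        "power_density": power_per_transistor * density,   # stays ~1
    }

print(dennard_scaling(1.4))   # roughly one node shrink: per-transistor power halves
```

The breakdown of this ideal picture (voltage stopped scaling, leakage grew) is exactly why the extra tricks became necessary.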
In any case, higher-performance transistors mean that you can get the same functionality with less power and less area, so iso-functionality chips become cheaper to build in bulk. This is especially true for older nodes; look at the absurdly low price of most microcontrollers.
On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking).
Cerebras's innovation was in making a wafer-scale chip possible at all, which is conventionally hard because of unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale the way any other circuit produced so far has.
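To see why defects are the obstacle, here's a minimal sketch using the simple Poisson yield model; the defect density and die areas below are assumed, illustrative values:

```python
import math

# Simple Poisson yield model: Y = exp(-D0 * A), the probability that a die
# of area A (cm^2) contains zero defects at defect density D0 (defects/cm^2).
# D0 and the areas are assumed, illustrative values.

D0 = 0.1   # assumed defect density, defects per cm^2

def poisson_yield(area_cm2: float, d0: float = D0) -> float:
    """Probability that a die of the given area has zero defects."""
    return math.exp(-d0 * area_cm2)

print(f"~1 cm^2 conventional die:  {poisson_yield(1):.1%}")
print(f"~460 cm^2 wafer-scale die: {poisson_yield(460):.2e}")
# Without built-in redundancy (spare cores, routing around bad regions),
# essentially no wafer-scale die would ever come out defect-free.
```

So the trick isn't avoiding defects, it's designing the chip so that defects can be tolerated, which is part of what makes such a product a different beast from ordinary dies.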
It may well drop in price in the future, especially once it becomes obsolete, but I don't expect it to ever reach consumer-level prices.
Wafer-scale chips have been attempted for decades, but none of the attempts before Cerebras resulted in a successful commercial product.
The main reason Cerebras has succeeded where previous attempts failed is not technical; it's market demand.
Before ML/AI training and inference, there was no application where wafer-scale chips could provide enough additional performance to justify their high cost.
I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but many of the more practical AI applications we see today don't run on cutting-edge models either. We're always going to have some fixed compute budget, and if a smaller model does the job fine, why wouldn't you use it and spend the rest on something else (or use all of it to run the smaller model faster)?