The rule of thumb I was taught was that going from a DSP/GPU to a custom ASIC gives you roughly a 10X advantage in performance per watt, which is pretty close to this. And look at how thoroughly Bitcoin mining ASICs outcompete GPUs.
Generally speaking, GPUs are already very good at working with float32s; in fact, they're usually much better with float32s than with float64s. The big advantages of an ASIC are mostly on the storage side, but an ASIC also lets you get away with non-IEEE floating-point formats that don't necessarily implement subnormals, NaN, etc.
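To make that concrete, here's a rough sketch in C of decoding a hypothetical non-IEEE 8-bit float (1 sign / 4 exponent / 3 mantissa bits, bias 7, all made up for illustration, not any real chip's format): there are no NaN/Inf encodings to special-case, and anything IEEE would treat as subnormal just flushes to zero. Skipping those cases is what makes the FP units simpler:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Hypothetical 8-bit float: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
     * No subnormals, no NaN, no Inf -- assumptions for illustration only. */
    static float decode_fp8(uint8_t bits) {
        int sign = (bits >> 7) & 0x1;
        int exp  = (bits >> 3) & 0xF;   /* 4-bit exponent field */
        int man  = bits & 0x7;          /* 3-bit mantissa field */
        if (exp == 0)                   /* no subnormals: flush to zero */
            return sign ? -0.0f : 0.0f;
        /* implicit leading 1; every other exponent pattern is an ordinary
         * value, so there's no NaN/Inf branch at all */
        float frac = 1.0f + man / 8.0f;
        return (sign ? -1.0f : 1.0f) * ldexpf(frac, exp - 7);
    }

    int main(void) {
        printf("%f\n", decode_fp8(0x40)); /* exp=8, man=0 -> 2.0 */
        printf("%f\n", decode_fp8(0x07)); /* IEEE subnormal -> flushed to 0.0 */
        return 0;
    }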
I doubt FP hardware size is the limiting factor in their implementation (it isn't in GPUs and definitely isn't in high-perf CPUs). More likely they came up with higher-level architectural tricks that let them specialize for machine learning (e.g. taking better advantage of locality in the application).
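For a toy illustration of the locality point, here's a tiled matrix multiply in C: each block of A and B gets loaded once and reused many times, which is the same data-reuse pattern an ML ASIC can bake directly into silicon (e.g. a systolic array). N, TILE, and the function name are just placeholders, not anyone's actual design:

    #include <stddef.h>

    #define N    512
    #define TILE 32

    /* C must be zero-initialized by the caller before accumulating. */
    void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N]) {
        for (size_t i0 = 0; i0 < N; i0 += TILE)
            for (size_t k0 = 0; k0 < N; k0 += TILE)
                for (size_t j0 = 0; j0 < N; j0 += TILE)
                    /* each TILE x TILE block now stays hot in cache (or
                     * on-chip SRAM) for TILE*TILE*TILE multiply-adds,
                     * instead of streaming from memory every time */
                    for (size_t i = i0; i < i0 + TILE; i++)
                        for (size_t k = k0; k < k0 + TILE; k++)
                            for (size_t j = j0; j < j0 + TILE; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }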