In a conv layer, weights are shared across many input positions. E.g. assume a 1x1 conv layer with a 28x28x3 input. You only need to load 3 weights even though there are effectively 28x28=784 different dot products. In practice, the input and output activations can be stored on chip as well (except for the first layer), which means the ratio of operations to DRAM accesses can be incredibly high. For some real-world examples, take a look at the classic Eyeriss paper[1], which finds ratios of 345 and 285 for AlexNet and VGG-16 respectively. You can also check out the TPU paper[2], which places the ratio at >1000 for some unnamed CNNs. Compare that to your analysis, which yields a ratio of 2.
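A rough back-of-the-envelope for that 1x1 conv case, assuming fp32 weights, a single output channel, and that activations never leave the chip (my assumptions, just to make the arithmetic concrete):

    # Toy data-reuse estimate for a 1x1 conv over a 28x28x3 input, 1 output channel
    H, W, C_in, C_out = 28, 28, 3, 1

    weights = C_in * C_out            # 3 weights fetched from DRAM once
    macs = H * W * C_in * C_out       # 2352 multiply-accumulates
    ops = 2 * macs                    # count a MAC as 2 ops (multiply + add)

    dram_bytes = weights * 4          # fp32: 4 bytes per weight, nothing else hits DRAM
    print(ops / dram_bytes)           # ~392 ops per DRAM byte in this toy case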
He has a point. There's a fair amount of data reuse in CNNs.
Hmm... it will depend on the CNN. There's probably a good neural network design that would take advantage of this architecture, i.e. a heavily recycled convolutional layer whose weights fit within the 32MB (load those weights once, use them across the whole picture).
So the whole NN doesn't necessarily have to fit inside the 32MB to be useful, but at least large portions have to (say, a 128x128 tile run through 20 hidden layers is only ~300kB of activations). Recycling that portion across a 1080x1920 input would be relatively cheap.
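A minimal sketch of that tiling math, assuming int8 activations and a single channel per layer (my assumptions; the exact footprint obviously depends on channel counts and precision):

    import math

    tile, layers = 128, 20
    bytes_per_act = 1                              # assumed int8, one channel per layer

    # Keep one 128x128 activation map per hidden layer resident on chip
    tile_kb = tile * tile * layers * bytes_per_act / 1024
    print(tile_kb, "kB per tile")                  # 320 kB, roughly the ~300kB above

    tiles = math.ceil(1080 / tile) * math.ceil(1920 / tile)
    print(tiles, "tiles to cover 1080x1920")       # 9 * 15 = 135 tiles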
I herp-derped early on; there do seem to be CNNs that would make good use of this architecture. Still, the memory bandwidth of that chip is very low. I'd expect GDDR6 or HBM2 to be clearly superior to the 68GB/s LPDDR4 they put in there.
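To see why the 68GB/s matters, here's a simple roofline-style estimate. The peak-compute number and the ops/byte intensity are placeholders I made up, and the GDDR6/HBM2 bandwidths are just typical figures, not anything from this chip's spec:

    # Roofline: attainable throughput = min(peak compute, bandwidth * arithmetic intensity)
    PEAK_TOPS = 10.0                       # placeholder peak compute, not from the spec

    def attainable_tops(bandwidth_gbs, ops_per_byte):
        return min(PEAK_TOPS, bandwidth_gbs * ops_per_byte / 1000.0)

    for name, bw in [("LPDDR4, 68 GB/s", 68), ("GDDR6, ~450 GB/s", 450), ("HBM2, ~900 GB/s", 900)]:
        print(name, attainable_tops(bw, ops_per_byte=50), "TOPS at 50 ops/byte")
    # LPDDR4 caps out at 3.4 TOPS here; the faster memories hit the 10 TOPS compute ceiling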