AMD would be able to do DRAM on package for the lowest-wattage "ultrabook" chips, at the cost of producing a very different package for them vs. the bigger laptops that are expected to have upgradable SODIMMs. But I doubt that this "co-location" is that big a deal for performance. Whatever memory frequency and timings Apple is using are likely achievable through a regular mainboard PCB, maybe at the cost of slightly more voltage. DDR4 on desktop is overclockable to crazy levels, and that signal path goes through a lot more (CPU package - socket pins - board traces - DIMM slots - DIMMs).
> stacked modules such that the side wall of the Mac Pro is a grid of 4 or more such modules each with co-located memory like the M1 has
Quad or more package NUMA topology?? The latency would absolutely suck.
Why would latency suck? 64 cores are only beneficial for parallelizable algorithms anyway -- and the most common class of parallelizable algorithm is data-parallel ... So -- shouldn't the hardware and OS be able to present the programmer with the illusion of uniform memory, and just automatically arrange for the processing to happen on the compute resources closest to the RAM / move the memory closer to the appropriate compute resource as required?
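For reference, that placement is mostly manual today. A rough sketch of what "put the work next to the memory" looks like from userspace on Linux with libnuma -- the node choice and buffer size below are arbitrary placeholders, nothing Apple- or AMD-specific:

```c
/* Rough sketch: explicit NUMA placement with libnuma on Linux (link with -lnuma).
 * This is the manual work a "uniform memory" illusion would have to automate. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = numa_max_node();       /* arbitrary example node */
    size_t len = 64UL << 20;          /* 64 MiB working set, arbitrary */

    /* Allocate pages on that node and run the calling thread there,
     * so the compute sits next to the memory it touches. */
    char *buf = numa_alloc_onnode(len, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    numa_run_on_node(node);

    memset(buf, 1, len);              /* touch the memory locally */
    printf("touched %zu bytes on node %d\n", len, node);

    numa_free(buf, len);
    return 0;
}
```

Without that explicit pinning, the kernel's default first-touch policy decides where pages land, which is where most of the cross-node latency complaints come from.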
Yeah, I'm no kernel developer, but I've been replying to anyone saying 'just stick n * M1 in it' that even AMD has been trying to move back toward more predictable memory access latency and fewer NUMA woes.
But in general we're moving toward even less uniform memory, with some of it living on a GPU. NUMA systems pretended all memory had the same latency, because C continues to pretend we're on a faster PDP-11, and that pretense seems like a step in the wrong direction for where high-performance computing is headed.
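To make the non-uniformity concrete: memory on a discrete GPU is a separate pool you allocate and copy into explicitly. A minimal sketch using the CUDA runtime C API (sizes arbitrary, nothing here is specific to any particular machine):

```c
/* Minimal sketch: GPU memory is its own pool, reached by explicit copies.
 * Uses the CUDA runtime API; build with nvcc. Sizes are arbitrary. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t len = 16UL << 20;                  /* 16 MiB, arbitrary */
    float *host = (float *)malloc(len);
    float *dev = NULL;

    if (host == NULL) return 1;

    /* Device memory lives on the GPU; the CPU can't just dereference it. */
    if (cudaMalloc((void **)&dev, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        free(host);
        return 1;
    }

    /* Moving data between the pools is an explicit (and comparatively slow)
     * copy over PCIe or whatever interconnect the system has. */
    cudaMemcpy(dev, host, len, cudaMemcpyHostToDevice);
    cudaMemcpy(host, dev, len, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    free(host);
    return 0;
}
```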