What an annoying article to read. "The AI workload of AI in a digital AI world that the AI world AI when it AIs. Also the bandwidth is higher. AaaaaaaaaIiiiiiiiiiii".
90% of the article is just finding new ways to integrate "AI" into a purely fluff sentence.
Ok, I should be fair, it's 4 paragraphs of fluff, 6 paragraphs of specs, then a fluff conclusion. It's almost like 2 different unrelated articles smashed into 1.
Yes, but… It is how it is. Give it a few years, and it’ll be some new fad. There is always a buzzword du jour. This has been more or less the case for the industry since the 1950s.
PAM3 is 3 levels per unit interval (~1.58 bits per symbol), not 3 bits per cycle as reported in this article. Although I suppose if you count a cycle as covering both clock edges, it's ~3.17 bits.
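For the curious, here's the arithmetic as a quick Python sketch. The 3-bits-over-2-symbols mapping is the usual description of GDDR7's encoding (not something from this article), so treat that part as an assumption:

    from math import log2

    # Ideal information content of one PAM3 symbol (3 voltage levels):
    bits_per_symbol = log2(3)   # ~1.585 bits per unit interval
    print(f"ideal PAM3: {bits_per_symbol:.3f} bits/UI")

    # Usual description of GDDR7's scheme: 3 binary bits mapped onto
    # 2 PAM3 symbols, since 3^2 = 9 states >= 2^3 = 8.
    encoded_bits_per_ui = 3 / 2
    print(f"GDDR7 encoding: {encoded_bits_per_ui:.3f} bits/UI")

    # Versus NRZ (2 levels, 1 bit/UI) that's the article's "50%" figure.
    print(f"gain over NRZ: {encoded_bits_per_ui - 1:.0%}")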
That's correct, but compute matters to some degree too. The larger the model, the more of a bottleneck memory becomes.
There are some older HBM cards with very high bandwidth, like the Radeon Pro VII, which has 1 TB/s of bandwidth like the RTX 3090 and 4090 but is notably slower at inference for smaller models since it has less compute in comparison. At least I think that was the consensus of some benchmarks people ran.
Even in a local setting, batched inference is useful to be able to run more "complex" workflows (with multiple, parallel LLM calls for a single interaction).
There is very little reason to optimize for just single stream inference at the expense of your batch inference performance.
With a typical transformer and a GPU the batch size that saturates the compute is at least hundreds. Otherwise (including typical size of 1 for local inference) you're memory bound.
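Back-of-envelope version, with ballpark figures for a high-end consumer card (both numbers are assumptions, not quoted specs):

    # Rough roofline estimate: at what batch size does a GEMM against FP16
    # weights stop being memory bound?
    peak_flops = 165e12   # dense FP16 tensor throughput, FLOP/s (assumption)
    mem_bw     = 1.0e12   # memory bandwidth, bytes/s (assumption)

    # For y = x @ W, with W of shape (d_in, d_out) in FP16 and batch size B:
    #   FLOPs ~= 2 * B * d_in * d_out
    #   bytes ~= 2 * d_in * d_out      (weight traffic dominates at small B)
    # so arithmetic intensity ~= B FLOPs per byte.
    needed = peak_flops / mem_bw      # FLOPs/byte needed to keep the ALUs fed
    print(f"need ~{needed:.0f} FLOPs/byte, i.e. a batch of roughly {needed:.0f}")
    print("at batch size 1 you get ~1 FLOP/byte, so well under 1% of peak compute")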
GPUs have much more memory bandwidth than CPUs. Meanwhile, the ALU:bandwidth ratio of both GPUs and CPUs has been growing exponentially since the 90s at least. So the FLOPs per byte required to not be starved on memory is really large at this point. We're at the point where optimization is 90% about SRAM utilization and you worry about the math maybe at the last step.
For inference, it often is. Though for most consumer parts the bigger concern is not having enough VRAM rather than the VRAM not being fast enough. Copying from system RAM to VRAM is far slower.
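Rough illustration, with ballpark bandwidth figures (both are assumptions):

    # How long one full pass over a 13B-parameter FP16 model's weights takes
    # when streamed from VRAM vs. over PCIe from system RAM.
    weights_bytes = 13e9 * 2    # 13B params * 2 bytes each (FP16), ~26 GB
    vram_bw = 1.0e12            # ~1 TB/s for GDDR6X-class VRAM (assumption)
    pcie_bw = 32e9              # ~32 GB/s for PCIe 4.0 x16, best case (assumption)

    print(f"from VRAM: {weights_bytes / vram_bw * 1e3:.0f} ms per pass")
    print(f"over PCIe: {weights_bytes / pcie_bw:.1f} s per pass")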
AMD has even officially announced at this point that they will not compete on high-end consumer and workstation GPUs for years to come. Intel can't (Gaudi is not general-purpose, so it has too limited an appeal for that market).
"With this new encoding scheme, GDDR7 can transmit “3 bits of information” per cycle, resulting in a 50% increase in data transmission compared to GDDR6 at the same clock speed."
Sounds pretty awesome. I would think that it's going to be much harder to achieve the same clock speeds.
It reflects the data rate. Since DDR memory transfers data on both the rising and falling edges of the clock signal, DDR RAM on a 3000 MHz clock is said to make 6000 megatransfers per second in normal usage. 48 GT/s would imply a 24 GHz clock if it were normal DDR, which seems absurd.
Edit: It seems GDDR6 is in reality "quad data rate" memory, and GDDR7 packs even more bits in per clock using PAM3 signaling, so if I'm reading this right, maybe they're saying the chips can run at up to an 8 GHz base clock?
8 GHz * 6 bits per cycle * 32-bit bus / 8 bits per byte = 192 GB/s.
Edit again: It seems I undercounted the number of bits per pin per cycle of the base clock for GDDR7; it's more like 12 (so a 4 GHz max base clock) or even 24 (so 2 GHz), which seems a lot more reasonable.
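Same arithmetic for all three guesses, treating the 48 GT/s figure as ~48 Gb/s per pin on a 32-bit chip (which is an assumption about how the figure is quoted):

    # Back-solve the base clock from 48 Gb/s per pin under different guesses
    # for bits per pin per base-clock cycle, for a 32-bit GDDR7 chip.
    per_pin_gbps = 48   # per-pin data rate from the thread (assumption: GT/s ~ Gb/s here)
    bus_bits = 32       # per-chip interface width

    for bits_per_clock in (6, 12, 24):
        base_clock_ghz = per_pin_gbps / bits_per_clock
        chip_gbs = per_pin_gbps * bus_bits / 8
        print(f"{bits_per_clock:2d} bits/pin/clock -> {base_clock_ghz:.0f} GHz base clock, "
              f"{chip_gbs:.0f} GB/s per chip")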
Gigatransfers per second (GT/s) measures the rate of transfer operations rather than the data rate itself: each "transfer" is one unit of data moved across the bus, regardless of how many bits that unit carries. With DDR there are two transfers per clock cycle (one per clock edge), so GT/s doesn't map one-to-one onto clock frequency either.
More like NeuralRAM? We have precedents. Back in the 90s, Sun and Mitsubishi came up with 3DRAM, which replaced the read-modify-write (RMW) cycle in Z-buffering and alpha blending with a single (conditional) write, moving the arithmetic into the memory chips.