What an annoying article to read. "The AI workload of AI in a digital AI world that the AI world AI when it AIs. Also the bandwidth is higher. AaaaaaaaaIiiiiiiiiiii".
90% of the article is just finding new ways to integrate "AI" into a purely fluff sentence.
Ok, I should be fair, it's 4 paragraphs of fluff, 6 paragraphs of specs, then a fluff conclusion. It's almost like 2 different unrelated articles smashed into 1.
Yes, but… It is how it is. Give it a few years, and it’ll be some new fad. There is always a buzzword du jour. This has been more or less the case for the industry since the 1950s.
PAM3 is 3 levels per unit interval (~1.58 bits per symbol), not 3 bits per cycle as reported in this article. Although I suppose if you count a cycle as covering both clock edges, it's ~3.17 bits.
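For the curious, here's the arithmetic as a quick Python sketch. The 3-bits-over-2-symbols mapping is the usual description of GDDR7's encoding (not something from this article), so treat that part as an assumption:

    from math import log2

    # Ideal information content of one PAM3 symbol (3 voltage levels):
    bits_per_symbol = log2(3)   # ~1.585 bits per unit interval
    print(f"ideal PAM3: {bits_per_symbol:.3f} bits/UI")

    # Usual description of GDDR7's scheme: 3 binary bits mapped onto
    # 2 PAM3 symbols, since 3^2 = 9 states >= 2^3 = 8.
    encoded_bits_per_ui = 3 / 2
    print(f"GDDR7 encoding: {encoded_bits_per_ui:.3f} bits/UI")

    # Versus NRZ (2 levels, 1 bit/UI) that's the article's "50%" figure.
    print(f"gain over NRZ: {encoded_bits_per_ui - 1:.0%}")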
That's correct, but compute matters to some degree too. The larger the model, the more of a bottleneck memory becomes.
There are some older HBM cards with very high bandwidth, like the Radeon Pro VII, which has 1 TB/s of bandwidth like the RTX 3090 and 4090 but is notably slower at inference for smaller models since it has less compute in comparison. At least I think that was the consensus of some benchmarks people ran.
Even in a local setting, batched inference is useful to be able to run more "complex" workflows (with multiple, parallel LLM calls for a single interaction).
There is very little reason to optimize for just single stream inference at the expense of your batch inference performance.
With a typical transformer and a GPU the batch size that saturates the compute is at least hundreds. Otherwise (including typical size of 1 for local inference) you're memory bound.
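Back-of-envelope version, with ballpark figures for a high-end consumer card (both numbers are assumptions, not quoted specs):

    # Rough roofline estimate: at what batch size does a GEMM against FP16
    # weights stop being memory bound?
    peak_flops = 165e12   # dense FP16 tensor throughput, FLOP/s (assumption)
    mem_bw     = 1.0e12   # memory bandwidth, bytes/s (assumption)

    # For y = x @ W, with W of shape (d_in, d_out) in FP16 and batch size B:
    #   FLOPs ~= 2 * B * d_in * d_out
    #   bytes ~= 2 * d_in * d_out      (weight traffic dominates at small B)
    # so arithmetic intensity ~= B FLOPs per byte.
    needed = peak_flops / mem_bw      # FLOPs/byte needed to keep the ALUs fed
    print(f"need ~{needed:.0f} FLOPs/byte, i.e. a batch of roughly {needed:.0f}")
    print("at batch size 1 you get ~1 FLOP/byte, so well under 1% of peak compute")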
GPUs have much more memory bandwidth than CPUs. Meanwhile, the ALU:bandwidth ratio of both GPUs and CPUs has been growing exponentially since the 90s at least. So the FLOPs per byte required to not be starved on memory is really large at this point. We're at the point where optimization is 90% about SRAM utilization and you worry about the math maybe at the last step.
For inference, it often is. Though for most consumer parts the bigger concern is not having enough VRAM rather than the VRAM not being fast enough. Copying from system RAM to VRAM is far slower.
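Rough illustration, with ballpark bandwidth figures (both are assumptions):

    # How long one full pass over a 13B-parameter FP16 model's weights takes
    # when streamed from VRAM vs. over PCIe from system RAM.
    weights_bytes = 13e9 * 2    # 13B params * 2 bytes each (FP16), ~26 GB
    vram_bw = 1.0e12            # ~1 TB/s for GDDR6X-class VRAM (assumption)
    pcie_bw = 32e9              # ~32 GB/s for PCIe 4.0 x16, best case (assumption)

    print(f"from VRAM: {weights_bytes / vram_bw * 1e3:.0f} ms per pass")
    print(f"over PCIe: {weights_bytes / pcie_bw:.1f} s per pass")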
AMD has even officially announced at this point that they will not compete on high-end consumer and workstation GPUs for years to come. Intel can't (Gaudi is not general-purpose, so it has too limited an appeal for that market).
"With this new encoding scheme, GDDR7 can transmit “3 bits of information” per cycle, resulting in a 50% increase in data transmission compared to GDDR6 at the same clock speed."
Sounds pretty awesome. I would think that it's going to be much harder to achieve the same clock speeds.
It reflects the data rate. Since DDR memory transfers data on both the rising and falling edges of the clock signal, DDR RAM on a 3000 MHz clock is said to make 6000 megatransfers per second in normal usage. 48 GT/s would imply a 24 GHz clock if it were normal DDR, which seems absurd.
Edit: It seems GDDR6 is in reality "quad data rate" memory, and GDDR7 packs even more bits in per clock using PAM3 signaling, so if I'm reading this right, maybe they're saying the chips can run at up to an 8 GHz base clock?
8 GHz * 6 bits per cycle * 32-bit bus / 8 bits per byte = 192 GB/s.
Edit again: It seems I undercounted the number of bits per pin per cycle of the base clock for GDDR7; it's more like 12 (so a 4 GHz max base clock) or even 24 (so 2 GHz), which seems a lot more reasonable.
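Same arithmetic for all three guesses, treating the 48 GT/s figure as ~48 Gb/s per pin on a 32-bit chip (which is an assumption about how the figure is quoted):

    # Back-solve the base clock from 48 Gb/s per pin under different guesses
    # for bits per pin per base-clock cycle, for a 32-bit GDDR7 chip.
    per_pin_gbps = 48   # per-pin data rate from the thread (assumption: GT/s ~ Gb/s here)
    bus_bits = 32       # per-chip interface width

    for bits_per_clock in (6, 12, 24):
        base_clock_ghz = per_pin_gbps / bits_per_clock
        chip_gbs = per_pin_gbps * bus_bits / 8
        print(f"{bits_per_clock:2d} bits/pin/clock -> {base_clock_ghz:.0f} GHz base clock, "
              f"{chip_gbs:.0f} GB/s per chip")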
Gigatransfers per second (GT/s) measures the rate of transfer operations rather than the data rate itself: each "transfer" is one unit of data moved across the bus, regardless of how many bits that unit carries. With DDR there are two transfers per clock cycle (one per clock edge), so GT/s doesn't map one-to-one onto clock frequency either.
More like NeuralRAM? We have precedents. Back in the 90s, Sun and Mitsubishi came up with 3DRAM, which replaced the read-modify-write (RMW) cycle in Z-buffering and alpha blending with a single (conditional) write, moving the arithmetic into the memory chips.