Hacker News

> That's what the cache hierarchies are for

That’s the core point, though. When you batch, each weight tile is loaded from VRAM into cache and registers once and then reused across every request in the batch. The model runs step by step through its layers, streaming different weights from VRAM along the way; batching amortizes that traffic.

I agree that RAM-to-VRAM transfer matters too, but I think the key speedup from inference batching is the point above.
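To make that concrete, here's a minimal sketch of the arithmetic intensity of one linear layer at different batch sizes. The layer dimensions and fp16 weight size are hypothetical; the point is the ratio, not the exact numbers.

```python
# Sketch: why batching speeds up memory-bound inference.
# Hypothetical layer sizes; only the FLOPs-per-byte ratio matters.

def arithmetic_intensity(d_in, d_out, batch, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one linear layer.

    A (d_out x d_in) weight matrix applied to `batch` activation
    vectors costs 2 * d_in * d_out * batch FLOPs (multiply + add),
    while the weights are read from VRAM once regardless of batch.
    """
    flops = 2 * d_in * d_out * batch
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

# Batch 1: each fp16 weight byte fetched supports only 1 FLOP.
print(arithmetic_intensity(4096, 4096, batch=1))   # 1.0
# Batch 32: the same weight traffic supports 32x the compute.
print(arithmetic_intensity(4096, 4096, batch=32))  # 32.0
```

At batch 1 the layer is hopelessly memory-bound; at batch 32 the same VRAM traffic feeds 32x the FLOPs, which is exactly the reuse described above.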



Not really. Registers are irrelevant here; they are not the bottleneck.


Computation happens in the registers. If you’re not moving data into registers, you aren’t doing any compute.


Obviously, yes, but NVIDIA's Ampere/Hopper architectures have 64K 32-bit registers per SM. The A100 has 108 SMs and the H100 has 132, so do the math: registers aren't the bottleneck.
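Doing that math explicitly (64K 32-bit registers per SM is 256 KiB of register file per SM):

```python
# Back-of-the-envelope: aggregate register-file capacity on A100/H100.
regs_per_sm = 64 * 1024        # 64K registers per SM
bytes_per_reg = 4              # 32-bit registers
rf_per_sm_kib = regs_per_sm * bytes_per_reg / 1024  # 256 KiB per SM

for name, sms in [("A100", 108), ("H100", 132)]:
    total_mib = rf_per_sm_kib * sms / 1024
    print(f"{name}: {sms} SMs x {rf_per_sm_kib:.0f} KiB = {total_mib:.1f} MiB")
# A100: 108 SMs x 256 KiB = 27.0 MiB
# H100: 132 SMs x 256 KiB = 33.0 MiB
```

Tens of MiB of register file running at full core clock, versus model weights measured in tens of GB that have to stream from VRAM each step, is why the bottleneck is memory bandwidth, not registers.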



