You can estimate the impact of context length by doing a back-of-the-envelope calculation on KV cache size: 2 (for K and V) * num_layers * num_attention_heads * head_dim * bytes_per_element * batch_size * sequence_length
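For example, here's that formula as a tiny Python helper; the model dimensions below are made up (roughly Llama-like) rather than numbers from any specific post:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   bytes_per_element, batch_size, seq_len):
    # 2x because both K and V are cached for every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * batch_size * seq_len

size_gib = kv_cache_bytes(
    num_layers=32,
    num_kv_heads=8,       # GQA; use the full attention head count for plain MHA
    head_dim=128,
    bytes_per_element=2,  # fp16/bf16
    batch_size=1,
    seq_len=8192,
) / 2**30
print(f"{size_gib:.2f} GiB")  # exactly 1.00 GiB for these assumed dimensions
```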
It's particularly useful in memory-bound workloads like batch size = 1 LLM inference, where you're bottlenecked by how quickly you can send weights to your GPU. This is why, at least in torchao, we strongly recommend people try out int4 quantization.
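As a rough sketch (exact import paths and defaults can differ between torchao versions, so double-check against the README), trying int4 weight-only quantization looks something like this:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# Toy stand-in for a real checkpoint; any module with nn.Linear layers works
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model = model.to(device="cuda", dtype=torch.bfloat16)

# Weight-only int4 quantization; group_size trades accuracy vs. speed
quantize_(model, int4_weight_only(group_size=128))

# torch.compile fuses the dequant into the matmul kernels
model = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    model(torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda"))
```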
At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and ARM kernels in this blog. If you have any questions about quantization, or performance more generally, feel free to let me know!
What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared against.
Please ignore my previous comments; I double-checked with the model developers and here's the correction. Vanilla PTQ means no fancy quantization algorithm like SpinQuant, AWQ, etc. was applied. It just applied the same quantization scheme mentioned in the post (4-bit per-group symmetric weights with group_size=32, plus 8-bit dynamic per-token activations).
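For anyone unfamiliar with the terminology, here's a minimal sketch of what per-group symmetric 4-bit weight quantization with group_size=32 means (ignoring bit packing and all the production details torchao actually handles):

```python
import torch

def quantize_int4_per_group(w: torch.Tensor, group_size: int = 32):
    """Symmetric per-group quantization of a 2D weight into the int4 range."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to +/-7
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-6)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q, scales, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(64, 128)
q, scales = quantize_int4_per_group(w)
w_hat = dequantize(q, scales, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error
```

The 8-bit dynamic per-token activation part is the same idea applied along each token's activation vector, with the scales computed on the fly at inference time.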
So this should be referring to w8a8 (weights and activations in 8 bit)
So this is gonna be 8-bit weights, 8-bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because, as far as I can tell, they don't do activation quantization.
Not that I know of for this study. At least within the specific scope of torchao, we want to make it easier for researchers to create new quantization algorithms in Python and have those algorithms run fast; you can see a lot of those algorithms here: https://github.com/pytorch/ao/tree/main/torchao/prototype
So, for example, we can accelerate AWQ and GPTQ by using a fast int4 kernel called tinygemm.
The issue here is that memory in PyTorch is byte-addressable, and that's a limitation we can't solve without making a lot more changes to PyTorch. But in your specific case, if you'd like to pack more data into `values`, you can use a combination of clever bit shifting, torch.cat, and other bit-twiddling PyTorch ops to pack more data. It's a trick we use quite heavily in torchao.
It's a great question! Int4 is an easy one to understand. PyTorch supports int8 but not int4, so what you can do is "pack" two int4 values into a single int8 value. You still get speedups even without hardware support because you're sending less data to the GPU, and workloads like small batch size LLM inference are memory bandwidth bound rather than compute bound. So your intuition is indeed correct: you pack the values, and before doing a matmul you "unpack" them back into int8 and then upcast to fp16 to do the matmul.
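Here's a minimal sketch of that pack/unpack trick in plain PyTorch ops (not torchao's actual kernels), assuming the values are already in the unsigned 4-bit range 0-15:

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (stored in uint8, range 0-15) into one byte."""
    assert x.shape[-1] % 2 == 0
    lo, hi = x[..., 0::2], x[..., 1::2]
    return lo | (hi << 4)          # low nibble, high nibble

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the two 4-bit values from each byte."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten(-2)

x = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(x)), x)   # lossless round trip, half the bytes

# Before a matmul you would unpack, shift back to signed int4, and upcast:
w_fp16 = (unpack_int4(pack_int4(x)).to(torch.int16) - 8).to(torch.float16)
```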
Hi! I'm Mark from the PyTorch team at Meta and work on torchao. If you have any questions about the library or really anything at all about performance, don't hesitate to ask!
A minor nitpick on the copy (and even then, it might just be me): I find "97% speedup" and "50% speedup" really hard to parse; a "30x speedup" or a "97% reduction in time taken" immediately tells me what is being achieved!
Great results once I get my head around them, though!
That's why it's confusing: a "2x speedup" would clearly indicate 200% of the current speed, so it's unclear whether "97% speedup" is a multiple (presumably not, because that would be a slowdown), a reduction in time taken (which was my assumption), or an increase in speed (more of something per unit of time).
I guess you are right and it's probably the latter, but obviously better language would have avoided any doubt.
Hi Mark, the library looks cool, excited to try it out. Coincidentally I am starting work on a project investigating a lot of post-training quantization methods. I read the blog and I am curious to understand what kind of overheads are involved in quantizing a layer?
There's a bunch of overhead associated with PTQ, but the TL;DR is that much of it goes away when you're using `torch.compile()` and `torchao.autoquant()`.
Essentially the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small, because quantizing your weights, for example, reduces memory bandwidth pressure. But for small layers, the overhead of potentially looking up a table, reading scaling factors, quantizing/dequantizing, and finally handling zero points might not be worth it.
However, even if such overhead exists you can still quantize your model and make it smaller; the problem is it might not be faster. We solve the speed problem in two ways: `torch.compile()` will fuse operations like a dequant and a matmul into a single kernel, and `torchao.autoquant()` will do kernel-level profiling to see whether a layer is actually made faster by quantizing it, and if not it skips quantizing that layer.
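In practice the pattern looks roughly like this (a sketch; check the torchao README for the exact current API):

```python
import torch
import torch.nn as nn
import torchao

# Toy model standing in for a real network with a mix of layer sizes
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 64))
model = model.to(device="cuda", dtype=torch.bfloat16)

# Compile first so dequant + matmul get fused, then let autoquant benchmark
# quantized vs. unquantized kernels per layer and keep whichever is faster.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# The kernel-level profiling happens on the first real inputs you run
with torch.no_grad():
    model(torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda"))
```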
First off, well done, this looks exciting. I haven't had a chance to interact with the library yet; should torchao be seen as a dev-friendly quantization interface? I.e., if my team were working on new quantization techniques, does the API provide easy tooling for implementing and benchmarking new quantization algorithms? Or is this closer to a toolbox of (generally) finished products?
It's both! For this blog we decided to discuss our best end-user-facing numbers to keep things simple. We briefly hint at our contributor guide here https://github.com/pytorch/ao/issues/391, which gives a tour of the APIs we provide for developers implementing new algorithms.
But we have had quantization algorithm developers such as HQQ or Autoround merge their code in to get composability and serialization for free. We view quantization algorithms as the top layer; going down, you have quantized tensors, then quant primitives like dequant/quant, and finally basic dtypes like uint1-7 and float3-8. Personally, the reason I spent so much time on AO was that I was hoping we could make it easier for people to express their quantization algorithms in easy-to-read PyTorch code, and if they must use custom kernels, we also have some tutorials on how to integrate custom CUDA and Triton ops.
Most of those discussions have been happening in #torchao on discord.gg/gpumode, so if you need to chat back and forth feel free to reach out to the team there; otherwise GitHub also works.
Most of our performance relies on leveraging torch.compile, which generates Triton kernels that run fast on CPU and GPU but not on MPS, since Triton does not support generating Metal kernels. So you lose the nice story of writing low-bit code in pure PyTorch and still having it run fast.
In these cases the only path forward we have is writing custom Metal kernels and plugging those in. That work is still ongoing and we'll hopefully have more to share soon.
This might not be the right place for this question but, as someone who has made a couple of very modest MPS backend contributions, I'm curious: why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate backend?
It mostly comes down to what's fastest to develop: it's faster to write a few custom kernels than it is to build a new compiler backend.
Granted after more upfront effort compilers are just such a significant UX boost that indeed you are making me question why I don't spend more time working on this myself lol
But that's waiting for Blackwell to be released so we get the hardware support. So the recommendation for now would be to use either fp8 training or int8 training.
There are different tradeoffs: spinning up a separate repo is what we call "out of core", versus having everything in PyTorch "in core".
Basically, PyTorch is a large library where CI takes a long time to run, which means merging code is hard, adding new dependencies is challenging, and there are stringent constraints on BC-breaking changes.
Instead, what torchao (and many other repos like torchtune, torchchat, and torchtitan) did was move out of core. That helps keep the core PyTorch library leaner with a smaller binary size, and it really lets the out-of-core team focus on optimizing for their own needs.
Unfortunately, the argument for which approach is better changes over time. For example, torch.compile initially started out of core as a new repo called torchdynamo so it could move fast, but it eventually merged back in because everyone wanted to use it. torch.compile dev velocity is still quite fast, so now we have to tell people to use nightlies instead of official stable releases, which has led some people to ask me why we don't move torch.compile out of core.
My 2c is that the ecosystem will be much stronger and teams can move faster if they develop out of core, so that's the tradeoff we picked for torchao. We managed, for example, to merge a few custom CPP kernels like fp6 or Marlin that would have been challenging to motivate in core, since those are still quite experimental and need to stand the test of time.
nvtop or nvidia-smi gives you a good macro overview, but I personally have found that utilization (EDIT: as reported by nvidia-smi) is actually a poor proxy for how fast your workload can be, beyond just confirming that a GPU is indeed being used.
I agree that the utilization reported by nvidia-smi is a poor proxy for performance. FWIW, I've found that for the same architecture the power consumption reported in nvtop very often correlates nicely with training performance, and peak performance is always at peak power consumption. Agreed on your advice about tuning your architecture details, but once that's fixed and you have simple things to debug like memory usage, batch size, and dataloading bottlenecks, the raw power metric is typically a quick proxy. I find temperature is a second useful macro metric; you want to be at max power draw and max allowed temperature at all times, but not exceed the temperature where you throttle.
That's hard to argue with. Of course power draw is a direct measure of hardware utilization, but it doesn't translate very well to a measure of GPU code efficiency.
Often you can squeeze out another order of magnitude of performance by rewriting the kernel and the power draw will always stay capped at whatever the maximum is. I'd say GPU power consumption is interesting if you're CPU bound and struggling to feed the GPU enough data and/or tasks.
FLOPs utilization is arguably the industry standard metric for efficiency right now and it should be a good first approximation of how much performance is left on the table.
But if you mean the reported utilization in nvtop is misleading I completely agree (as someone who uses it daily).
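For reference, FLOPs utilization (often called MFU, for model FLOPs utilization) is usually estimated with a back-of-the-envelope calculation like the one below; the 6 * params FLOPs-per-token rule of thumb for training and the 312 TFLOP/s bf16 peak for an A100 are assumptions here, and the throughput number is made up:

```python
# Rough model-FLOPs-utilization (MFU) estimate for LLM training
params = 7e9                # model parameter count (assumed)
tokens_per_second = 3000    # measured training throughput (made-up number)
peak_flops = 312e12         # A100 bf16 dense peak, FLOP/s (assumed hardware)

# ~6 FLOPs per parameter per token covers the forward + backward passes
achieved_flops = 6 * params * tokens_per_second
mfu = achieved_flops / peak_flops
print(f"MFU: {mfu:.1%}")    # ~40% for these numbers
```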
I’ve been meaning to dig into the source/docs to see what’s going on. The power usage seems to be a more reliable indicator of actual hardware utilization, at least on nvidia gear.
Thanks. Some people were having random problems installing WSL on their systems and I found this was the easiest solution (though based on their card models, they appeared to have much older machines).
There is no need to install Docker Desktop just to run nvidia-smi in WSL; the Windows directory containing the nvidia-smi binary is mounted inside a WSL instance and added to PATH automatically by WSL on instance startup.
As an aside: there is no need to install Docker Desktop just to use Docker containers in WSL either, unless you want a Windows GUI to manage your containers. Just follow the official documentation for installing Docker in your Linux distro of choice, or simply run `sudo apt install docker.io` in the default WSL Ubuntu distro. Docker will work just fine with an up-to-date WSL.
Further aside: it's possible to have both Docker Desktop and the normal Linux docker.io package installed on WSL. They work in isolation; the easy way to know which one is active is to check whether Docker Desktop is running. I wouldn't recommend this setup, though...
This was a lot of fun to read, really enjoyed the journey from game design to product design to ML. One thing I was hoping to ask was how come you felt the need to procedurally generate levels? I've often heard debates around whether content needs to be "hand-crafted" to be worth another human's time but curious to hear your take on this since it seems like you implemented both ends of the spectrum with endless and classic.
Also, do you feel like you now have a general set of principles for procedurally generating levels that apply to games outside of Echo Chess? Puzzle games are great since they're well scoped for indie devs, and I'd imagine a lot of would-be puzzle designers would burn out before they create a few dozen to a few hundred levels.
> how come you felt the need to procedurally generate levels?
Four reasons:
(1) demand for new levels was too high - game traction exceeded my laziness threshold as a designer;
(2) puzzle lovers suffered from lack of replayability - game was effectively punishing core users who engage with it the most;
(3) designing a 'good' difficulty curve required me to quantify a level's difficulty objectively - anytime I design a puzzle or strategy game, I also try to design an algorithm that can solve it to get a general sense of how the different levels compare in difficulty;
(4) I tend to get obsessively curious about stuff - wasn't sure if it's feasible to have a 100% real-time procedural gen of chess mazes so I decided to do it to find out.
> I've often heard debates around whether content needs to be "hand-crafted" to be worth another human's time
LLM jokes aside, I think this is a great point and I'm not sure exactly what the right answer is here. I ended up keeping both Classic and Endless modes for this exact reason. If there's enough interest, I'll add a manual level generator for the community of 'humans' to submit their own hand-crafted creations for others to play.
> a general set of principles to procedurally generate levels that apply to games outside of echo chess?
Good question. I think it really comes down to this:
(a) can you formalize the concept of a 'level' and its components for your game in such a way that every game state (at a minimum) can be encoded/decoded easily and efficiently?
(b) can you parametrize this formalism so that malleable randomizations are connected to every meaningful component of a level, to generate (hypothetically) infinite variability?
(c) do you have a confident way to make sure whatever 'level' you're churning out is as playable, achievable, and as fun as the manually crafted ones? The fun part is the hardest one to automate. That's where I think you need a very strong grasp of what actually makes your puzzle game fun. All the rest is likely generalizable to other games (rough sketch below).
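Not how Echo Chess actually does it, but a generic, hypothetical illustration of (a)-(c): represent a level as plain data, tie every component to a randomizable parameter, and only keep candidates that a solver says are playable and near the target difficulty:

```python
import random
from dataclasses import dataclass

@dataclass
class Level:                 # (a) the level formalized as plain data
    width: int
    height: int
    walls: set               # blocked (x, y) cells
    pieces: dict             # (x, y) -> piece type

def random_level(rng: random.Random) -> Level:
    # (b) every meaningful component is tied to a randomizable parameter
    w, h = rng.randint(5, 8), rng.randint(5, 8)
    cells = [(x, y) for x in range(w) for y in range(h)]
    walls = set(rng.sample(cells, k=rng.randint(2, 6)))
    free = [c for c in cells if c not in walls]
    pieces = {c: rng.choice(["knight", "bishop", "rook"])
              for c in rng.sample(free, k=rng.randint(3, 6))}
    return Level(w, h, walls, pieces)

def solve(level: Level):
    # (c) stand-in for a real solver. A real implementation would search for a
    # winning path and return None if unsolvable; here we fake a difficulty
    # score from the piece count just to keep the sketch runnable.
    return float(len(level.pieces))

def generate_playable(rng, target_difficulty, max_tries=10_000):
    for _ in range(max_tries):
        candidate = random_level(rng)
        score = solve(candidate)
        if score is not None and abs(score - target_difficulty) < 1.0:
            return candidate
    return None

level = generate_playable(random.Random(0), target_difficulty=5.0)
```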
> puzzle designers would burn out before they create a few dozen-hundred levels
Am I missing something about crypto lending? Your deposit can't be FDIC insured, and if borrowers are defaulting, that means your deposit disappears. You can't bail out a crypto bank by printing crypto.
The appreciation of Bitcoin has been so staggering that it also makes getting interest back on your deposits seem rather outdated.
Bitcoin banks have been tried and have failed; I should know, the US Department of Justice sent me victim notification emails about it for years.
Take it from someone who has operated in this space long enough: beware of interest rates on crypto; the money has to come from somewhere to cover the vig.
Fractional reserve banking is the thing that is to be avoided and ultimately fails.