You can estimate the impact of context length by doing a back-of-the-envelope calculation on KV cache size: 2 (for K and V) * num_layers * num_attention_heads * head_dim * bytes_per_element * batch_size * sequence_length
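For example, here's that formula as a tiny Python helper; the model dimensions below are made up (roughly Llama-like) rather than numbers from any specific post:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   bytes_per_element, batch_size, seq_len):
    # 2x because both K and V are cached for every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * batch_size * seq_len

size_gib = kv_cache_bytes(
    num_layers=32,
    num_kv_heads=8,       # GQA; use the full attention head count for plain MHA
    head_dim=128,
    bytes_per_element=2,  # fp16/bf16
    batch_size=1,
    seq_len=8192,
) / 2**30
print(f"{size_gib:.2f} GiB")  # exactly 1.00 GiB for these assumed dimensions
```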
It's particularly useful in memory-bound workloads like batch size = 1 LLM inference, where you're bottlenecked by how quickly you can send weights to your GPU. This is why, at least in torchao, we strongly recommend people try out int4 quantization.
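As a rough sketch (exact import paths and defaults can differ between torchao versions, so double-check against the README), trying int4 weight-only quantization looks something like this:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

# Toy stand-in for a real checkpoint; any module with nn.Linear layers works
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
model = model.to(device="cuda", dtype=torch.bfloat16)

# Weight-only int4 quantization; group_size trades accuracy vs. speed
quantize_(model, int4_weight_only(group_size=128))

# torch.compile fuses the dequant into the matmul kernels
model = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    model(torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda"))
```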
At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and ARM kernels in this blog. If you have any questions about quantization, or performance more generally, feel free to let me know!
What was the "vanilla post-training quantization" used for comparison? There are 22 GGUF quantization variants smaller than 16 bits per weight and I can't tell which one is being compared against.
Please ignore my previous comments; I double-checked with the model developers and here's the correction. Vanilla PTQ means no fancy quantization algorithm like SpinQuant, AWQ, etc. was applied. It just applied the same quantization scheme mentioned in the post (4-bit per-group symmetric weights with group_size=32, plus 8-bit dynamic per-token activations).
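For anyone unfamiliar with the terminology, here's a minimal sketch of what per-group symmetric 4-bit weight quantization with group_size=32 means (ignoring bit packing and all the production details torchao actually handles):

```python
import torch

def quantize_int4_per_group(w: torch.Tensor, group_size: int = 32):
    """Symmetric per-group quantization of a 2D weight into the int4 range."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to +/-7
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-6)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q, scales, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(64, 128)
q, scales = quantize_int4_per_group(w)
w_hat = dequantize(q, scales, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error
```

The 8-bit dynamic per-token activation part is the same idea applied along each token's activation vector, with the scales computed on the fly at inference time.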
So this should be referring to w8a8 (weights and activations in 8 bit)
So this is gonna be 8-bit weights, 8-bit activations, group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because, as far as I can tell, they don't do activation quantization.
Not that I know of for this study. At least within the specific scope of torchao, we want to make it easier for researchers to create new quantization algorithms in Python and have those algorithms run fast; you can see a lot of those algorithms here: https://github.com/pytorch/ao/tree/main/torchao/prototype
So, for example, we can accelerate AWQ and GPTQ by using a fast int4 kernel called tinygemm.
The issue here is that memory in PyTorch is byte-addressable, and that's a limitation we can't solve without making a lot more changes to PyTorch. But in your specific case, if you'd like to pack more data into `values`, you can use a combination of clever bit shifting, torch.cat, and other bit-twiddling PyTorch ops to pack more data. It's a trick we use quite heavily in torchao.
It's a great question! Int4 is an easy one to understand. PyTorch supports int8 but not int4, so what you can do is "pack" two int4 values into a single int8 value. You still get speedups even without hardware support because you're sending less data to the GPU, and workloads like small batch size LLM inference are memory bandwidth bound rather than compute bound. So your intuition is indeed correct: you pack the values, and before doing a matmul you "unpack" them back into int8 and then upcast to fp16 to do the matmul.
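Here's a minimal sketch of that pack/unpack trick in plain PyTorch ops (not torchao's actual kernels), assuming the values are already in the unsigned 4-bit range 0-15:

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (stored in uint8, range 0-15) into one byte."""
    assert x.shape[-1] % 2 == 0
    lo, hi = x[..., 0::2], x[..., 1::2]
    return lo | (hi << 4)          # low nibble, high nibble

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the two 4-bit values from each byte."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=-1).flatten(-2)

x = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_int4(pack_int4(x)), x)   # lossless round trip, half the bytes

# Before a matmul you would unpack, shift back to signed int4, and upcast:
w_fp16 = (unpack_int4(pack_int4(x)).to(torch.int16) - 8).to(torch.float16)
```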
Hi! I'm Mark from the PyTorch team at Meta and work on torchao. If you have any questions about the library or really anything at all about performance, don't hesitate to ask!
A minor nitpick on the copy (and even then, it might just be me): I find "97% speedup" and "50% speedup" really hard to parse; a "30x speedup" or a "97% reduction in time taken" immediately tells me what is being achieved!
Great results once I get my head around them, though!
That's why it's confusing: a "2x speedup" would clearly indicate 200% of the current speed, so it's unclear whether "97% speedup" is a multiple (presumably not, because that would be a slowdown), a reduction in time taken (which was my assumption), or an increase in speed (more of something per unit of time).
I guess you are right and it's probably the latter, but obviously better language would have avoided any doubt.
Hi Mark, the library looks cool, excited to try it out. Coincidentally I am starting work on a project investigating a lot of post-training quantization methods. I read the blog and I am curious to understand what kind of overheads are involved in quantizing a layer?
There's a bunch of overhead associated with PTQ, but the TL;DR is that much of it goes away when you're using `torch.compile()` and `torchao.autoquant()`.
Essentially the latency overhead comes from quantizing and dequantizing weights and activations. For large layers this overhead is small, because quantizing your weights, for example, reduces memory bandwidth pressure. But for small layers, the overhead of potentially looking up a table, reading scaling factors, quantizing/dequantizing, and finally handling zero points might not be worth it.
However, even if such overhead exists you can still quantize your model and make it smaller; the problem is it might not be faster. We solve the speed problem in two ways: `torch.compile()` will fuse operations like a dequant and a matmul into a single kernel, and `torchao.autoquant()` will do kernel-level profiling to see whether a layer is actually made faster by quantizing it, and if not it skips quantizing that layer.
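In practice the pattern looks roughly like this (a sketch; check the torchao README for the exact current API):

```python
import torch
import torch.nn as nn
import torchao

# Toy model standing in for a real network with a mix of layer sizes
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 64))
model = model.to(device="cuda", dtype=torch.bfloat16)

# Compile first so dequant + matmul get fused, then let autoquant benchmark
# quantized vs. unquantized kernels per layer and keep whichever is faster.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# The kernel-level profiling happens on the first real inputs you run
with torch.no_grad():
    model(torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda"))
```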
First off, well done, this looks exciting. I haven't had a chance to interact with the library yet; should torchao be seen as a dev-friendly quantization interface? I.e., if my team were working on new quantization techniques, does the API provide easy tooling for implementing and benchmarking new quantization algorithms? Or is this closer to a toolbox of (generally) finished products?
It's both! For this blog we decided to discuss our best end-user-facing numbers to keep things simple. We briefly hint at our contributor guide here https://github.com/pytorch/ao/issues/391, which gives a tour of the APIs we provide for developers implementing new algorithms.
But we have had quantization algorithm developers such as HQQ or Autoround merge their code in to get composability and serialization for free. We view quantization algorithms as the top layer; going down, you have quantized tensors, then quant primitives like dequant/quant, and finally basic dtypes like uint1-7 and float3-8. Personally, the reason I spent so much time on AO was that I was hoping we could make it easier for people to express their quantization algorithms in easy-to-read PyTorch code, and if they must use custom kernels, we also have some tutorials on how to integrate custom CUDA and Triton ops.
Most of those discussions have been happening in #torchao on discord.gg/gpumode, so if you need to chat back and forth feel free to reach out to the team there; otherwise GitHub also works.
Most of our performance relies on leveraging torch.compile, which generates Triton kernels that run fast on CPU and GPU but not on MPS, since Triton does not support generating Metal kernels. So you lose the nice story of writing low-bit code in pure PyTorch and still having it run fast.
In these cases the only path forward we have is writing custom Metal kernels and plugging those in. That work is still ongoing and we'll hopefully have more to share soon.
This might not be the right place for this question but, as someone who has made a couple of very modest MPS backend contributions, I'm curious: why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate backend?
It mostly comes down to what's fastest to develop: it's faster to write a few custom kernels than it is to build a new compiler backend.
Granted after more upfront effort compilers are just such a significant UX boost that indeed you are making me question why I don't spend more time working on this myself lol
But that's waiting for Blackwell to be released so we get the hardware support. So the recommendation for now would be to use either fp8 training or int8 training.
There are different tradeoffs: spinning up a separate repo is what we call "out of core", versus having everything in PyTorch "in core".
Basically, PyTorch is a large library where CI takes a long time to run, which means merging code is hard, adding new dependencies is challenging, and there are stringent constraints on BC-breaking changes.
Instead, what torchao (and many other repos like torchtune, torchchat, and torchtitan) did was move out of core. That helps keep the core PyTorch library leaner with a smaller binary size, and it really lets the out-of-core team focus on optimizing for their own needs.
Unfortunately, the argument for which approach is better changes over time. For example, torch.compile initially started out of core as a new repo called torchdynamo so it could move fast, but it eventually merged back in because everyone wanted to use it. torch.compile dev velocity is still quite fast, so now we have to tell people to use nightlies instead of official stable releases, which has led some people to ask me why we don't move torch.compile out of core.
My 2c is that the ecosystem will be much stronger and teams can move faster if they develop out of core, so that's the tradeoff we picked for torchao. We managed, for example, to merge a few custom CPP kernels like fp6 or Marlin that would have been challenging to motivate in core, since those are still quite experimental and need to stand the test of time.
nvtop or nvidia-smi gives you a good macro overview, but I personally have found that utilization (EDIT: as reported by nvidia-smi) is actually a poor proxy for how fast your workload can be, beyond just confirming that a GPU is indeed being used.
I agree that the utilization reported by nvidia-smi is a poor proxy for performance. FWIW, I've found that for the same architecture the power consumption reported in nvtop very often correlates nicely with training performance, and peak performance is always at peak power consumption. Agreed on your advice about tuning your architecture details, but once that's fixed and you have simple things to debug like memory usage, batch size, and dataloading bottlenecks, the raw power metric is typically a quick proxy. I find temperature is a second useful macro metric; you want to be at max power draw and max allowed temperature at all times, but not exceed the temperature where you throttle.
That's hard to argue with. Of course power draw is a direct measure of hardware utilization, but it doesn't translate very well to a measure of GPU code efficiency.
Often you can squeeze out another order of magnitude of performance by rewriting the kernel and the power draw will always stay capped at whatever the maximum is. I'd say GPU power consumption is interesting if you're CPU bound and struggling to feed the GPU enough data and/or tasks.
FLOPs utilization is arguably the industry standard metric for efficiency right now and it should be a good first approximation of how much performance is left on the table.
But if you mean the reported utilization in nvtop is misleading I completely agree (as someone who uses it daily).
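For reference, FLOPs utilization (often called MFU, for model FLOPs utilization) is usually estimated with a back-of-the-envelope calculation like the one below; the 6 * params FLOPs-per-token rule of thumb for training and the 312 TFLOP/s bf16 peak for an A100 are assumptions here, and the throughput number is made up:

```python
# Rough model-FLOPs-utilization (MFU) estimate for LLM training
params = 7e9                # model parameter count (assumed)
tokens_per_second = 3000    # measured training throughput (made-up number)
peak_flops = 312e12         # A100 bf16 dense peak, FLOP/s (assumed hardware)

# ~6 FLOPs per parameter per token covers the forward + backward passes
achieved_flops = 6 * params * tokens_per_second
mfu = achieved_flops / peak_flops
print(f"MFU: {mfu:.1%}")    # ~40% for these numbers
```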
I’ve been meaning to dig into the source/docs to see what’s going on. The power usage seems to be a more reliable indicator of actual hardware utilization, at least on nvidia gear.
Thanks. Some people were having random problems installing WSL on their systems and I found this was the easiest solution (though based on their card models, they appeared to have much older machines).
There is no need to install Docker Desktop just to run nvidia-smi in WSL; the Windows directory containing the nvidia-smi binary is mounted inside a WSL instance and added to PATH automatically by WSL on instance startup.
As an aside: there is no need to install Docker Desktop just to use Docker containers in WSL either, unless you want a Windows GUI to manage your containers. Just follow the official documentation for installing Docker in your Linux distro of choice, or simply run `sudo apt install docker.io` in the default WSL Ubuntu distro. Docker will work just fine with an up-to-date WSL.
Further aside: it's possible to have both Docker Desktop and the normal Linux docker.io package installed on WSL. They work in isolation; the easy way to know which one is active is to check whether Docker Desktop is running. I wouldn't recommend this setup, though...
This was a lot of fun to read, really enjoyed the journey from game design to product design to ML. One thing I was hoping to ask was how come you felt the need to procedurally generate levels? I've often heard debates around whether content needs to be "hand-crafted" to be worth another human's time but curious to hear your take on this since it seems like you implemented both ends of the spectrum with endless and classic.
Also, do you feel like you now have a general set of principles for procedurally generating levels that apply to games outside of Echo Chess? Puzzle games are great since they're well scoped for indie devs, and I'd imagine a lot of would-be puzzle designers would burn out before they create a few dozen to a few hundred levels.
> how come you felt the need to procedurally generate levels?
Four reasons:
(1) demand for new levels was too high - game traction exceeded my laziness threshold as a designer;
(2) puzzle lovers suffered from lack of replayability - game was effectively punishing core users who engage with it the most;
(3) designing a 'good' difficulty curve required me to quantify a level's difficulty objectively - anytime I design a puzzle or strategy game, I also try to design an algorithm that can solve it to get a general sense of how the different levels compare in difficulty;
(4) I tend to get obsessively curious about stuff - wasn't sure if it's feasible to have a 100% real-time procedural gen of chess mazes so I decided to do it to find out.
> I've often heard debates around whether content needs to be "hand-crafted" to be worth another human's time
LLM jokes aside, I think this is a great point and I'm not sure exactly what the right answer is here. I ended up keeping both Classic and Endless modes for this exact reason. If there's enough interest, I'll add a manual level generator for the community of 'humans' to submit their own hand-crafted creations for others to play.
> a general set of principles to procedurally generate levels that apply to games outside of echo chess?
Good question. I think it really comes down to this:
(a) can you formalize the concept of a 'level' and its components for your game in such a way that every game state (at a minimum) can be encoded/decoded easily and efficiently?
(b) can you parametrize this formalism so that malleable randomizations are connected to every meaningful component of a level, to generate (hypothetically) infinite variability?
(c) do you have a confident way to make sure whatever 'level' you're churning out is as playable, achievable, and as fun as the manually crafted ones? The fun part is the hardest one to automate. That's where I think you need a very strong grasp of what actually makes your puzzle game fun. All the rest is likely generalizable to other games (rough sketch below).
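Not how Echo Chess actually does it, but a generic, hypothetical illustration of (a)-(c): represent a level as plain data, tie every component to a randomizable parameter, and only keep candidates that a solver says are playable and near the target difficulty:

```python
import random
from dataclasses import dataclass

@dataclass
class Level:                 # (a) the level formalized as plain data
    width: int
    height: int
    walls: set               # blocked (x, y) cells
    pieces: dict             # (x, y) -> piece type

def random_level(rng: random.Random) -> Level:
    # (b) every meaningful component is tied to a randomizable parameter
    w, h = rng.randint(5, 8), rng.randint(5, 8)
    cells = [(x, y) for x in range(w) for y in range(h)]
    walls = set(rng.sample(cells, k=rng.randint(2, 6)))
    free = [c for c in cells if c not in walls]
    pieces = {c: rng.choice(["knight", "bishop", "rook"])
              for c in rng.sample(free, k=rng.randint(3, 6))}
    return Level(w, h, walls, pieces)

def solve(level: Level):
    # (c) stand-in for a real solver. A real implementation would search for a
    # winning path and return None if unsolvable; here we fake a difficulty
    # score from the piece count just to keep the sketch runnable.
    return float(len(level.pieces))

def generate_playable(rng, target_difficulty, max_tries=10_000):
    for _ in range(max_tries):
        candidate = random_level(rng)
        score = solve(candidate)
        if score is not None and abs(score - target_difficulty) < 1.0:
            return candidate
    return None

level = generate_playable(random.Random(0), target_difficulty=5.0)
```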
> puzzle designers would burn out before they create a few dozen-hundred levels
Am I missing something about crypto lending? Your deposit can't be FDIC insured, and if borrowers are defaulting, that means your deposit disappears. You can't bail out a crypto bank by printing crypto.
The appreciation of Bitcoin has been so staggering that it also makes getting interest back on your deposits seem rather outdated.
Bitcoin banks have been tried and have failed; I should know, the US Department of Justice sent me victim notification emails about it for years.
Take it from someone who has operated in this space long enough: beware of interest rates on crypto; the money has to come from somewhere to cover the vig.
Fractional reserve banking is the thing that is to be avoided and ultimately fails.