danielhanchen's comments

I was pleasantly surprised by 140 tokens/s as well! I literally thought I did something wrong but it was real!

Oh the repetition issue is only on the non-dynamic quants :) If you use the 1.58bit dynamic quantized model instead, the repetition issue fully disappears!

min_p = 0.05 was a way I found to counteract the 1.58bit model generating the occasional incorrect token, which happens around once per 8000 tokens!


min_p is great, do you apply a small amount of temperature as well?

Btw, min_p (the paper about the sampler) got accepted to ICLR! As 4th author, it warms my heart to see it used so much in the wild.

Oh hi!! Congratulations on ICLR!!! min_p = 0.1 and temp = 1.5 are my default goto settings!!

The recommended temperature from DeepSeek is 0.6 so I leave it at that!

I think most model creators recommend temperatures as high as 0.6-0.7 in their usage examples simply because that's what a lot of the client apps use. IMO this is WAY too high unless you're doing creative writing.

Generally I set temp to 0-0.4 at absolute most.

min_p actually needs a little temperature to work effectively, so with min_p I almost always use a temp of 0.2


Ye lower temp is also good :) Tbh it's all trial and error - I found temp=1.5, min_p=0.1 to be very useful for pass@k type workloads, i.e. calling the LLM multiple times and aggregating.

temp=0 is also good for singular outputs. For classification tasks, it's better to actually inspect the logits.

But my goto is always setting min_p to at least 0.01 or 0.05! It suppresses rare incorrect random tokens from being generated, and it helps massively!
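
For anyone wondering what min_p does mechanically, here's a rough toy sketch in Python (not llama.cpp's actual code, and real samplers differ on whether temperature is applied before or after the filter):

    import numpy as np

    def sample_min_p(logits, min_p=0.05, temperature=1.0, rng=np.random.default_rng(0)):
        # Scale by temperature, then softmax into probabilities.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # min_p: discard every token whose probability is below min_p * p(top token).
        keep = probs >= min_p * probs.max()
        probs = np.where(keep, probs, 0.0)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    token = sample_min_p(np.random.randn(32000), min_p=0.1, temperature=1.5)

The point is that the cutoff scales with the top token's probability, so confident steps stay greedy-ish while uncertain steps still get diversity - which is why it pairs well with a higher temperature.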


There are a few ways - the most basic is per-layer sharding. DeepSeek uses 3 dense layers, so those can stay on GPU 0 (with the embedding layer). There are 58 MoE layers (256 experts, 8 activated) and 1 shared expert per layer. GPU 1 would house layers 3 to 9, and so on.

Then by using pipeline parallelism, if a new request comes in, we simply stick it in a queue across GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0, and so on.

The other option is tensor parallelism, where we split the weights evenly. You could combine pipeline and tensor parallelism as well!
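
A toy sketch of what that per-layer split could look like (layer counts from above; the device-map keys are made up, not real framework code):

    # Toy pipeline-parallel layout: embeddings + 3 dense layers on GPU 0,
    # the 58 MoE layers split across GPUs 1..8 (roughly 7-8 layers each).
    NUM_DENSE, NUM_MOE, MOE_GPUS = 3, 58, 8

    device_map = {"embed_tokens": 0}
    for i in range(NUM_DENSE):
        device_map[f"layers.{i}"] = 0          # dense layers stay with the embeddings

    per_gpu = -(-NUM_MOE // MOE_GPUS)          # ceil(58 / 8) = 8
    for i in range(NUM_MOE):
        device_map[f"layers.{NUM_DENSE + i}"] = 1 + i // per_gpu

    # Requests then flow GPU 0 -> 1 -> ... -> 8 like an assembly line,
    # so several requests can be in flight at different stages at once.
    print(device_map)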


Glad they were helpful! :)

:) It's my goto test :) I did amp it up by adding 10 conditions and made a scoring card - I found the original R1 to sometimes forget "import os" or miss some lines as well, so I thought it was at least a good check!

I also like to ask the models to create a simple Minecraft-type game where you can break pieces and store them in your inventory, but disallow building stuff.


I feel any AI can fix those problems when they can finally act. The problem is that AIs cannot run or debug code, or even book a hotel for me. When that is solved and an AI can interact with the code like a human does, it can fix its problems like a human does.

Exactly! Why can’t LLMs run their own code?

They can - feel free to run inference and give it an interpreter.

Rampancy.

So I remember DeepSeek used float8 for training - Character AI also used int8 for training - it is indeed possible, but sometimes training can be unstable. DeepSeek to my knowledge is actually the first lab to use float8 at a large scale without causing loss spikes - they used FP8 tensor cores, then every 4th matrix multiply they accumulated into an FP32 accumulator - it seems like the Hopper Tensor Cores' accumulation mechanism might not be actual FP32 accumulation. I wrote more here: https://x.com/danielhanchen/status/1872719599029850391
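
To picture the "promote to FP32 every 4th accumulation" idea, here's a crude simulation (bfloat16 stands in for FP8 since the real thing needs Hopper tensor cores; this is not DeepSeek's actual kernel):

    import torch

    def matmul_periodic_fp32_accum(A, B, block=128, promote_every=4):
        # Accumulate partial products in low precision, but flush into an
        # FP32 accumulator every `promote_every` blocks along K - avoiding
        # the precision loss of the tensor cores' own accumulator.
        K = A.shape[1]
        acc = torch.zeros(A.shape[0], B.shape[1], dtype=torch.float32)
        partial = torch.zeros_like(acc, dtype=torch.bfloat16)
        for i, k in enumerate(range(0, K, block)):
            partial += A[:, k:k+block].bfloat16() @ B[k:k+block].bfloat16()
            if (i + 1) % promote_every == 0:
                acc += partial.float()
                partial.zero_()
        return acc + partial.float()   # flush any remainder

    A, B = torch.randn(64, 1024), torch.randn(1024, 64)
    out = matmul_periodic_fp32_accum(A, B)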

The good thing is that since MoEs are mainly memory bound, we just need (VRAM + RAM) to be in the range of 80GB or so - in my tests that gets you at least 5 tokens/s.

It's better to get (VRAM + RAM) >= 140GB for at least 30 to 40 tokens/s, and if VRAM >= 140GB, then it can approach 140 tokens/s!

Another trick is to accept more than 8 experts per pass - it'll be slower, but might be more accurate. You could even try reducing the # of experts to say 6 or 7 for low FLOP machines!
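
The expert count is basically just the k in a top-k over the router scores, so it's an easy knob to turn - a toy sketch of the routing step (not DeepSeek's real router code):

    import torch

    def route_tokens(router_logits, num_experts_per_tok=8):
        # router_logits: [num_tokens, num_experts] scores from the gating network.
        # More experts per token costs more FLOPs but may be more accurate;
        # fewer (say 6 or 7) trades accuracy for speed on low-FLOP machines.
        weights = torch.softmax(router_logits, dim=-1)
        topk_w, topk_idx = torch.topk(weights, k=num_experts_per_tok, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize kept experts
        return topk_w, topk_idx

    logits = torch.randn(4, 256)   # 4 tokens, 256 experts (DeepSeek-R1 sized)
    w, idx = route_tokens(logits, num_experts_per_tok=6)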


Oh yes one could apply a repetition penalty for example - but it's not just repetition that's the issue. I find the model rather forgets what it already saw, and hence it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.

Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it helps especially since the 1.58bit model generates an "incorrect" token (e.g. `score := 0`) roughly once every 8000 tokens.
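
The backtracking idea, roughly: throw away the last few generated tokens and the matching KV-cache rows, then resume generation - a sketch assuming the old tuple-of-tensors cache layout with shape [batch, heads, seq, head_dim] (real cache formats vary by framework, and the repeat detection itself is left out):

    import torch

    def backtrack(generated_ids, past_key_values, n_drop):
        # Drop the last n_drop generated tokens and the matching KV-cache entries,
        # so the model can re-generate that span (e.g. with a higher min_p).
        generated_ids = generated_ids[:, :-n_drop]
        past_key_values = tuple(
            (k[:, :, :-n_drop, :], v[:, :, :-n_drop, :])
            for k, v in past_key_values
        )
        return generated_ids, past_key_values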


Hey! :) Coincidentally the seeds I always use are 3407, 3408 and 3409 :) 3407 because of https://arxiv.org/abs/2109.08203

I also tried not setting the seeds, but the results are still the same - quantizing all layers seems to make the model forget and repeat everything - I put all examples here: https://docs.unsloth.ai/basics/deepseek-r1-dynamic-1.58-bit#...


Would be great to have dynamic quants of the non-R1 V3 version, as for some tasks it is good enough. It would also be very interesting to see the degradation with dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtrals, or IBM's tiny Granite MoE. It would be fun if the Granite 1b MoE still functioned at 1.58bit.

Oh yes multiple people have asked me about this - I'll see what I can do :)

Oh yes 192GB machines should be able to run these quants (131GB for 1.58bit, 158GB for 1.73bit, 183GB for 2.22bit) well :)

Great release Daniel. Applaud the consistency you have shown.

Can you release slightly bigger quant versions? Would enjoy something that runs well on 8x32GB V100s and 8x80GB A100s.


Thanks! Oh I did release 4bit, 5bit, 6bit etc. quants, all at https://huggingface.co/unsloth/DeepSeek-R1-GGUF if that helps - they're not dynamic, but they should function fine :)
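
If it helps, one way to grab just one quant instead of the whole repo is huggingface_hub's snapshot_download - the allow_patterns value here is a guess at the folder naming, so check the repo's file listing for the exact quant you want:

    from huggingface_hub import snapshot_download

    # Download only one quant variant instead of the entire multi-quant repo.
    snapshot_download(
        repo_id="unsloth/DeepSeek-R1-GGUF",
        local_dir="DeepSeek-R1-GGUF",
        allow_patterns=["*Q4_K_M*"],   # pattern is a guess; adjust to the quant you want
    )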
