Phi-4 Bug Fixes

danielhanchen · 2025-01-10T21:17:21 1736543841

Hey HN family! I found a few bugs in Phi-4, Microsoft's latest MIT-licensed LLM, which is claimed to be on par with GPT-4o mini

1. The EOS (end-of-sequence) token should be <|im_end|>, not <|endoftext|>

2. Chat template should not auto add an assistant prompt

3. Padding token should not be EOS but <|dummy_87|>
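
To illustrate why fix 3 can matter for finetuning, here is a minimal sketch in plain Python. It assumes a data collator that masks every padding position in the labels with -100 (a common convention, but an assumption here); the token ids and the `mask_padding` helper are made up for illustration and are not taken from any library.

```python
IGNORE_INDEX = -100  # label value that the loss skips (common convention, assumed here)

def mask_padding(input_ids, pad_id):
    """Return labels where every position equal to pad_id is ignored."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Hypothetical token ids, chosen only for this example.
EOS_ID = 2      # stands in for <|im_end|>
DUMMY_PAD = 9   # stands in for a dedicated pad token such as <|dummy_87|>

# Pad token shared with EOS: the real EOS is masked along with the padding.
shared = [11, 12, 13, EOS_ID, EOS_ID, EOS_ID]
labels_shared = mask_padding(shared, pad_id=EOS_ID)

# Dedicated pad token: only the padding is masked, EOS stays in the labels.
separate = [11, 12, 13, EOS_ID, DUMMY_PAD, DUMMY_PAD]
labels_separate = mask_padding(separate, pad_id=DUMMY_PAD)

print(labels_shared)    # [11, 12, 13, -100, -100, -100]
print(labels_separate)  # [11, 12, 13, 2, -100, -100]
```

In the first case the model never receives a training signal for emitting EOS, which is one way a shared pad/EOS token can lead to generations that do not stop.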

I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth

I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...

CGamesPlay · 2025-01-11T00:38:51 1736555931

> We converted Phi-4 to Llama’s architecture for better accuracy and easier use.

What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?

danielhanchen · 2025-01-11T01:15:46 1736558146

Oh Phi-4's architecture is inspired by Llama itself, except they merged the attention matrices (Q, K, V) into 1 large matrix for better FLOP utilization, and likewise merged the gate/up matrices in the MLP.
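
A rough sketch of what "unmerging" means, in plain Python with toy sizes. It assumes the fused weight simply stacks the Q, K and V row blocks in that order; the helper names and dimensions are hypothetical and only meant to show that splitting the rows changes nothing numerically.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def split_rows(W, sizes):
    """Split a matrix into consecutive row blocks of the given sizes."""
    blocks, start = [], 0
    for size in sizes:
        blocks.append(W[start:start + size])
        start += size
    assert start == len(W), "sizes must cover every row"
    return blocks

# Toy dimensions: hidden size 2; Q has 2 output rows, K and V have 1 each.
W_qkv = [
    [1, 0],  # Q rows
    [0, 1],
    [2, 3],  # K row
    [4, 5],  # V row
]
W_q, W_k, W_v = split_rows(W_qkv, [2, 1, 1])

x = [1, 2]
fused = matvec(W_qkv, x)
unfused = matvec(W_q, x) + matvec(W_k, x) + matvec(W_v, x)  # list concatenation

print(fused)    # [1, 2, 8, 14]
print(unfused)  # [1, 2, 8, 14]
```

The same row-splitting idea would apply to a fused gate/up matrix in the MLP, with two blocks instead of three.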

Phi-3 used to use sliding window attention, but they got rid of that in Phi-4.

So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch (by unmerging the merges), and now you can "Llama-fy" Phi-4 to Llama arch.

The reason accuracy increases in finetuning is that during LoRA finetuning, you learn only 1 A matrix for the merged QKV, whilst unmerging it creates 3 A matrices (one each for Q, K and V) - this allows the model to have more freedom to learn new features.
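
As a back-of-the-envelope check of the LoRA point, here is a small sketch that counts adapter parameters for a merged versus an unmerged QKV projection. It assumes the usual LoRA shapes (A is rank x d_in, B is d_out x rank); the sizes are hypothetical and are not Phi-4's real dimensions.

```python
def lora_params(d_in, d_out, rank):
    """Parameter counts for one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return {"A": rank * d_in, "B": d_out * rank}

# Hypothetical sizes for illustration only.
D_IN, D_Q, D_K, D_V, RANK = 64, 64, 16, 16, 8

# One adapter on the merged QKV projection.
merged = lora_params(D_IN, D_Q + D_K + D_V, RANK)

# Three adapters, one per unmerged projection.
unmerged = [lora_params(D_IN, d, RANK) for d in (D_Q, D_K, D_V)]

merged_A = merged["A"]
unmerged_A = sum(p["A"] for p in unmerged)
merged_B = merged["B"]
unmerged_B = sum(p["B"] for p in unmerged)

print(merged_A, unmerged_A)  # 512 1536
print(merged_B, unmerged_B)  # 768 768
```

With these shapes the B side is the same size either way, while the A side triples: Q, K and V no longer have to share a single down-projection.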

Sn0wCoder · 2025-01-11T00:47:56 1736556476

Would guess GGUF so you can run on llama.cpp, LM Studio, etc..., but OP can hopefully clarify further for you.

danielhanchen · 2025-01-11T01:16:23 1736558183

Yep converting to Llama arch definitely makes accessibility much better - also many fast LLM serving libraries normally support Llama, so it makes it easier to port and use!

sunaookami · 2025-01-10T23:05:35 1736550335

Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.

>to be on par with GPT-4o mini

Phi is known to overfit benchmarks. It's way, way worse than that.

danielhanchen · 2025-01-11T00:41:20 1736556080

Phi-3 should be fixed as well - but yes there were bugs as well! https://x.com/danielhanchen/status/1782853167572832650

Phi-3's sliding window should be 2048 and not 2047, and they also had chat template issues - I uploaded correct versions to https://huggingface.co/unsloth/Phi-3.5-mini-instruct

throwaway314155 · 2025-01-10T23:34:09 1736552049

Anecdotally, I've been experimenting with Phi-4 the past hour or so (so, yeah, not very comprehensive) and it's certainly a strong model. Definitely better than the previous Phi models.

danielhanchen · 2025-01-11T00:43:29 1736556209

Yep Phi-4 definitely is better than Phi-3.5!

simonw · 2025-01-10T23:30:40 1736551840

Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.
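
A toy sketch of that failure mode, assuming a generation loop that stops only on the configured EOS token. The token stream and the `generate_text` helper are hypothetical and do not reflect llama.cpp internals.

```python
def generate_text(token_stream, eos_token):
    """Collect tokens until the configured EOS token appears."""
    out = []
    for tok in token_stream:
        if tok == eos_token:
            break
        out.append(tok)
    return "".join(out)

# Hypothetical model output: the model ends its turn with <|im_end|>.
stream = ["Hello", "!", "<|im_end|>", "<|endoftext|>"]

wrong = generate_text(stream, eos_token="<|endoftext|>")
right = generate_text(stream, eos_token="<|im_end|>")

print(wrong)  # Hello!<|im_end|>
print(right)  # Hello!
```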

danielhanchen · 2025-01-11T00:35:22 1736555722

Oh yes exactly! I trimmed it out now :)

The better chat template should be:

{% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
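
For readers who find the one-line Jinja hard to parse, here is a plain-Python mirror of the template above. The `render_chat` function is a hypothetical helper written for illustration, not part of any library; like the template, it skips roles other than system, user and assistant, and appends the assistant prompt only when asked.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages in the same format as the Jinja chat template."""
    parts = []
    for message in messages:
        role = message["role"]
        if role in ("system", "user", "assistant"):
            parts.append(
                "<|im_start|>" + role + "<|im_sep|>" + message["content"] + "<|im_end|>"
            )
    # The assistant prompt is added only on request, never automatically.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

messages = [{"role": "user", "content": "Hi"}]

training_text = render_chat(messages)
inference_text = render_chat(messages, add_generation_prompt=True)

print(training_text)   # <|im_start|>user<|im_sep|>Hi<|im_end|>
print(inference_text)  # <|im_start|>user<|im_sep|>Hi<|im_end|><|im_start|>assistant<|im_sep|>
```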

sroussey · 2025-01-11T01:18:21 1736558301

Can you convert to ONNX so I can try in web browser?

sroussey · 2025-01-11T01:19:37 1736558377

Would like to update this:

https://huggingface.co/spaces/webml-community/phi-3.5-webgpu

danielhanchen · 2025-01-11T01:27:09 1736558829

Oh I can probs try doing this!

danielhanchen · 2025-01-11T02:34:20 1736562860

Update: The Phi-4 team is actively working on adding all our fixes into the original model! https://huggingface.co/microsoft/phi-4/discussions/21

NooneAtAll3 · 2025-01-11T03:26:09 1736565969

Application Error

TypeError: m(...).findLast is not a function

at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)

at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)

at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)

at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)

at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)

at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)

at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)

at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)

at MessagePort.M (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235)

danielhanchen · 2025-01-11T03:40:39 1736566839

Sorry are there some issues with our website?

t1amat · 2025-01-10T23:39:38 1736552378

Daniel’s fixes to Phi-4 make it the best scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.

Unsloth is a masterpiece, keep up the great work!

danielhanchen · 2025-01-11T00:43:59 1736556239

Thanks a lot!

excerionsforte · 2025-01-11T02:54:00 1736564040

Available on Ollama already: https://ollama.com/vanilj/phi-4-unsloth

danielhanchen · 2025-01-11T02:56:11 1736564171

Oh fabulous! :)

wsintra2022 · 2025-01-11T02:06:03 1736561163

>Reddit comments show our fixes make Phi-4 inference much better

I’d like to try ‘Reddit comments show my fixes make app better’ in my next review

danielhanchen · 2025-01-11T02:28:13 1736562493

Fixed versions are also independently scored by Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

The Reddit LocalLlama community is actually pretty cool - tonnes of research actually comes from the community - for example kaiokendev's linear RoPE scaling, YaRN, NTK Aware RoPE Scaling, many LLM benchmarks - many researchers use LocalLlama to share research and discuss new stuff.

I know a lot of AI researchers use the "LocalLlama vibe check" which essentially is an anecdotal approach to LLM evaluation - ie instead of relying on Chat LMsys or LLM benchmarks, 3rd party crowd sourced vibe checks sometimes do much better.

danielhanchen · 2025-01-11T02:35:23 1736562923

As an update - the Phi-4 team is actively working on incorporating all fixes! See https://huggingface.co/microsoft/phi-4/discussions/21

lostmsu · 2025-01-10T23:11:39 1736550699

The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4

According to Microsoft MATH score should be 80.4, while both original and the "fixed" models as run by unsloth only score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.

danielhanchen · 2025-01-11T00:50:11 1736556611

Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

You can see Microsoft's own original Phi-4 scores 12.31% - I'm unsure why. My fixes at least push it to 20%.

It's possibly because HF's benchmark does "Scoring: Exact match: Was the solution generated correct and in the expected format", so a correct answer in an unexpected format would be marked wrong - that might be the issue
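
A minimal sketch of how exact-match scoring can penalize formatting rather than correctness. The `exact_match` function below is a hypothetical scorer written for illustration; it is not the leaderboard's actual implementation.

```python
def exact_match(generated, expected):
    """Score 1 only if the generated answer equals the expected string exactly."""
    return int(generated.strip() == expected.strip())

expected = "42"

# Same underlying answer, two output formats.
terse = exact_match("42", expected)
verbose = exact_match("The answer is 42.", expected)

print(terse)    # 1
print(verbose)  # 0
```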

adultSwim · 2025-01-11T00:53:18 1736556798

Are there alternatives to unsloth?

I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.

Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.

danielhanchen · 2025-01-11T01:17:48 1736558268

Multi GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but unsure on exact timelines - maybe next month!!

adultSwim · 2025-01-11T04:44:28 1736570668

Thank you!

make3 · 2025-01-11T00:27:25 1736555245

"Yes it improves performance!" proceeds to show the most unconvincing stats ever

you can probably blow on your GPU and get a similar performance change

danielhanchen · 2025-01-11T01:10:24 1736557824

I uploaded our fixed versions to https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... which show the difference in scores.

I agree it's not super convincing, so I provided anecdotal evidence as well - I'll work with the Phi-4 team to upstream these fixes!

PS for further credibility, we also fixed 8 bugs in Gemma 1 - see https://x.com/danielhanchen/status/1765446273661075609 , multiple bugs in Llama, Mistral, Qwen and other models

refulgentis · 2025-01-11T00:54:45 1736556885

I'm sorry, I don't understand what you mean. I checked the original article again too. As it stands, my understanding is you are claiming:

- blowing on a GPU (which I take to mean doing roughly nothing)

- gets roughly the same perf change

- as moving from fp16 to q4

danielhanchen · 2025-01-11T01:23:06 1736558586

Are you referring to the finetuning part?

The multiple bug fixes are separate from the finetuning sections - Unsloth itself makes finetuning 2x faster and uses 70% less memory - the bug fixes are totally detached from finetuning - ie you can take the fixed version we uploaded at https://huggingface.co/unsloth/phi-4, and use it in any framework or inference engine.

Apologies, I'm confused by the comment, sorry.

If you're questioning the credibility of the bug fixes - we fixed 8 bugs in Gemma https://x.com/danielhanchen/status/1765446273661075609, multiple bugs in Llama, Mistral, Qwen, a gradient accumulation bug https://x.com/danielhanchen/status/1846235913443262891 and much more

grumpopotamus · 2025-01-11T04:29:52 1736569792

2x faster than what?

danielhanchen · 2025-01-11T04:40:09 1736570409

Oh 2x faster and uses >70% less memory than Hugging Face + Flash Attention 2! I did a CUDA / GPU Mode talk about it here: https://www.youtube.com/watch?v=hfb_AIhDYnA Also to the PyTorch team here: https://www.youtube.com/watch?v=MQwryfkydc0 and the PyTorch Conference here: https://www.youtube.com/watch?v=PdtKkc5jB4g

danielhanchen · 2025-01-11T02:36:52 1736563012

Update - the Phi-4 team is working on adding all our fixes to the original model! https://huggingface.co/microsoft/phi-4/discussions/21

TZubiri · 2025-01-11T00:18:56 1736554736

Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.

danielhanchen · 2025-01-11T00:53:28 1736556808

Anecdotal evidence was provided to show some Redditors tested it out - but I do agree it's not correct to show that as an example - so I uploaded our fixed versions to Hugging Face's public LLM leaderboard here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... - this shows the fixes do in fact work!