
I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.


I've used it too, and I'm wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it's nice that the responses start almost immediately (on my 2022 MBA with 16 GB of RAM).

Does anyone know why this would be?


I've had the opposite experience with Mixtral on Ollama, on an Intel Linux box with a 4090. It's weirdly slow. But I suspect there's something up with Ollama on this machine anyway; any model I run with it seems to have higher latency than vLLM on the same box.


You have to specify the number of layers to offload to the GPU with Ollama. By default it offloads far fewer layers than the hardware can actually handle.
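
If it helps, here's a minimal sketch of how you can pass that option through Ollama's REST API from Python. I'm assuming the usual localhost:11434 endpoint and that "num_gpu" is still the option name for GPU layer offload in your Ollama version (check the docs); the model tag and layer count are just illustrative:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mixtral:8x7b",      # illustrative model tag
            "prompt": "Why is the sky blue?",
            "stream": False,
            # number of layers to offload to the GPU; tune to fit your VRAM
            "options": {"num_gpu": 33},
        },
        timeout=600,
    )
    print(resp.json()["response"])

You can also set it persistently with something like PARAMETER num_gpu 33 in a Modelfile, if I remember the syntax right.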


To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?


MIXtral (8x)7B



