
I’ve been using Ollama with Mixtral-7B on my MBP for local development and it has been amazing.


I've used it too, and I'm wondering why it starts responding so much faster than other similar-sized models I've tried. It doesn't seem quite as good as some of the others, but it's nice that the responses start almost immediately (on my 2022 MBA with 16 GB of RAM).

Does anyone know why this would be?


I've had the opposite experience with Mixtral on Ollama, on an Intel Linux box with a 4090. It's weirdly slow. But I suspect there's something up with Ollama on this machine anyway; any model I run with it seems to have higher latency than vLLM on the same box.


You have to specify the number of layers to offload to the GPU with Ollama. By default it offloads far fewer layers than the hardware can actually handle.
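
If it helps, here's a minimal sketch of how you can pass that option through Ollama's REST API from Python. I'm assuming the usual localhost:11434 endpoint and that "num_gpu" is still the option name for GPU layer offload in your Ollama version (check the docs); the model tag and layer count are just illustrative:

    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mixtral:8x7b",      # illustrative model tag
            "prompt": "Why is the sky blue?",
            "stream": False,
            # number of layers to offload to the GPU; tune to fit your VRAM
            "options": {"num_gpu": 33},
        },
        timeout=600,
    )
    print(resp.json()["response"])

You can also set it persistently with something like PARAMETER num_gpu 33 in a Modelfile, if I remember the syntax right.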


To clarify - did you mean Mixtral (8x)7b, or Mistral 7b?


MIXtral (8x)7B



