
I've always used `llamacpp -m <model> -p <prompt>`. Works great as my daily driver of Mixtral 8x7b + CodeLlama 70b on my MacBook. Do alternatives have any killer features over Llama.cpp? I don't want to miss any cool developments.


70b is probably going to be a bit slow for most on M-series MBPs (even with enough RAM), but Mixtral 8x7b does really well. Very usable @ 25-30T/s (64GB M1 Max), whereas 70b tends to run more like 3.5-5T/s.

'llama.cpp-based' generally seems like the norm.

Ollama is just really easy to set up & get going on macOS. Integral support like this means one less thing to wire up or worry about when using a local LLM as a drop-in replacement for OpenAI's remote API. Ollama also has a model library[1] you can browse & easily retrieve models from.
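For the drop-in part, a rough sketch of what that looks like, assuming a recent Ollama build with the OpenAI-compatible endpoint and a model you've already pulled (mixtral here):

    # point any OpenAI client at the local Ollama server instead of api.openai.com
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "mixtral",
        "messages": [{"role": "user", "content": "Hello"}]
      }'

Same request shape as OpenAI's chat completions, just a different base URL (and no real API key needed).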

Another project, Ollama-webui[2], is a nice web UI/frontend for local models served through Ollama - it supports the latest LLaVA for multimodal image + prompt input, too.

[1] https://ollama.ai/library/mixtral

[2] https://github.com/ollama-webui/ollama-webui


Yeah, ollama-webui is an excellent front end, and the team was responsive, fixing a bug I reported within a couple of days.

It's also possible to connect it to the OpenAI API and use GPT-4 on a per-token plan. I've since cancelled my ChatGPT subscription. But 90% of my usage is Mistral 7B fine-tunes; I rarely use OpenAI.


Thanks for that idea. I use Ollama as my main LLM driver, but I still have OpenAI, Anthropic, and Mistral commercial API plans. I access Ollama via its REST API and my own client code, but I will try their UI.
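For anyone who wants to do the same, the native REST call is about as minimal as it gets - a sketch, assuming the default port and a model you've already pulled:

    # non-streaming generation against the local Ollama server
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'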

re: cancelling the ChatGPT subscription: I'm tempted to do this as well, except I suspect that when they release GPT-5 there may be a waiting list, and I don't want any delay in trying it out.


I have found DeepSeek Coder 33B to be better than CodeLlama 70B (personal opinion, though). I think DeepSeek's biggest strength is that it understands multi-file context the best.


Same here. I run DeepSeek Coder 33B on my 64GB M1 Max at about 7-8 t/s and it blows away all the other models I've tried for coding. It feels like magic and cheating at the same time, getting these lengthy and in-depth answers with Activity Monitor showing 0 network I/O.


I tried running DeepSeek 33B using llama.cpp with 16k context and it kept injecting unrelated text. What's your setup that makes it work for you? Do you have special CLI flags or a particular prompt format?


I use the default prompt template, which is defined in the tokenizer_config: https://huggingface.co/deepseek-ai/deepseek-coder-33b-instru...

No special flags or anything, just the standard format. Do take care with the spaces and line endings. Sharing a gist of the function I use for formatting it: https://gist.github.com/theskcd/a3948d4062ed8d3e697121cabd65... (hope this helps!)
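For reference, the instruct layout in that config is roughly the Alpaca-style format below (quoting from memory, so double-check against the repo - as noted above, the exact spaces and newlines matter):

    {system line about being an AI programming assistant}
    ### Instruction:
    {your prompt}
    ### Response: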


I actually use LM Studio with the settings preset for DeepSeek that comes with it, except for mlock, which I set to keep it entirely in memory. Works really well.


How exactly do you use the LLM with multiple files? Do you copy them entirely into the prompt?


Not really copying the whole files; I work in an editor which has keyboard shortcuts for adding the relevant context from files and putting that into the prompts.

Straight up giving it large files tends to degrade performance, so you need to do some reranking on the snippets before sending them over.


Based on a day's worth of kicking tires, I'd say no -- once you have a mix that supports your workflow the cool developments will probably be in new models.

I just played around with this tool and it works as advertised, which is cool, but I'm up and running already. (For anyone reading this who, like me, doesn't want to learn all the optimization work... I'd just see which one is faster on your machine.)


With all the models I tried there was quite a bit of fiddling for each one to get the correct command-line flags and a good prompt, or at least copy-pasting some command line from HF. Seems like every model needs its own unique prompt to give good results? I guess that is what the wrappers take care of? Other than that, llama.cpp is very easy to use. I even run it on my phone in Termux, but only with a tiny model that is more entertaining than useful for anything.
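To make the fiddling concrete, a typical llama.cpp invocation for a Mistral-instruct GGUF ends up looking something like this - the filename is just an example, and the [INST] wrapper plus the sampling flags are the per-model parts you end up tweaking:

    ./main -m mistral-7b-instruct-v0.2.Q4_K_M.gguf \
      -c 4096 -n 256 --temp 0.7 \
      -p "[INST] Write a haiku about GPUs [/INST]"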


For the chat models, they're all finetuned slightly differently in their prompt format - see Llama's. So having a conversion between the OAI API that everyone's used to now and the slightly inscrutable formats of models like Llama is very helpful - though, much like LangChain and its hardcoded prompts everywhere, there's probably some subjectivity, and you may be rewarded by formatting prompts directly.
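For anyone who hasn't seen it, the Llama-2-chat convention being referred to is roughly the following, which is what a wrapper has to assemble from a plain OpenAI-style messages array (from memory - check the model card before relying on it):

    <s>[INST] <<SYS>>
    {system prompt}
    <</SYS>>

    {user message} [/INST] {assistant reply} </s><s>[INST] {next user message} [/INST]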


The slight incompatibilities in prompt formats and style are a nuisance. I have just been looking at Mistral's prompt design documentation and I now feel like I have underutilized mistral-7B and mixtral-8x7B: https://docs.mistral.ai/guides/prompting-capabilities/
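For comparison, the Mistral/Mixtral instruct format is simpler - no system block at all, just alternating turns wrapped in [INST] tags (again roughly, per their docs):

    <s>[INST] {first user message} [/INST] {assistant reply}</s>[INST] {next user message} [/INST]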


Ollama is an extremely convenient wrapper around llama.cpp.

They separate serving the heavy weights from the model definition and usage itself.

What that means is: the weights of some model, let's say Mixtral, are loaded in the server process (and kept in memory for 5 minutes by default), and you interact with it using a Modelfile (inspired by the Dockerfile). All your Modelfiles that inherit FROM mixtral will reuse those weights already loaded in memory, so you can instantly swap between different system prompts etc. - those appear as normal models to use through the CLI or UI.
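A minimal sketch of what that looks like - the name "reviewer" and the system prompt are made up; FROM and SYSTEM are the relevant Modelfile instructions:

    # Modelfile
    FROM mixtral
    SYSTEM "You are a terse code reviewer. Answer with diffs only."

Then `ollama create reviewer -f Modelfile` followed by `ollama run reviewer` gives you a new "model" that reuses the Mixtral weights already sitting in memory.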

The effect is that you get very low latency and a very good interface, both for the programming API and the UI.

PS: it's not only for Macs.

Open-weight models + llama.cpp (as Ollama) + ollama-webui = the real OpenAI.



