Hacker News
Everything I've learned so far about running local LLMs (nullprogram.com)
147 points by zdw 77 days ago | 45 comments



"I’ve exclusively used the astounding llama.cpp. Other options exist, but for basic CPU inference — that is, generating tokens using a CPU rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In particular, no Python fiddling that plagues much of the ecosystem. On Windows it will be a 5MB llama-server.exe with no runtime dependencies"

Will definitely give llama.cpp a go, great selling point.

I've tried running both Meta Llama and GPT-2, and both relied on some complex virtualization toolchain of either Docker or a thing called conda, and the dependency list was looong; any issue at any point blocked everything. I tried on 3 machines, and in a whole day, as a somewhat senior dev, I couldn't get it running.


Yes! That's so awesome about llama.cpp. Just get the github repo, fire up your c++ compiler toolchain and not even a minute later...you have a set of tools to do some serious AI shenanigans!

Even adding CUDA capabilities is, although somewhat involved, pretty easy.


As someone who has been running llama.cpp for 2-3 years now (I started with RWKV v3 on Python, previously one of the most accessible models thanks to both CPU and GPU support and the ability to run on older small GPUs, even Kepler-era 2GB cards!), I felt the need to point out that needing only the llama.cpp binaries, at only 5MB, is ONLY true for CPU inference using pre-converted/quantized models. If you are getting a raw trained model from Meta, RWKV, THUD, Bytedance, Microsoft, Alibaba, or any of the big companies releasing open-weight (but generally not open-source) models to the public, they WILL require Python, torch, and dozens to hundreds of prerequisite Python modules in order to run the convert.py script to produce an output model.

Should you wish to convert a model yourself and have enough disk space, convert to BF16 for the majority of models (exceptions apply for models natively trained in FP32, FP16, or 1/1.58-bit formats), then run llama-quantize on that BF16 file to create each quantized model. Quantizing from a single high-precision file minimizes conversion losses and lets you pick the accuracy vs. performance vs. space trade-off that makes the most sense for you.
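The two-step workflow described above (convert once to a high-precision GGUF, then quantize from that one file) could be sketched in Python roughly as follows. Treat it as a sketch under assumptions: the script name `convert_hf_to_gguf.py`, its `--outtype`/`--outfile` flags, and the `llama-quantize <input> <output> <type>` argument order match recent llama.cpp checkouts as far as I know, but may differ in yours (the comment above calls the script convert.py). The directory and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def conversion_commands(model_dir, out_dir, quant_types=("Q8_0", "Q4_K_M")):
    """Build the commands: one BF16 conversion, then one quantize per type."""
    out_dir = Path(out_dir)
    bf16 = out_dir / "model-bf16.gguf"
    # Step 1: convert the raw weights to a BF16 GGUF (needs Python + torch).
    # Script name and flags are assumptions; check your llama.cpp checkout.
    commands = [[
        "python", "convert_hf_to_gguf.py", str(model_dir),
        "--outtype", "bf16", "--outfile", str(bf16),
    ]]
    # Step 2: derive every quantized file from the same BF16 file, so no
    # quantization is ever stacked on top of another one.
    for qtype in quant_types:
        quantized = out_dir / f"model-{qtype}.gguf"
        commands.append(["llama-quantize", str(bf16), str(quantized), qtype])
    return commands


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in conversion_commands("my-model", "out"):  # hypothetical paths
        print(" ".join(cmd))
```

Printing the commands first, and only then passing them to `run_all`, makes it easy to check the paths before committing tens of gigabytes of disk space.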

As far as models go, Mistral-2-Large, GLM-4 variants, Mistral-Nemo-8B are my current non-multimodal favorites. llama.cpp doesn't currently support multimodal models unless you use one of the various forks using it as the inference backend due to issues embedding the image tokens in the llama-server implementation. The three models listed have most recently given the most personality when asked to play Colossus (M2L), the best translation between multiple languages while maintaining consistency between translations (GLM-4), and the most obscure code knowledge and annotation capabilities (Mistral-Nemo-8B with CodeGeeX-4-9B, a GLM-4 finetune as a close second). The last two models both were able to answer questions on 16 bit DOS C programming, near and far pointers, and even give assembly examples, although you have to specify very carefully to only emit 8086 or pre-80386 assembly mnemonics to avoid them using e?x variants of ?x registers.

May this comment prove illuminating for one searching for light.


I feel personally attacked ; ) I'm also building a tool to run/build on top of LLMs (https://github.com/singulatron/superplatform) and I opted for containers too. TBF I'm mostly targeting backend developers (who am I kidding, I'm mostly building this for myself).

The desktop version has its own configuration management software to install docker or WSL and all the dependencies you talk about, so I feel your pain.

/self plug


And, while your project looks quite cool, it's way too much and too complicated for someone who just wants to start playing around with LLMs and the various text models you can get from sources like huggingface, while being somewhat in charge of getting the tools and compiling them on their own.


Having looked at your project, what would you say is the difference in ability or philosophy compared to Open Web UI or FlowiseAI? Or is this "I want to build this because I want to"? There is nothing wrong with that.


IMO the simplest option is llamafile (it is multiplatform using "cosmopolitan" lib so should run on Windows too, but I haven't tried)

    wget https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile/resolve/main/Llama-3.2-1B-Instruct.Q6_K.llamafile
    chmod +x Llama-3.2-1B-Instruct.Q6_K.llamafile
    ./Llama-3.2-1B-Instruct.Q6_K.llamafile --server


It has a webui, but this is how I use it from python (sorry I like python, but similar connection method should work from the other langs too).

    import asyncio
    import openai

    async def main():
        # The local server speaks the OpenAI chat API; the key is a placeholder.
        ai = openai.AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
        response = await ai.chat.completions.create(
            messages=[
                {"role": "system", "content": "..."}, {"role": "user", "content": "..."},
            ],
            max_tokens=100,
            model="Llama-3.2-1B-Instruct.Q6_K.gguf",
        )
        return response.choices[0].message.content

    content = asyncio.run(main())


there's also ollama, which I haven't used much yet. they used to have llama.cpp as the only backend, but it appears they've now started to include their own code.


Ollama is kind of ok to get started, but as I understand it they don't give you a choice in the quantisation you'll use. Please correct me if I'm wrong.

One thing I am sure about is that they store large model files renamed to a long globally unique identifier, and I still haven't understood that part of the design as anything but some silly obfuscating embrace... And here again, I'd love to be shown how I'm wrong.


All in the name of UX. It’s modeled after Docker, so it defaults to doing things that way. Really does make for great ease of use, imo.


you can, when you search for a model on the ollama website there is a drop down that lets you select a “tag”. Sort of like a docker container tag. This lets you pick the quantization you want.

example: https://ollama.com/library/llama3.2/tags


You can choose the quantization by appending the right tag to the model name, but they don't support other more advanced useful features (e.g. you need a special flag to enable flash attention and you cannot use KV cache quantization for large contexts).


yes, with Llama3.2 vision they are diverging from llama.cpp backend only.


If I understood correctly they just added image preprocessing and then they feed that into llama.cpp...


As a retro PC hobbyist I loved this line:

> Just for fun, I ported llama.cpp to Windows XP and ran a 360M model on a 2008-era laptop. It was magical to load that old laptop with technology that, at the time it was new, would have been worth billions of dollars.

I wonder how difficult it would be to compile modern C++ on XP. I may give it a shot and reach out to the author if needed! :)


To be fair the technology is worth billions today, too. Zuckerberg and Meta have scored some points in my book for this open source push.


Although I like the article, the author doesn't acknowledge the way that (I think) most people are using llama.cpp at this point. ollama.com has simplified his work into two lines:

    curl -fsSL https://ollama.com/install.sh | sh
    ollama run llama3.2


llamafile simplified it even further - you just download and run it :)


https://github.com/Mozilla-Ocho/llamafile

Wait.. there is one binary that executes just fine on half a dozen platforms? What wizardry is this?

edit: Their default LLM worked great on Windows. Fast inference on my 2080ti. You have to pass "-ngl 9999" to offload onto GPU or it runs on CPU. It is multi-modal too.


cosmopolitan libc, by Justine Tunney (jart here on HN.) How have you missed this?


That's wild. Thank you. I don't know how I missed this.


I have a much more condensed "sysadmin" experience, which I'd summarize as:

- I have found some personal use cases, but none of the LLMs I've found really work for them;

- those who publish LLM-like software (the same goes for SD and similar image tools) have no interest in FLOSS: they simply push unstructured code, monsters with unmanaged deps and next to zero documentation. It seems more of an OSS-enterprise trend than FLOSS.

Long story short: my personal use case is finding hard-to-find notes (org-mode, maaaaany headings, much of it annotated news in various languages), where "hard to find" means "if I recall correctly I noted ~$something, but I still fail to find it both by looking through headings (let's say titles) and by ripgrepping brutally", and spotting trends: "I've noted various natural phenomena over the last few years, so what is the trend in noted floods, wildfires, ...?". In all cases I've managed quicker and better with plain org-roam-node-find (i.e. looking at titles), possibly with embark on the results, or with rg (+embark).

That's it. They might be useful, like Alphabet's NotebookLM, for quickly getting a clue about a PDF someone sent me, but so far I've found nothing interesting that doesn't demand more time packaging and updating the project and its deps on my desktop than simply skimming papers myself...


I'm designing a new PC and I'd like to be able to run local models. It's not clear to me from posts online what the specs should be. Do I need 128GB of RAM? Or would a 16GB RTX 4060 be better? Or should I get a 4070 Ti? If anyone could point me toward some good guidelines I'd greatly appreciate it.


The 4070ti 16gb would be better as it has a 256bit bus compared to the 4060ti 16gb which, as has been mentioned, only has a 128bit bus. The 4070ti also has more Tensor cores than the 4060ti.

Get as much VRAM as you can afford.

NVIDIA is also releasing new cards, the RTX 50 series, starting in late January 2025.


The answer is that it depends which models you want to run. I'd get as much VRAM on your GPU as possible. Once that runs out, it'll start using your system RAM.

Some good info here if you dig around:

https://www.reddit.com/r/LocalLLaMA/


You can run local models on a 10 year old laptop. As always the answer is "it depends".

The things you need: memory bandwidth, memory capacity, compute. The more of each the better. The 4060 generally has very poor bandwidth (worse than the 3060) due to its limited bus, but being able to offload more is still generally better.

32GB systems can load 8B models at fp16, 12B at 8 bits, 30B at 4 bits, 70B at 2 bits (roughly speaking). 64GB would be a good minimum if you want to use 70B at 4 bits. Without significant offloading it will be very slow though.
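The rough sizes above follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. A minimal sketch of that estimate is below. It counts the weights only, so it ignores the KV cache, the OS, and the fact that real quantized files usually carry slightly more than their nominal bits per weight.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # The model sizes and bit widths mentioned in the comment above.
    for params, bits in [(8, 16), (12, 8), (30, 4), (70, 2), (70, 4)]:
        gb = weight_memory_gb(params, bits)
        print(f"{params}B at {bits} bits: ~{gb:g} GB of weights")
```

The first four combinations land between 12 and 17.5 GB, which leaves headroom on a 32GB machine, while 70B at 4 bits needs about 35 GB for the weights alone, hence the 64GB suggestion.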

If you want to process long contexts in a decent amount of time it's best to run models with flash attention which requires you to have the KV cache on the GPU. It also lets you use 4 bit cache, which quadruples the amount of context you can fit.
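To see why a 4-bit cache fits about four times the context of a 16-bit one, here is a back-of-the-envelope sketch. The formula (keys plus values, per layer, per KV head, per token) is the standard one for transformer attention; the model shape and the 4 GiB budget are assumptions chosen for illustration, and real 4-bit cache formats store a little extra for scaling factors, so the true gain is slightly under 4x.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bits_per_element):
    """Keys plus values for every layer and every token in the context."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, bits_per_element):
    """How many tokens of cache fit in a given memory budget."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bits_per_element)
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head dimension 128.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    budget = 4 * 1024**3  # 4 GiB set aside for the cache (hypothetical)
    for bits in (16, 8, 4):
        tokens = max_context(budget, bits_per_element=bits, **shape)
        print(f"{bits}-bit cache: ~{tokens} tokens")
```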


GPU VRAM is the bottleneck currently, check out r/localLlama for benchmarks and calculators for what models can fit into what cards approximately


Options:

A) 128GB RAM with the fastest Intel/AMD CPU, no GPU: you can run big/good models, but very slow (about 0.5 to 3 tokens/second)

B) Fastest Mac with 128GB/192GB: you can run big/good models with moderate speed (like 5-10 tokens/second)

C) 16/32GB RAM + RTX 4090 with 24GB VRAM: you can run smaller (but still good) models very fast - completely in VRAM (20-30 tokens/second)
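The ordering of these options can be motivated with a common rule of thumb: when generating one token at a time, the whole model has to be read from memory once per token, so memory bandwidth divided by model size gives a ceiling on tokens/second. The sketch below applies that rule. The bandwidth figures and model sizes are illustrative assumptions, not measurements, and real speeds come in below the ceiling.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound: each generated token reads all the weights once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # (label, assumed memory bandwidth in GB/s, assumed model size in GB)
    setups = [
        ("A) desktop CPU, dual-channel RAM", 80, 35),
        ("B) high-end Mac, unified memory", 800, 35),
        ("C) 24GB GPU, model fully in VRAM", 1000, 16),
    ]
    for name, bandwidth, size in setups:
        ceiling = decode_tokens_per_second_ceiling(bandwidth, size)
        print(f"{name}: at most ~{ceiling:.0f} tokens/second")
```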


You can create a https://tinygrad.org/#tinybox clone with multiple GPUs.


There are a lot more applications than the few described at the end of the article. Even smaller models can handle many useful tasks: editing text, summarizing (not too long) documents, writing reasonable emails, expanding on existing text, adding details to a document, changing the turn of phrase, imitating someone's writing style... and more!

RAG is a very difficult topic. A basic RAG will just be crap, and fail to answer questions properly most of the time. However, once you accumulate techniques to improve beyond the baseline, it can become something very similar to a very proficient assistant on a specific domain (assuming you indexed the files of interest), and it doubles as a local search engine.

LLMs have many limitations, but once you understand their constraints, they can still do a LOT.


Could you elaborate on your opinions on RAG? My impression is that RAG is the industry's magic bullet for all the downsides and challenges posed by LLMs.


Sure. RAG is an effective tool to make up for limited context lengths and ensure you have relevant information to answer specific questions. But it is not a magic bullet, because the accuracy you get from RAG is very low by default. You can verify this by building a set of questions with their ground-truth answers and checking how often your RAG gets to the truth, or close enough. A vanilla RAG system without much added work will hover around 50% accuracy or less, and even lower if you focus only on complex questions that require quite a few different chunks to get to the real answer.

Overall, you don't know how good your RAG is until you test it extensively.
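The test described above (a set of questions with ground truths, scored against the RAG's answers) could look roughly like this. It is a minimal sketch: `answer_fn` stands in for whatever RAG pipeline you have, the questions are hypothetical, and the substring check is a deliberately crude notion of "close enough" that a real evaluation would replace with something stricter or with human review.

```python
def evaluate_rag(answer_fn, test_set, is_correct):
    """Return the fraction of questions whose answer matches the ground truth."""
    if not test_set:
        raise ValueError("test_set must contain at least one question")
    hits = 0
    for question, truth in test_set:
        answer = answer_fn(question)
        if is_correct(answer, truth):
            hits += 1
    return hits / len(test_set)


def contains_truth(answer, truth):
    """Crude check: the ground-truth string appears in the answer."""
    return truth.lower() in answer.lower()


if __name__ == "__main__":
    # Hypothetical questions and a stand-in for a real RAG pipeline.
    test_set = [
        ("Which file format does llama.cpp load?", "GGUF"),
        ("Which flag offloads layers to the GPU?", "-ngl"),
    ]

    def fake_rag(question):
        return "llama.cpp loads GGUF files."

    accuracy = evaluate_rag(fake_rag, test_set, contains_truth)
    print(f"accuracy: {accuracy:.0%}")
```

Keeping the scoring function as a parameter makes it easy to report accuracy separately for simple questions and for the complex multi-chunk ones.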


Llama.cpp has Vulkan text generation, so you can use the GPU without any special drivers.

It's at least 10x faster than CPU generation and turns small models (up to 7B parameters) into an experience as fast as any of the commercial services.


Thanks for the wonderful article.

I’ve tried running models locally. I found that colocating the models on my computer/laptop took up so many resources that it impacted my work. My solution is to run the models on my home servers, since they can be served via HTTP, then VPN into my home network to access them if I’m on the road. That actually works well and it’s scalable.


> Inference starts at a comfortable 30 t/s
is this including the context? context: 1000t and instruction: 20t takes (1020/30 s)? or 20/30 s?

> Second, LLMs have goldfish-sized working memory. ... In practice, an LLM can hold several book chapters worth of comprehension “in its head” at a time. For code it’s 2k or 3k lines (code is token-dense).
That's not exactly goldfish-sized and in fact very useful already.

> Third, LLMs are poor programmers. At best they write code at maybe an undergraduate student level who’s read a lot of documentation.
Exactly what I want for local code generation.

I think he's anti-hyping a little by pretending LLMs are in fact _not_ super-intelligent and whatnot. Sure, some people believe that, but come on ... we're not at a McKinsey workshop here.

---

Any good German language models out there?


> There are tools like retrieval-augmented generation and fine-tuning to mitigate it… slightly.

On one hand, the imminent arrival of J.A.R.V.I.S. makes me wish I'd digitized more of my personal life. Keeping a daily journal for the past couple decades would have been an amazing corpus for training an intelligent personal LLM.

On the other hand, I often feel like I dodged a bullet by being born just before the era of social-media oversharing, meaning that not all evidence of my life is already online. I've assumed since Her came out that such a product would require giving up all your privacy to Big Tech, Inc.

Articles like this give me hope that there will someday be competent digital second brains that we can run entirely locally, and that it might be time to start that journal... but only in Notepad.


Nice article!

I'm always in doubt: on a Windows computer without a powerful GPU, is it better to run the local models in WSL2 or directly in Windows? Does the fact that it is an ARM machine make any difference?


I found WSL2 to have similar performance to native Windows. The only slowdown is loading files from the Windows side via /mnt/ which goes across the VM boundary. Move the files to the Linux side and it’s good.


Is it hard to just benchmark it each way and find out objectively yourself?


To the author:

> which smashes the Turing test and can be .

Looks like an incomplete sentence?


"... and can be used for everything from writing poetry and code to analyzing complex data and solving mathematical problems."

Provided by claude-3-sonnet-20241022.

P.S. This was generated with only the paragraph where the sentence appears as context. The output would likely have been different had the entire text been considered, especially the claim that LLMs can produce code, a point on which the author considers LLM output to be weak.


Good write-up.

Anyone want to give a definition of GGUF without using IBM's definition (also appears first in my search results)?


I was under the impression that it was simply the file format used by llama.cpp and ggml, name inspired by the name of the author (https://github.com/ggerganov): https://github.com/ggerganov/ggml/blob/master/docs/gguf.md

He prefixes everything with “gg” (his initials).

EDIT: Confirmed: https://github.com/ggerganov/ggml/issues/220

The UF stands for Unified Format.


Flash attention is an additional, later technique for accelerating the computation of attention. That's why it's not the default option in llama.cpp.



