
There is a lot I want to do with LLMs locally, but it seems like we're still not quite there hardware-wise (well, within reasonable cost). For example, Llama's smaller models take upwards of 20 seconds to generate a brief response on a 4090; at that point I'd rather just use an API to a service that can generate it in a couple seconds.



This is very wrong. Smaller models generate responses almost instantly on a 4090. I run 3090s and easily get 30-50 tokens/second with small models. Folks with a 4090 will easily see 80 tokens/sec for a 7-8B model in Q8, and probably 120-160 for 3B models. Faster than most public APIs.


8B models are pretty fast even on something like a 3060 depending on deployment method (for example Q4 on Ollama).
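For reference, pulling and running a Q4 build is a one-liner (a sketch; the exact tag on the Ollama library may differ):

    # run an 8B model at 4-bit quantization; tokens stream as they are generated
    ollama run llama3.1:8b-instruct-q4_K_M "Explain the difference between Q4 and Q8 quantization"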


They're fast enough for me on CPU even.


What CPU and memory specs do you have?


Sure, except those smaller models are only useful for some novelty and creative tasks. Give them a programming or logic problem and they fall flat on their face.

As I mentioned in a comment upthread, I find ~30B models to be the minimum for getting somewhat reliable output. Though even ~70B models pale in comparison with popular cloud LLMs. Local LLMs just can't compete with the quality of cloud services, so they're not worth using for most professional tasks.


They might be talking about 70B


Wait, is 70b considered on the smaller side these days?


Relative to deepseek's latest, yeah


You should absolutely be getting faster responses with a 4090. But that points to another advantage of cloud services—you don't have to debug your own driver issues.


To me it's not even about performance (speed). It's just that the quality gap between cloud LLM services and local LLMs is still quite large, and seems to be increasing. Local LLMs have gotten better in the past year, but cloud LLMs have improved even more. This is partly because large companies can afford to keep throwing more compute at the problem, while quality at smaller-scale deployments is not increasing at the same pace.

I have a couple of 3090s and have tested most of the popular local LLMs (Llama3, DeepSeek, Qwen, etc.) at the highest possible settings I can run them comfortably (~30B@q8, or ~70B@q4), and they can't keep up with something like Claude 3.5 Sonnet. So I find myself just using Sonnet most of the time, instead of fighting with hallucinated output. Sonnet still hallucinates and gets things wrong a lot, but not as often as local LLMs do.

Maybe if I had more hardware I could run larger models at higher quants, but frankly, I'm not sure it would make a difference. At the end of the day, I want these tools to be as helpful as possible and not waste my time, and local LLMs are just not there yet.


This is the right way of looking at it. Unfortunately (or fortunately for us?) most people don't realize how precious their time is.


Depends on the model; if it doesn't fit into VRAM, performance will suffer. Response here is immediate (at ~15 tokens/sec) on a pair of eBay RTX 3090s in an ancient i7-3770 box.

If your model does fit into VRAM but is getting ejected between requests, there will be a startup pause each time. Try setting OLLAMA_KEEP_ALIVE to -1 to keep it loaded (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...).
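Something like this (a sketch; per the linked FAQ, -1 keeps the model resident indefinitely):

    # keep models loaded in VRAM between requests instead of unloading after the default timeout
    export OLLAMA_KEEP_ALIVE=-1
    ollama serve

    # or per request, via the API
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.3", "prompt": "hello", "keep_alive": -1}'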


>> within reasonable cost

> pair of ebay RTX 3090s

So... 1700 USD?


£1200, so a little less. Targeted at having 48GB (2x 24GB) of VRAM for running the larger models; having said that, a single 12GB RTX 3060 in another box seems pretty close in local testing (with smaller models).


If you're looking for the most bang for the buck, 2x 3060 (12GB) might be the best bet. The GPUs will be around $400-$600.


Have been trying forever to find a coherent guide on building a dual-GPU box for this purpose, do you know of any? Like selecting the motherboard, the case, cooling, power supply and cables, any special voodoo required to pair the GPUs, etc.


I'm not aware of any particular guides; the setup here was straightforward: an old motherboard with two PCIe x16 slots (Asus P8Z77V or P8Z77WS), a big enough power supply (Seasonic 850W), and the stock Linux Nvidia drivers. The RTX 3090s are basic Dell models (i.e. not OC'd gamer versions), and worth noting they only get hot if used continuously - if you're the only one using them, the fans spin up during a query and back down between. A good smoke test is something like: while true; do ollama run llama3.3 "Explain cosmology"; done

With llama3.3 70B, two RTX 3090s give you 48GB of VRAM and the model uses about 44GB; so the first start is slow (loading the model into VRAM) but after that responses are fast (subject to the comment above about KEEP_ALIVE).
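If anyone wants to sanity-check that a model actually landed fully in VRAM across both cards, something like this works (a sketch):

    # shows loaded models and whether they're 100% on GPU or partially offloaded to CPU
    ollama ps

    # watch per-card VRAM, temperature and fan speed while a query runs
    watch -n 1 nvidia-smi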


The consumer hardware channel just hasn't caught up yet -- we'll see a lot more desktop kit appear in 2025 on retail sites (there is a small site in UK selling Nvidia A100 workstations for £100K+ each on a Shopify store)

Seem to remember a similar point in late 90s and having to build boxes to run NT/SQL7.0 for local dev

Expect there will be a swing back to on-prem once enterprise starts moving faster and the legal teams begin to understand what is happening data-side with RAG, agents etc.


> Nvidia A100 workstations for £100K+ each

This seems extremely overpriced.


I've not looked properly but I think it's 4 x 80GB A100s



Even on CPU, you should get the start of a response within 5 seconds for Q4 8B-or-smaller Llama models (proportionally faster for smaller ones), which then stream at several tokens per second.

There are a lot of things to criticize about LLMs (the answer is quite likely to ignore what you're actually asking, for example) but your speed problem looks like a config issue instead. Are you calling the API in streaming mode?
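For comparison, against a local Ollama instance (assuming the default port; the model tag is just an example): streaming starts printing tokens almost immediately, while a non-streaming call returns nothing until the whole response is done, which can feel like a 20-second hang.

    # streaming (the default): tokens arrive as they are generated
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.1:8b", "prompt": "Briefly explain RAID levels"}'

    # non-streaming: nothing is returned until generation finishes
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.1:8b", "prompt": "Briefly explain RAID levels", "stream": false}'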


My gut feeling is that there may be optimization you can do for faster performance (but I could be wrong since I don't know your setup or requirements). In general on a 4090 running between Q6-Q8 quants my tokens/sec have been similar to what I see on cloud providers (for open/local models). The fastest local configuration I've tested is Exllama/TabbyAPI with speculative-decoding (and quantized cache to be able to fit more context)


You may have been running with CPU inference, or running models that don’t fit your VRAM.

I was running a 5-bit quantized model of Codestral 22B with a Radeon RX 7900 (20 GB), compiled with Vulkan only.

Eyeball only, but the prompt responses were maybe 2x or 3x slower than OpenAI's GPT-4o (maybe 2-4 seconds for most paragraph-long responses).
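For anyone wanting to reproduce that kind of setup, assuming it was llama.cpp (the comment doesn't say, and the GGUF filename below is a placeholder), the Vulkan build looks roughly like:

    # build llama.cpp with the Vulkan backend instead of CUDA/ROCm
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j

    # run a ~5-bit Codestral quant, offloading all layers to the RX 7900's 20GB
    ./build/bin/llama-cli -m codestral-22b-v0.1-Q5_K_M.gguf -ngl 99 -p "Write a Python function to merge two sorted lists"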


Yeah, many people don't understand how cheap it is to use the ChatGPT API.

Not to mention all of the other benefits of delegating all the work of setting up the GPUs, public HTTP server, designing the API, security, keeping the model up to date with the state of the art, etc.

Reminds me of the people in the 2000s / early 2010s who would build their own Linux boxes back when the platform was super unstable, constantly fighting driver issues etc. instead of just getting a Mac.

Roll-your-own-LLM makes even less sense. For those early 2000s Linux guys, even if you spent an ungodly amount of time going through the Arch wiki or compiling Gentoo or whatever, at least those skills are somewhat transferable to sysadmin/SRE work. I don't see how setting up your own instance of Ollama has any transferable skills.

The only way I could see it making sense is if you're doing super cutting-edge stuff that necessitates owning a tinybox, or if you're trying to get a job at OpenAI or Anysphere.


Depends on what you value. Many people value keeping general purpose computing free/libre and available to as many as possible. This means using free systems and helping those systems mature.


If you're in a position like mitchellh or antirez then more power to you


I would agree with that if I didn't mind handing over my prompts/data to big tech companies.

But I do.


There are similar superstitions around proprietary software. I hope one day you are able to overcome them before you waste too much of your precious time.


In what way is wanting to keep my prompts and data private superstition?



