
There is a lot I want to do with LLMs locally, but it seems like we're still not quite there hardware-wise (well, within reasonable cost). For example, Llama's smaller models take upwards of 20 seconds to generate a brief response on a 4090; at that point I'd rather just use an API to a service that can generate it in a couple seconds.



This is very wrong. Smaller models generate responses almost instantly on a 4090. I run 3090s and easily get 30-50 tokens/second with small models. Folks with a 4090 will easily see 80 tokens/sec for a 7-8B model in Q8, and probably 120-160 for 3B models. Faster than most public APIs.


8B models are pretty fast even on something like a 3060 depending on deployment method (for example Q4 on Ollama).
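For reference, pulling and running a Q4 build is a one-liner (a sketch; the exact tag on the Ollama library may differ):

    # run an 8B model at 4-bit quantization; tokens stream as they are generated
    ollama run llama3.1:8b-instruct-q4_K_M "Explain the difference between Q4 and Q8 quantization"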


They're fast enough for me on CPU even.


What CPU and memory specs do you have?


Sure, except those smaller models are only useful for some novelty and creative tasks. Give them a programming or logic problem and they fall flat on their face.

As I mentioned in a comment upthread, I find ~30B models to be the minimum for getting somewhat reliable output. Though even ~70B models pale in comparison with popular cloud LLMs. Local LLMs just can't compete with the quality of cloud services, so they're not worth using for most professional tasks.


They might be talking about 70B


Wait, is 70b considered on the smaller side these days?


Relative to deepseek's latest, yeah


You should absolutely be getting faster responses with a 4090. But that points to another advantage of cloud services—you don't have to debug your own driver issues.


To me it's not even about performance (speed). It's just that the quality gap between cloud LLM services and local LLMs is still quite large, and seems to be increasing. Local LLMs have gotten better in the past year, but cloud LLMs have improved even more. This is partly because large companies can afford to keep throwing more compute at the problem, while quality at smaller-scale deployments is not increasing at the same pace.

I have a couple of 3090s and have tested most of the popular local LLMs (Llama3, DeepSeek, Qwen, etc.) at the highest possible settings I can run them comfortably (~30B@q8, or ~70B@q4), and they can't keep up with something like Claude 3.5 Sonnet. So I find myself just using Sonnet most of the time, instead of fighting with hallucinated output. Sonnet still hallucinates and gets things wrong a lot, but not as often as local LLMs do.

Maybe if I had more hardware I could run larger models at higher quants, but frankly, I'm not sure it would make a difference. At the end of the day, I want these tools to be as helpful as possible and not waste my time, and local LLMs are just not there yet.


This is the right way of looking at it. Unfortunately (or fortunately for us?) most people don't realize how precious their time is.


Depends on the model; if it doesn't fit into VRAM, performance will suffer. Response here is immediate (at ~15 tokens/sec) on a pair of eBay RTX 3090s in an ancient i7-3770 box.

If your model does fit into VRAM but is getting ejected between requests, there will be a startup pause each time. Try setting OLLAMA_KEEP_ALIVE to -1 to keep it loaded (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...).
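Something like this (a sketch; per the linked FAQ, -1 keeps the model resident indefinitely):

    # keep models loaded in VRAM between requests instead of unloading after the default timeout
    export OLLAMA_KEEP_ALIVE=-1
    ollama serve

    # or per request, via the API
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.3", "prompt": "hello", "keep_alive": -1}'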


>> within reasonable cost

> pair of ebay RTX 3090s

So... 1700 USD?


£1200, so a little less. Targeted at having 48GB (2x 24GB) of VRAM for running the larger models; having said that, a single 12GB RTX 3060 in another box seems pretty close in local testing (with smaller models).


If you're looking for the most bang for the buck, 2x 3060 (12GB) might be the best bet. The GPUs will be around $400-$600.


Have been trying forever to find a coherent guide on building a dual-GPU box for this purpose, do you know of any? Like selecting the motherboard, the case, cooling, power supply and cables, any special voodoo required to pair the GPUs, etc.


I'm not aware of any particular guides; the setup here was straightforward: an old motherboard with two PCIe x16 slots (Asus P8Z77V or P8Z77WS), a big enough power supply (Seasonic 850W), and the stock Linux Nvidia drivers. The RTX 3090s are basic Dell models (i.e. not OC'd gamer versions), and worth noting they only get hot if used continuously - if you're the only one using them, the fans spin up during a query and back down between. A good smoke test is something like: while true; do ollama run llama3.3 "Explain cosmology"; done

With llama3.3 70B, two RTX 3090s give you 48GB of VRAM and the model uses about 44GB; so the first start is slow (loading the model into VRAM) but after that responses are fast (subject to the comment above about KEEP_ALIVE).
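If anyone wants to sanity-check that a model actually landed fully in VRAM across both cards, something like this works (a sketch):

    # shows loaded models and whether they're 100% on GPU or partially offloaded to CPU
    ollama ps

    # watch per-card VRAM, temperature and fan speed while a query runs
    watch -n 1 nvidia-smi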


The consumer hardware channel just hasn't caught up yet -- we'll see a lot more desktop kit appear in 2025 on retail sites (there is a small site in UK selling Nvidia A100 workstations for £100K+ each on a Shopify store)

Seem to remember a similar point in late 90s and having to build boxes to run NT/SQL7.0 for local dev

Expect there will be a swing back to on-prem once enterprise starts moving faster and the legal teams begin to understand what is happening data-side with RAG, agents etc.


> Nvidia A100 workstations for £100K+ each

This seems extremely overpriced.


I've not looked properly but I think it's 4 x 80GB A100s



Even on CPU, you should get the start of a response within 5 seconds for Q4 8B-or-smaller Llama models (proportionally faster for smaller ones), which then stream at several tokens per second.

There are a lot of things to criticize about LLMs (the answer is quite likely to ignore what you're actually asking, for example) but your speed problem looks like a config issue instead. Are you calling the API in streaming mode?
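For comparison, against a local Ollama instance (assuming the default port; the model tag is just an example): streaming starts printing tokens almost immediately, while a non-streaming call returns nothing until the whole response is done, which can feel like a 20-second hang.

    # streaming (the default): tokens arrive as they are generated
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.1:8b", "prompt": "Briefly explain RAID levels"}'

    # non-streaming: nothing is returned until generation finishes
    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.1:8b", "prompt": "Briefly explain RAID levels", "stream": false}'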


My gut feeling is that there may be optimization you can do for faster performance (but I could be wrong since I don't know your setup or requirements). In general on a 4090 running between Q6-Q8 quants my tokens/sec have been similar to what I see on cloud providers (for open/local models). The fastest local configuration I've tested is Exllama/TabbyAPI with speculative-decoding (and quantized cache to be able to fit more context)


You may have been running with CPU inference, or running models that don’t fit your VRAM.

I was running a 5-bit quantized model of Codestral 22B with a Radeon RX 7900 (20 GB), compiled with Vulkan only.

Eyeball only, but the prompt responses were maybe 2x or 3x slower than OpenAI's GPT-4o (maybe 2-4 seconds for most paragraph-long responses).
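For anyone wanting to reproduce that kind of setup, assuming it was llama.cpp (the comment doesn't say, and the GGUF filename below is a placeholder), the Vulkan build looks roughly like:

    # build llama.cpp with the Vulkan backend instead of CUDA/ROCm
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j

    # run a ~5-bit Codestral quant, offloading all layers to the RX 7900's 20GB
    ./build/bin/llama-cli -m codestral-22b-v0.1-Q5_K_M.gguf -ngl 99 -p "Write a Python function to merge two sorted lists"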


Yeah, many people don't understand how cheap it is to use the ChatGPT API.

Not to mention all of the other benefits of delegating all the work of setting up the GPUs, public HTTP server, designing the API, security, keeping the model up to date with the state of the art, etc.

Reminds me of the people in the 2000s / early 2010s who would build their own Linux boxes back when the platform was super unstable, constantly fighting driver issues etc. instead of just getting a Mac.

Roll-your-own-LLM makes even less sense. For those early 2000s Linux guys, even if you spent an ungodly amount of time going through the Arch wiki or compiling Gentoo or whatever, at least those skills are somewhat transferable to sysadmin/SRE work. I don't see how setting up your own instance of Ollama has any transferable skills.

The only way I could see it making sense is if you're doing super cutting-edge stuff that necessitates owning a tinybox, or if you're trying to get a job at OpenAI or Anysphere.


Depends on what you value. Many people value keeping general purpose computing free/libre and available to as many as possible. This means using free systems and helping those systems mature.


If you're in a position like mitchellh or antirez then more power to you


I would agree with that if I didn't mind handing over my prompts/data to big tech companies.

But I do.


There are similar superstitions around proprietary software. I hope one day you are able to overcome them before you waste too much of your precious time.


In what way is wanting to keep my prompts and data private superstition?



