This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single-socket Epyc server motherboard with 512GB of RAM.
This X thread [1] runs the 671B model at the original Q8 at 6-8 TPS for $6K, using a dual-socket Epyc server motherboard with 768GB of RAM. I think this could be made cheaper by getting slower RAM, but since this is RAM-bandwidth limited, that would likely reduce TPS. I'd be curious whether this would just be a linear slowdown proportional to the RAM speed, or whether CAS latency plays into it as well.

[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...
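A back-of-envelope sketch of the bandwidth ceiling, and of why the scaling should be roughly linear in transfer rate (the numbers below are my assumptions, not specs from that build): decode on an MoE model is dominated by streaming the active expert weights once per token, so tokens/s is bounded by memory bandwidth divided by active-weight bytes, and per-channel bandwidth is just the transfer rate times 8 bytes.

    # Rough ceiling for RAM-bandwidth-bound decode, and how it scales with DIMM
    # speed. All numbers are assumptions, not taken from the build: dual socket,
    # 12 DDR5 channels per socket, ~37B active params at ~1 byte/param (Q8).

    def peak_bw_gbs(sockets: int, channels: int, mt_per_s: int) -> float:
        # each DDR channel moves 8 bytes per transfer
        return sockets * channels * mt_per_s * 8 / 1e3

    gb_per_token = 37.0  # ~37B active params x 1 byte/param at Q8

    for mts in (4000, 4800, 5600):
        bw = peak_bw_gbs(sockets=2, channels=12, mt_per_s=mts)
        print(f"DDR5-{mts}: ~{bw:.0f} GB/s peak -> <= {bw / gb_per_token:.0f} tok/s ceiling")

Since these are long sequential streams, CAS latency should matter very little; to first order only the transfer rate and channel count do, so the ceiling scales roughly linearly with DIMM speed. The observed 6-8 TPS sits well under these theoretical ceilings, so presumably part of the gap is software/NUMA overhead rather than raw bandwidth.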
I've been running the Unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (It looks like it needs ~40GB of actual memory that it can't easily mmap from disk.) With a Samsung 970 Evo Plus, that gave me 2.5GB/s read speed.
That came out at 0.15 tps. Not bad for completely underspecced hardware.
Given the model has so few active parameters per token (~40B), it's likely that just being able to hold it all in memory would remove the largest bottleneck.
I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most ~1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
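Continuing the napkin math (assumed numbers, not measurements): if every token has to pull the active parameters across whatever link the weights sit behind, the link bandwidth gives a hard upper bound on tokens/s. With the ~200GB quant that's roughly 0.3 bytes per parameter, or ~11GB per token.

    # Rough upper bounds on tokens/s when each token has to stream the active
    # expert weights over some link. Assumed numbers (mine, not measured):
    # ~37B active params (the thread rounds to ~40B), and a ~200GB file for a
    # 671B-param model, i.e. ~0.3 bytes/param after dynamic quantization.

    active_params = 37e9
    bytes_per_param = 200 / 671                            # file size / total params
    gb_per_token = active_params * bytes_per_param / 1e9   # ~11 GB per token

    links_gbs = {
        "Samsung 970 Evo Plus sequential read": 2.5,
        "PCIe 4.0 x16 (theoretical)": 32.0,
        "dual-channel DDR4-3200 (theoretical)": 51.2,
    }
    for name, bw in links_gbs.items():
        print(f"{name}: ~{bw / gb_per_token:.2f} tok/s upper bound")

That puts the SSD bound at roughly 0.2 tps (close to the observed 0.15), the PCIe 4.0 x16 bound around 2-3 tps, and ordinary dual-channel DDR4 noticeably ahead of both, which matches the intuition that CPU inference from RAM beats streaming weights to the GPU every token.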
To add another data point, I've been running the 131GB (140GB on disk) 1.58-bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed: around 0.15 tps on average, sometimes dropping to 0.11 tps, occasionally going up to 0.16 tps. Roughly half your specs, a roughly half-size quant, same tps.
I had to disable the overload safeties in LM Studio and tweak some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!
I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.
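A quick sketch of why a 4-layer offload is invisible in the numbers (assuming 61 roughly equal-sized layers, which isn't exactly right since the first few are dense and the rest MoE): when the run is disk-bound, offloading only removes those layers' reads from the critical path.

    # Why offloading a handful of layers barely moves the needle when the run is
    # disk-bound: the GPU only saves the disk reads for the layers that now sit
    # in VRAM. Assumption: 61 transformer layers of roughly equal size.

    total_layers = 61
    offloaded_layers = 4
    baseline_tps = 0.15

    disk_fraction_remaining = 1 - offloaded_layers / total_layers
    expected_tps = baseline_tps / disk_fraction_remaining
    print(f"expected: {expected_tps:.3f} tok/s ({1 / disk_fraction_remaining:.2f}x)")  # ~0.16, lost in noise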
My RAM is DDR4; maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/
For a 131GB model, the biggest difference would be to fit it all in RAM, e.g. by getting 192GB. Sorry if this is too obvious, but it's pointless to run an LLM that doesn't fit in RAM, even if it's an MoE model. And, also obviously, it may take a server motherboard and CPU to fit that much RAM.
I wonder if one could just replicate the "Mac mini LLM cluster" setup over Ethernet of some form, with 128GB of DDR4 per node. Used DDR4 with likely-dead bits is dirt cheap, but I'd imagine there would be challenges linking the systems together.
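For the networking part, a rough sketch under the assumption of a pipeline-style split (each node holds a contiguous slice of layers) with DeepSeek-V3's hidden size of 7168 sent as fp16 between nodes: only one small activation vector crosses the wire per token per node boundary.

    # Per-token network traffic for a pipeline-style split across nodes.
    # Assumptions: hidden size 7168 (DeepSeek-V3/R1), activations sent as fp16.

    hidden_size = 7168
    bytes_per_hop = hidden_size * 2              # one activation vector in fp16
    gige_bytes_per_s = 1e9 / 8                   # 1 GbE, theoretical

    print(f"per-token hop: {bytes_per_hop / 1024:.0f} KiB")
    print(f"1 GbE could carry ~{gige_bytes_per_s / bytes_per_hop:,.0f} hops/s")

So even plain gigabit Ethernet shouldn't be the bottleneck for single-stream decode; per-hop latency and each node's own RAM bandwidth would dominate. A tensor-parallel split is a different story, since that needs much chattier per-layer traffic between nodes.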
I wonder if the now-abandoned Intel Optane drives could help with this. They had very low latency, high IOPS, and decent throughput. Intel made Optane memory modules as well; a RAM disk made of them might be faster.
Intel PMem really shines for things you need to be non-volatile (preserved when the power goes out), like fast-changing rows in a database. As far as I understand it, "millions of transactions per second on a DB that can't fit in RAM" was/is the killer app for PMem.
Which suggests it wouldn't be quite the right fit here -- the precomputed constants in the model aren't changing, nor do they need to persist.
Still, interesting question, and I wonder if there's some other existing bit of tech that can be repurposed for this.
I wonder if/when this application (LLMs in general) will slow down and stabilize long enough for anything but general purpose components to make sense. Like, we could totally shove model parameters in some sort of ROM and have hardware offload for a transformer, IF it wasn't the case that 10 years from now we might be on to some other paradigm.
I imagine you can get more by striping drives. Depending on what chipset you have, the CPU should handle at least four NVMe drives. Sucks that no AM4 APU supports PCIe 4.0 while the rest of the platform does.
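Rough numbers for what striping could buy (my assumptions: the ~11GB-per-token estimate from earlier, and ~3.5 GB/s per Gen3 x4 drive, since the AM4 APUs are stuck on PCIe 3.0):

    # If sequential reads scale ~linearly across striped drives, the disk-bound
    # token ceiling scales with them. Assumptions: ~11 GB of active weights read
    # per token (see the earlier estimate), ~3.5 GB/s per Gen3 x4 NVMe drive.

    gb_per_token = 11.0
    per_drive_gbs = 3.5

    for n in (1, 2, 4):
        bw = n * per_drive_gbs
        print(f"{n} drive(s): ~{bw:.1f} GB/s -> ~{bw / gb_per_token:.2f} tok/s upper bound")

Four striped drives start to approach what dual-channel DDR4 can stream anyway, so past that point the rest of the platform becomes the limit.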
I have been doing this with an Epyc 7402 and 512GB of DDR4 and it's been fairly performant; you don't have to wait very long to get pretty good results. It's still LLM levels of bad, but at least I don't have to pay $20/mo to OpenAI.
But I use it for many other use cases and for hosting protocol-based services that would otherwise expose me to advertising or additional service charges. It's just like people who buy solar panels instead of buying service from the power company: you get the benefit with a multi-year ROI.
I built the machine for $5500 four years ago and it certainly has not paid for itself, but it still has tons of utility and will probably last another four years, bringing my monthly cost to ~$50/mo, which is way lower than what a cloud provider would charge, especially considering egress network traffic. Instead of paying Discord, Twitter, Netflix/Hulu/Amazon/etc., paid game hosting, and ChatGPT, I can self-host Jitsi/Matrix, Bluesky, Plex, SteamCMD, and ollama. In total I end up spending about the same, but I have way more control, better access to content, and can do more when offline during internet outages.
Thanks to Cloudflare Tunnel, I don't have to pay a cloud vendor, CDN, or VPN for good routes to my web resources, or opt into paid DDoS protection services. It's fantastic.
How many tokens per second do you typically get? I rent a Hetzner server with a 48-core EPYC 9454P, but it only has 256GB. Typical inference speed is in the ballpark of ~6 tokens/second with llama.cpp. The memory is some DDR5 type. I think I have the entire machine to myself as a dedicated server on a rack somewhere, but I don't know enough about hosting to say for certain.
I have to run the, uh, "crispier" very compressed version because otherwise it'll spill into swap. I use the 212GB .gguf one from Unsloth's page, with a name I can't remember off the top of my head, but I think it was the largest one they made using their specialized quantization for llama.cpp. JPEG-ified weights. Actually, I guess llama.cpp quantization is a bit closer to reducing the number of colors than to JPEG-style compression crispiness? GIF had reduced colors (256), IIRC. Heavily GIF-like-compressed artificial brains. GIF-brained model.
Just like you, I use it for tons of other things that have nothing to do with AI; it just happened to be convenient that DeepSeek-R1 came out and this thing can just barely run it, with enough quality to be coherent. Otherwise I mostly use it for hosting game servers for my friend groups and other random CPU-heavy projects.
I haven't investigated it myself, but I've noticed in passing that there's a person on llama.cpp and /r/localllama who is working on specialized CPU-optimized DeepSeek-R1 code, and I saw them asking for an EPYC machine for testing, with a specific configuration requested. IIRC they also said the optimized version needs new quants to get the speedups. So maybe this particular model will get some speedup if that effort succeeds.
You probably already know this / have done it, but just in case (or if someone else reading along isn't aware): if you click the "<x> ago" timestamp text on a comment, it forces the "vouch" button to appear.
I've also vouched, as it doesn't seem like a comment that deserves to be dead at all. For at least this instant, it looks like that was enough vouches to restore the comment.
I don't know the specifics of this, but vouching "against" the hive mind leads to your vouches not doing anything anymore. I assume there's either some threshold after which you're shadowbanned from vouching, or perhaps a kind of vouch weight: "correctly" vouching (the comment isn't re-flagged) increases it, while wrongly vouching (the comment remains flagged or is re-flagged) decreases it.
We sometimes take vouching privileges away from accounts that repeatedly vouch for comments that are bad for HN in the sense that they break the site guidelines. That's necessary in order for the system to function—you wouldn't believe some of the trollish and/or abusive stuff that some people vouch for. (Not to mention the usual tricks of using multiple accounts, etc.) But it's nothing to do with the hive mind and it isn't done by the software.
It wasn't downvoted - rather, the account is banned (https://news.ycombinator.com/item?id=42653007) and comments by banned accounts are [dead] by default unless users vouch for them (as described by zamadatix) or mods unkill them (which we do when we see good comments by banned accounts).
Btw, I agree that that was a good comment that deserved vouching! But of course we have to ban accounts because of the worst things they post, not the best.