This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single-socket Epyc server motherboard with 512GB of RAM.
This X thread [1] runs the 671B model at the original Q8 at 6-8 TPS for $6K, using a dual-socket Epyc server motherboard with 768GB of RAM. I think this could be made cheaper by getting slower RAM, but since this is RAM-bandwidth limited, that would likely reduce TPS. I'd be curious whether this would just be a linear slowdown proportional to the RAM speed, or whether CAS latency plays into it as well.

[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...
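A back-of-envelope sketch of the bandwidth ceiling, and of why the scaling should be roughly linear in transfer rate (the numbers below are my assumptions, not specs from that build): decode on an MoE model is dominated by streaming the active expert weights once per token, so tokens/s is bounded by memory bandwidth divided by active-weight bytes, and per-channel bandwidth is just the transfer rate times 8 bytes.

    # Rough ceiling for RAM-bandwidth-bound decode, and how it scales with DIMM
    # speed. All numbers are assumptions, not taken from the build: dual socket,
    # 12 DDR5 channels per socket, ~37B active params at ~1 byte/param (Q8).

    def peak_bw_gbs(sockets: int, channels: int, mt_per_s: int) -> float:
        # each DDR channel moves 8 bytes per transfer
        return sockets * channels * mt_per_s * 8 / 1e3

    gb_per_token = 37.0  # ~37B active params x 1 byte/param at Q8

    for mts in (4000, 4800, 5600):
        bw = peak_bw_gbs(sockets=2, channels=12, mt_per_s=mts)
        print(f"DDR5-{mts}: ~{bw:.0f} GB/s peak -> <= {bw / gb_per_token:.0f} tok/s ceiling")

Since these are long sequential streams, CAS latency should matter very little; to first order only the transfer rate and channel count do, so the ceiling scales roughly linearly with DIMM speed. The observed 6-8 TPS sits well under these theoretical ceilings, so presumably part of the gap is software/NUMA overhead rather than raw bandwidth.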
I've been running the Unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (It looks like it needs ~40GB of actual memory that it can't easily mmap from disk.) With a Samsung 970 Evo Plus, that gave me 2.5GB/s read speed.
That came out at 0.15 tps. Not bad for completely underspecced hardware.
Given the model has so few active parameters per token (~40B), it's likely that just being able to hold it all in memory would remove the largest bottleneck.
I guess with a single consumer PCIe 4.0 x16 graphics card you could get at most ~1 tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
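Continuing the napkin math (assumed numbers, not measurements): if every token has to pull the active parameters across whatever link the weights sit behind, the link bandwidth gives a hard upper bound on tokens/s. With the ~200GB quant that's roughly 0.3 bytes per parameter, or ~11GB per token.

    # Rough upper bounds on tokens/s when each token has to stream the active
    # expert weights over some link. Assumed numbers (mine, not measured):
    # ~37B active params (the thread rounds to ~40B), and a ~200GB file for a
    # 671B-param model, i.e. ~0.3 bytes/param after dynamic quantization.

    active_params = 37e9
    bytes_per_param = 200 / 671                            # file size / total params
    gb_per_token = active_params * bytes_per_param / 1e9   # ~11 GB per token

    links_gbs = {
        "Samsung 970 Evo Plus sequential read": 2.5,
        "PCIe 4.0 x16 (theoretical)": 32.0,
        "dual-channel DDR4-3200 (theoretical)": 51.2,
    }
    for name, bw in links_gbs.items():
        print(f"{name}: ~{bw / gb_per_token:.2f} tok/s upper bound")

That puts the SSD bound at roughly 0.2 tps (close to the observed 0.15), the PCIe 4.0 x16 bound around 2-3 tps, and ordinary dual-channel DDR4 noticeably ahead of both, which matches the intuition that CPU inference from RAM beats streaming weights to the GPU every token.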
To add another data point, I've been running the 131GB (140GB on disk) 1.58-bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed: around 0.15 tps on average, sometimes dropping to 0.11 tps, occasionally going up to 0.16 tps. Roughly half your specs, a roughly half-size quant, same tps.
I had to disable the overload safeties in LM Studio and tweak some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!
I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.
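A quick sketch of why a 4-layer offload is invisible in the numbers (assuming 61 roughly equal-sized layers, which isn't exactly right since the first few are dense and the rest MoE): when the run is disk-bound, offloading only removes those layers' reads from the critical path.

    # Why offloading a handful of layers barely moves the needle when the run is
    # disk-bound: the GPU only saves the disk reads for the layers that now sit
    # in VRAM. Assumption: 61 transformer layers of roughly equal size.

    total_layers = 61
    offloaded_layers = 4
    baseline_tps = 0.15

    disk_fraction_remaining = 1 - offloaded_layers / total_layers
    expected_tps = baseline_tps / disk_fraction_remaining
    print(f"expected: {expected_tps:.3f} tok/s ({1 / disk_fraction_remaining:.2f}x)")  # ~0.16, lost in noise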
My RAM is DDR4; maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/
For a 131GB model, the biggest difference would be to fit it all in RAM, e.g. by getting 192GB. Sorry if this is too obvious, but it's pointless to run an LLM that doesn't fit in RAM, even if it's an MoE model. And, also obviously, it may take a server motherboard and CPU to fit that much RAM.
I wonder if one could just replicate the "Mac mini LLM cluster" setup over Ethernet of some form, with 128GB of DDR4 per node. Used DDR4 with likely-dead bits is dirt cheap, but I'd imagine there would be challenges linking the systems together.
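For the networking part, a rough sketch under the assumption of a pipeline-style split (each node holds a contiguous slice of layers) with DeepSeek-V3's hidden size of 7168 sent as fp16 between nodes: only one small activation vector crosses the wire per token per node boundary.

    # Per-token network traffic for a pipeline-style split across nodes.
    # Assumptions: hidden size 7168 (DeepSeek-V3/R1), activations sent as fp16.

    hidden_size = 7168
    bytes_per_hop = hidden_size * 2              # one activation vector in fp16
    gige_bytes_per_s = 1e9 / 8                   # 1 GbE, theoretical

    print(f"per-token hop: {bytes_per_hop / 1024:.0f} KiB")
    print(f"1 GbE could carry ~{gige_bytes_per_s / bytes_per_hop:,.0f} hops/s")

So even plain gigabit Ethernet shouldn't be the bottleneck for single-stream decode; per-hop latency and each node's own RAM bandwidth would dominate. A tensor-parallel split is a different story, since that needs much chattier per-layer traffic between nodes.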
I wonder if the now-abandoned Intel Optane drives could help with this. They had very low latency, high IOPS, and decent throughput. Intel made Optane memory modules as well; a RAM disk made of them might be faster.
Intel PMem really shines for things you need to be non-volatile (preserved when the power goes out), like fast-changing rows in a database. As far as I understand it, "millions of transactions per second on a DB that can't fit in RAM" was/is the killer app for PMem.
Which suggests it wouldn't be quite the right fit here -- the precomputed constants in the model aren't changing, nor do they need to persist.
Still, interesting question, and I wonder if there's some other existing bit of tech that can be repurposed for this.
I wonder if/when this application (LLMs in general) will slow down and stabilize long enough for anything but general purpose components to make sense. Like, we could totally shove model parameters in some sort of ROM and have hardware offload for a transformer, IF it wasn't the case that 10 years from now we might be on to some other paradigm.
I imagine you can get more by striping drives. Depending on what chipset you have, the CPU should handle at least four NVMe drives. Sucks that no AM4 APU supports PCIe 4.0 while the rest of the platform does.
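Rough numbers for what striping could buy (my assumptions: the ~11GB-per-token estimate from earlier, and ~3.5 GB/s per Gen3 x4 drive, since the AM4 APUs are stuck on PCIe 3.0):

    # If sequential reads scale ~linearly across striped drives, the disk-bound
    # token ceiling scales with them. Assumptions: ~11 GB of active weights read
    # per token (see the earlier estimate), ~3.5 GB/s per Gen3 x4 NVMe drive.

    gb_per_token = 11.0
    per_drive_gbs = 3.5

    for n in (1, 2, 4):
        bw = n * per_drive_gbs
        print(f"{n} drive(s): ~{bw:.1f} GB/s -> ~{bw / gb_per_token:.2f} tok/s upper bound")

Four striped drives start to approach what dual-channel DDR4 can stream anyway, so past that point the rest of the platform becomes the limit.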
I have been doing this with an Epyc 7402 and 512GB of DDR4 and it's been fairly performant; you don't have to wait very long to get pretty good results. It's still LLM levels of bad, but at least I don't have to pay $20/mo to OpenAI.
But I use it for many other use cases and for hosting protocol-based services that would otherwise expose me to advertising or additional service charges. It's just like people who buy solar panels instead of buying service from the power company: you get the benefit with a multi-year ROI.
I built the machine for $5500 four years ago and it certainly has not paid for itself, but it still has tons of utility and will probably last another four years, bringing my monthly cost to ~$50/mo, which is way lower than what a cloud provider would charge, especially considering egress network traffic. Instead of paying Discord, Twitter, Netflix/Hulu/Amazon/etc., paid game hosting, and ChatGPT, I can self-host Jitsi/Matrix, Bluesky, Plex, SteamCMD, and ollama. In total I end up spending about the same, but I have way more control, better access to content, and can do more when offline during internet outages.
Thanks to Cloudflare Tunnel, I don't have to pay a cloud vendor, CDN, or VPN for good routes to my web resources, or opt into paid DDoS protection services. It's fantastic.
How many tokens per second do you typically get? I rent a Hetzner server with a 48-core EPYC 9454P, but it only has 256GB. Typical inference speed is in the ballpark of ~6 tokens/second with llama.cpp. The memory is some DDR5 type. I think I have the entire machine to myself as a dedicated server on a rack somewhere, but I don't know enough about hosting to say for certain.
I have to run the, uh, "crispier" very compressed version because otherwise it'll spill into swap. I use the 212GB .gguf one from Unsloth's page, with a name I can't remember off the top of my head, but I think it was the largest one they made using their specialized quantization for llama.cpp. JPEG-ified weights. Actually, I guess llama.cpp quantization is a bit closer to reducing the number of colors than to JPEG-style compression crispiness? GIF had reduced colors (256), IIRC. Heavily GIF-like-compressed artificial brains. GIF-brained model.
Just like you, I use it for tons of other things that have nothing to do with AI; it just happened to be convenient that DeepSeek-R1 came out and this thing can just barely run it, with enough quality to be coherent. Otherwise I mostly use it for hosting game servers for my friend groups and other random CPU-heavy projects.
I haven't investigated it myself, but I've noticed in passing that there's a person on llama.cpp and /r/localllama who is working on specialized CPU-optimized DeepSeek-R1 code, and I saw them asking for an EPYC machine for testing, with a specific configuration requested. IIRC they also said the optimized version needs new quants to get the speedups. So maybe this particular model will get some speedup if that effort succeeds.
You probably already know this / have done it, but just in case (or if someone else reading along isn't aware): if you click the "<x> ago" timestamp text on a comment, it forces the "vouch" button to appear.
I've also vouched, as it doesn't seem like a comment that deserves to be dead at all. For at least this instant, it looks like that was enough vouches to restore the comment.
I don't know the specifics of this, but vouching "against" the hive mind leads to your vouches not doing anything anymore. I assume there's either some threshold after which you're shadowbanned from vouching, or perhaps a kind of vouch weight: "correctly" vouching (the comment isn't re-flagged) increases it, while wrongly vouching (the comment remains flagged or is re-flagged) decreases it.
We sometimes take vouching privileges away from accounts that repeatedly vouch for comments that are bad for HN in the sense that they break the site guidelines. That's necessary in order for the system to function—you wouldn't believe some of the trollish and/or abusive stuff that some people vouch for. (Not to mention the usual tricks of using multiple accounts, etc.) But it's nothing to do with the hive mind and it isn't done by the software.
It wasn't downvoted - rather, the account is banned (https://news.ycombinator.com/item?id=42653007) and comments by banned accounts are [dead] by default unless users vouch for them (as described by zamadatix) or mods unkill them (which we do when we see good comments by banned accounts).
Btw, I agree that that was a good comment that deserved vouching! But of course we have to ban accounts because of the worst things they post, not the best.