But I use it for many other things too, including hosting protocol-based services that would otherwise expose me to advertising or extra service charges. It's like people who buy solar panels instead of buying power from the utility: you pay up front and get the benefit back over a multi-year ROI.
I built the machine for $5500 four years ago and it certainly has not paid for itself, but it still has tons of utility and will probably last another four years, bringing my amortized cost to roughly $5500 / 96 months ≈ $57/mo, which is way lower than what a cloud provider would charge, especially once you account for egress traffic. Instead of paying for Discord, Twitter, Netflix/Hulu/Amazon/etc, paid game hosting, and ChatGPT, I can self-host Jitsi/Matrix, Bluesky, Plex, SteamCMD, and ollama. In total I end up spending about the same, but I have way more control, better access to content, and I lose less functionality during internet outages.
Thanks to Cloudflare Tunnel, I don't have to pay a cloud vendor, CDN, or VPN for good routes to my web resources, or opt into paid DDoS protection services. It's fantastic.
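For the curious: a tunnel is just the cloudflared daemon plus an ingress config that maps public hostnames to local services. A minimal sketch, with placeholder tunnel ID, hostnames, and ports rather than my actual setup:

    # /etc/cloudflared/config.yml -- example only; tunnel ID,
    # hostnames, and ports are placeholders
    tunnel: <your-tunnel-id>
    credentials-file: /etc/cloudflared/<your-tunnel-id>.json
    ingress:
      - hostname: plex.example.com
        service: http://localhost:32400   # Plex's default port
      - hostname: chat.example.com
        service: http://localhost:8008    # e.g. a Matrix homeserver
      - service: http_status:404          # required catch-all rule

Run cloudflared with that config and Cloudflare's edge handles TLS, routing, and DDoS absorption, with no inbound ports opened on my router.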
How many tokens per second do you get, typically? I rent a Hetzner server with a 48-core EPYC 9454P, but it only has 256GB of RAM. Typical inference speed is in the ballpark of ~6 tokens/second with llama.cpp. The memory is some flavor of DDR5. I think I have the entire machine to myself as a dedicated server on a rack somewhere, but I don't know enough about hosting to say for certain.
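If anyone wants to compare numbers, llama.cpp's bundled llama-bench tool is the proper way; a quick-and-dirty measurement through the llama-cpp-python bindings looks something like this (model path and thread count are placeholders for your own setup):

    # Rough tokens/second measurement via llama-cpp-python
    # (pip install llama-cpp-python). Model path and thread count
    # are placeholders -- adjust for your own machine.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/deepseek-r1-quant.gguf",  # placeholder path
        n_ctx=4096,
        n_threads=48,  # roughly one thread per physical core
    )

    start = time.time()
    out = llm("Explain the difference between TCP and UDP.", max_tokens=256)
    elapsed = time.time() - start

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tok/s")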
I have to run the, uh, "crispier" very compressed version because otherwise it'll spill into swap. I use the 212GB .gguf from Unsloth's page, with a name I can't remember off the top of my head, but I think it was the largest one they made using their specialized quantization for llama.cpp. Jpeggified weights. Actually, I guess llama.cpp quantization is a bit closer to reducing the number of colors than to JPEG-style compression crispiness? GIF had a reduced palette (256 colors) IIRC. Heavily GIF-compressed artificial brains. Gifbrained model.
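The palette analogy is roughly right: quantization maps the many distinct float values in a block of weights onto a small set of levels, the way GIF maps arbitrary colors onto a 256-entry palette. A toy sketch of the idea, not llama.cpp's actual quant formats (those are block-wise with more tricks):

    # Toy weight quantization: squash float weights onto 2**bits levels,
    # like GIF squashes arbitrary colors onto a 256-entry palette.
    # Illustration only -- llama.cpp's real quants are fancier.
    import numpy as np

    def quantize(w: np.ndarray, bits: int = 4):
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / (2**bits - 1)
        codes = np.round((w - lo) / scale).astype(np.uint8)  # "palette indices"
        return codes, lo, scale

    def dequantize(codes, lo, scale):
        return lo + codes.astype(np.float32) * scale

    w = np.random.randn(8).astype(np.float32)
    codes, lo, scale = quantize(w)
    print(w)
    print(dequantize(codes, lo, scale))  # close to the original, but crispier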
Just like you, I use it for tons of other things that have nothing to do with AI; it just happened to be convenient that Deepseek-R1 came out and this thing can just barely run it, with enough quality to be coherent. Otherwise my use is mostly hosting game servers for my friend groups, or other random CPU-heavy projects.
I haven't investigated it myself, but I've noticed in passing that there's a person on llama.cpp and /r/LocalLLaMA working on specialized CPU-optimized Deepseek-R1 code; I saw them asking for an EPYC machine for testing, with a specific configuration requested. IIRC they also said the optimized version needs new quants to get the speeds. So maybe this particular model will get some speedup if that effort succeeds.