But I use it for many other things too, including hosting protocol-based services that would otherwise expose me to advertising or extra service charges. It's like people who buy solar panels instead of buying power from the utility: you pay up front and get the benefit back over a multi-year ROI.
I built the machine for $5500 four years ago and it certainly has not paid for itself, but it still has tons of utility and will probably last another four years, bringing my amortized cost to roughly $5500 / 96 months ≈ $57/mo, which is way lower than what a cloud provider would charge, especially once you account for egress traffic. Instead of paying for Discord, Twitter, Netflix/Hulu/Amazon/etc, paid game hosting, and ChatGPT, I can self-host Jitsi/Matrix, Bluesky, Plex, SteamCMD, and ollama. In total I end up spending about the same, but I have way more control, better access to content, and I lose less functionality during internet outages.
Thanks to Cloudflare Tunnel, I don't have to pay a cloud vendor, CDN, or VPN for good routes to my web resources, or opt into paid DDoS protection services. It's fantastic.
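For the curious: a tunnel is just the cloudflared daemon plus an ingress config that maps public hostnames to local services. A minimal sketch, with placeholder tunnel ID, hostnames, and ports rather than my actual setup:

    # /etc/cloudflared/config.yml -- example only; tunnel ID,
    # hostnames, and ports are placeholders
    tunnel: <your-tunnel-id>
    credentials-file: /etc/cloudflared/<your-tunnel-id>.json
    ingress:
      - hostname: plex.example.com
        service: http://localhost:32400   # Plex's default port
      - hostname: chat.example.com
        service: http://localhost:8008    # e.g. a Matrix homeserver
      - service: http_status:404          # required catch-all rule

Run cloudflared with that config and Cloudflare's edge handles TLS, routing, and DDoS absorption, with no inbound ports opened on my router.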
How many tokens per second do you get, typically? I rent a Hetzner server with a 48-core EPYC 9454P, but it only has 256GB of RAM. Typical inference speed is in the ballpark of ~6 tokens/second with llama.cpp. The memory is some flavor of DDR5. I think I have the entire machine to myself as a dedicated server on a rack somewhere, but I don't know enough about hosting to say for certain.
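If anyone wants to compare numbers, llama.cpp's bundled llama-bench tool is the proper way; a quick-and-dirty measurement through the llama-cpp-python bindings looks something like this (model path and thread count are placeholders for your own setup):

    # Rough tokens/second measurement via llama-cpp-python
    # (pip install llama-cpp-python). Model path and thread count
    # are placeholders -- adjust for your own machine.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/deepseek-r1-quant.gguf",  # placeholder path
        n_ctx=4096,
        n_threads=48,  # roughly one thread per physical core
    )

    start = time.time()
    out = llm("Explain the difference between TCP and UDP.", max_tokens=256)
    elapsed = time.time() - start

    n = out["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> {n / elapsed:.2f} tok/s")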
I have to run the, uh, "crispier" very compressed version because otherwise it'll spill into swap. I use the 212GB .gguf from Unsloth's page, with a name I can't remember off the top of my head, but I think it was the largest one they made using their specialized quantization for llama.cpp. Jpeggified weights. Actually, I guess llama.cpp quantization is a bit closer to reducing the number of colors than to JPEG-style compression crispiness? GIF had a reduced palette (256 colors) IIRC. Heavily GIF-compressed artificial brains. Gifbrained model.
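The palette analogy is roughly right: quantization maps the many distinct float values in a block of weights onto a small set of levels, the way GIF maps arbitrary colors onto a 256-entry palette. A toy sketch of the idea, not llama.cpp's actual quant formats (those are block-wise with more tricks):

    # Toy weight quantization: squash float weights onto 2**bits levels,
    # like GIF squashes arbitrary colors onto a 256-entry palette.
    # Illustration only -- llama.cpp's real quants are fancier.
    import numpy as np

    def quantize(w: np.ndarray, bits: int = 4):
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / (2**bits - 1)
        codes = np.round((w - lo) / scale).astype(np.uint8)  # "palette indices"
        return codes, lo, scale

    def dequantize(codes, lo, scale):
        return lo + codes.astype(np.float32) * scale

    w = np.random.randn(8).astype(np.float32)
    codes, lo, scale = quantize(w)
    print(w)
    print(dequantize(codes, lo, scale))  # close to the original, but crispier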
Just like you, I use it for tons of other things that have nothing to do with AI; it just happened to be convenient that Deepseek-R1 came out and this thing can just barely run it, with enough quality to be coherent. Otherwise my use is mostly hosting game servers for my friend groups, or other random CPU-heavy projects.
I haven't investigated it myself, but I've noticed in passing that there's a person on llama.cpp and /r/LocalLLaMA working on specialized CPU-optimized Deepseek-R1 code; I saw them asking for an EPYC machine for testing, with a specific configuration requested. IIRC they also said the optimized version needs new quants to get the speeds. So maybe this particular model will get some speedup if that effort succeeds.