I am surprised that this claim keeps getting made, given the observed prices.
Even if one thinks that the losses of the big model providers come from selling below operating costs (rather than below operating costs plus training costs plus the cost of growth), even big open-weights models that need beefy machines look like they eventually* amortise the hardware cost so low that electricity is what matters; so when (and *only* when) the quality is good enough, inference is cheaper than the food needed to have a human work for peanuts — and I mean literally peanuts, not metaphorical peanuts, as in the calories and protein content of enough bags of peanuts to not die.
* This would not happen if computers were still following the improvement trends of the 90s, because then we'd be replacing them every few years; a £10k machine that you replace every 3 years costs you £9.13/day even if it does nothing.
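A minimal sketch of that amortisation arithmetic in Python (straight-line over the lifetime, no resale value assumed):

    # Straight-line amortisation of a £10k machine over 3 years, no resale value.
    machine_cost_gbp = 10_000
    lifetime_days = 3 * 365
    print(f"£{machine_cost_gbp / lifetime_days:.2f}/day")  # -> £9.13/day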
https://www.tesco.com/groceries/en-GB/products/300283810 -> £0.59 per bag * (2,500 kcal per day / 645 kcal per bag) = £2.29/day; then combine your pick of model, home-server hardware, electricity costs etc. with your estimate of how many useful tokens a human produces in the 8,760 hours of a calendar year, given your assumptions about hours per working week and days of holiday or sick leave.
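The same back-of-envelope in Python (price and kcal-per-bag from the listing above; 2,500 kcal/day is the usual adult guideline):

    # Cheapest-survival cost: bags of peanuts needed to hit 2,500 kcal/day.
    price_per_bag_gbp = 0.59
    kcal_per_bag = 645
    kcal_per_day = 2_500
    bags_per_day = kcal_per_day / kcal_per_bag              # ~3.9 bags
    print(f"£{price_per_bag_gbp * bags_per_day:.2f}/day")   # -> £2.29/day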
I know that even just order-of 100k useful tokens per day is implausible for any human, because that would be like writing a novel a day, every day; and this article (https://aichatonline.org/blog-lets-run-openai-gptoss-officia...) claims a Mac Studio can output 65.9 tokens/second = 65.9 * 3600 * 24 = 5,693,760 tokens/day, or ~2e9 tokens/year; compare that to a deliberate over-estimate of human output (100k/day * 5 days a week * 47 weeks a year = 2.35e7/year).
The top-end Mac Studio has a maximum power draw of 270 W (https://support.apple.com/en-us/102027). 270 W for *at least (2e9 / 2.35e7) ≈ 85 times* the quantity of output that a human can do with 100 W (this only matters when the quality is sufficient, and as we all know AI often isn't that good yet) is a bit over 31 times the raw energy efficiency, and electricity is much cheaper than calories — cheaper food than peanuts could get the cost of the human down to perhaps £1/day, but even £1/day is equivalent to electricity costing £1/(24 hours * 100 W) = £0.4167/kWh.
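Putting the throughput and power figures together (token rates and wattages from the links above; the 100 W human and the £1/day food floor are the assumptions already stated, not measured facts):

    # Quantity: Mac Studio token output vs a deliberately generous human estimate.
    mac_tokens_per_year = 65.9 * 3600 * 24 * 365        # ~2.08e9
    human_tokens_per_year = 100_000 * 5 * 47            # 2.35e7
    quantity_ratio = mac_tokens_per_year / human_tokens_per_year
    print(f"quantity ratio ~{quantity_ratio:.0f}x")     # ~88x; "at least 85x" after rounding down to 2e9

    # Energy efficiency: 270 W machine vs 100 W human.
    efficiency_ratio = quantity_ratio * 100 / 270
    print(f"energy efficiency ~{efficiency_ratio:.0f}x")  # ~33x here; ~31x with the rounded 85x figure

    # Break-even electricity price if a human could somehow be fed for £1/day.
    human_kwh_per_day = 0.100 * 24                      # 2.4 kWh of food energy per day
    print(f"£{1.0 / human_kwh_per_day:.4f}/kWh")        # -> £0.4167/kWh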
Running a local model is not an apples-to-apples comparison. Yes, if you run a small model 24/7, don't care about output latency, and utilization is completely static with no bursts, then it can look cheap. But most people want output now, not in 10 hours. And they want it from the best models. And they want large context windows. And when you combine that with serving millions of users, it gets complicated and expensive.
Yes, but usage is not uniform even when you have millions of users. Aggregation smooths the usage curve, but the absolute peaks and troughs become more extreme the more users you have. At 3am, usage in the US drops to effectively zero. Maybe you can use that compute for Asian customers, but then you compete with local compute that has far better latency.
Then you have seasonal peaks/troughs, such as the school year vs summer.
When you want 4 9s of uptime and good latency, you either have to overprovision hardware and eat idling costs, or rent compute and pay overhead. Both cost a lot.