GPU cost ÷ GPU token throughput = cost per token. That will get you close; then you compare the cost per token between solutions.
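A quick back-of-envelope version of that, in Python (every figure below is a placeholder assumption for illustration, not a benchmark):

    # Back-of-envelope sketch; all numbers are assumptions, not measurements.
    gpu_cost_per_hour = 2.00    # assumed on-demand price for one GPU, USD/hour
    tokens_per_second = 1500    # assumed aggregate generation throughput on that GPU

    tokens_per_hour = tokens_per_second * 3600
    cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"~${cost_per_million_tokens:.2f} per 1M tokens")  # ~$0.37 with these numbers

Plug in your real GPU price and measured throughput, then put the resulting number next to an API provider's per-token pricing.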

To explain how to calculate it precisely, I would have to write a blog post. There are dozens of factors, and they vary with your use case: GPU type and setup, cloud provider, inference engine, context size, the minimum throughput and latency you're willing to have your users experience, LLM quantization, KV cache configuration, and so on.
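As a rough sketch of how a couple of those factors fold into the same formula, here is a hypothetical helper (the function, the utilization knob, and all numbers are my own assumptions for illustration, not a real methodology):

    def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec, utilization=1.0):
        # Effective cost per 1M tokens, discounted by how much of the GPU's
        # peak throughput your traffic actually sustains (batching, idle time).
        effective_tps = tokens_per_sec * utilization
        return gpu_hourly_usd / (effective_tps * 3600) * 1_000_000

    # Comparing two made-up setups (illustrative numbers only):
    self_hosted = cost_per_million_tokens(gpu_hourly_usd=2.50, tokens_per_sec=1200, utilization=0.6)
    api_price = 0.60  # assumed provider list price per 1M tokens
    print(f"self-hosted ~${self_hosted:.2f}/1M tokens vs API ${api_price:.2f}/1M tokens")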

I'm sure there are cost analyses out there for Llama 3.1 70B that you could find, though.



