
Could you clarify what running models multi-tenant means?



Borrowing Matt's words from our Reddit thread:

It means that we can spin up a single model server and use it for multiple people, effectively splitting the cost. Whereas if you try to rent the GPUs yourself on something like Runpod, you'll end up paying much more since you're the only person using the model.

- Billy
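The cost-splitting idea above is simple division: each tenant pays an equal share of one shared model server instead of the full price of a dedicated GPU. A minimal sketch, with hypothetical prices and tenant counts (neither figure is from the thread):

```python
# Hypothetical numbers: renting a GPU alone vs. splitting one
# shared model server across several tenants.
GPU_COST_PER_HOUR = 2.00  # assumed hourly rate for a single GPU


def cost_per_tenant(hourly_rate: float, tenants: int) -> float:
    """Each tenant pays an equal share of the shared server."""
    return hourly_rate / tenants


solo = cost_per_tenant(GPU_COST_PER_HOUR, 1)    # renting alone
shared = cost_per_tenant(GPU_COST_PER_HOUR, 8)  # 8 tenants sharing
print(f"solo: ${solo:.2f}/hr, shared: ${shared:.2f}/hr")
```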


vLLM is nice because you can serve multiple streams at once. The technique is continuous batching (built on vLLM's PagedAttention for efficient KV-cache memory). When scaling up inference, this feature is key: it can be the difference between 70 tokens per second and 500 tokens per second on the same compute.

My assumption is that this capability is what they're calling multi-tenant.
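A toy model of why batching lifts throughput so much: a single decode step is largely memory-bandwidth-bound, so running it over a batch of sequences costs roughly the same wall-clock time as running it for one, and tokens per second scale with the number of concurrent streams. The step latencies below are illustrative assumptions, chosen to reproduce the 70-vs-500 figures mentioned above:

```python
# Toy model of batched decoding throughput. One decode step emits
# one token per sequence in the batch, and the step's wall-clock
# cost grows only slightly with batch size (memory-bound regime).
STEP_MS_SINGLE = 14.0   # assumed latency of one decode step, batch=1
STEP_MS_BATCHED = 16.0  # assumed latency of one decode step, batch=8


def tokens_per_second(batch_size: int, step_ms: float) -> float:
    """Tokens emitted per second across all sequences in the batch."""
    return batch_size * 1000.0 / step_ms


serial = tokens_per_second(1, STEP_MS_SINGLE)    # ~71 tok/s
batched = tokens_per_second(8, STEP_MS_BATCHED)  # 500 tok/s
print(f"batch=1: {serial:.0f} tok/s, batch=8: {batched:.0f} tok/s")
```

The point of the sketch is that throughput scales nearly linearly with batch size until the GPU becomes compute-bound, which is exactly what makes one server cheap to share across tenants.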



