
Could you clarify what running models multi-tenant means?



Borrowing Matt's words from our Reddit thread:

It means that we can spin up a single model server and use it for multiple people, effectively splitting the cost. Whereas if you try to rent the GPUs yourself on something like Runpod, you'll end up paying much more since you're the only person using the model.

- Billy
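The cost-splitting idea above is simple division: each tenant pays an equal share of one shared model server instead of the full price of a dedicated GPU. A minimal sketch, with hypothetical prices and tenant counts (neither figure is from the thread):

```python
# Hypothetical numbers: renting a GPU alone vs. splitting one
# shared model server across several tenants.
GPU_COST_PER_HOUR = 2.00  # assumed hourly rate for a single GPU


def cost_per_tenant(hourly_rate: float, tenants: int) -> float:
    """Each tenant pays an equal share of the shared server."""
    return hourly_rate / tenants


solo = cost_per_tenant(GPU_COST_PER_HOUR, 1)    # renting alone
shared = cost_per_tenant(GPU_COST_PER_HOUR, 8)  # 8 tenants sharing
print(f"solo: ${solo:.2f}/hr, shared: ${shared:.2f}/hr")
```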


vLLM is nice because you can serve multiple streams at once. The technique is continuous batching (built on vLLM's PagedAttention for efficient KV-cache memory). When scaling up inference, this feature is key: it can be the difference between 70 tokens per second and 500 tokens per second on the same compute.

My assumption is that this capability is what they're calling multi-tenant.
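A toy model of why batching lifts throughput so much: a single decode step is largely memory-bandwidth-bound, so running it over a batch of sequences costs roughly the same wall-clock time as running it for one, and tokens per second scale with the number of concurrent streams. The step latencies below are illustrative assumptions, chosen to reproduce the 70-vs-500 figures mentioned above:

```python
# Toy model of batched decoding throughput. One decode step emits
# one token per sequence in the batch, and the step's wall-clock
# cost grows only slightly with batch size (memory-bound regime).
STEP_MS_SINGLE = 14.0   # assumed latency of one decode step, batch=1
STEP_MS_BATCHED = 16.0  # assumed latency of one decode step, batch=8


def tokens_per_second(batch_size: int, step_ms: float) -> float:
    """Tokens emitted per second across all sequences in the batch."""
    return batch_size * 1000.0 / step_ms


serial = tokens_per_second(1, STEP_MS_SINGLE)    # ~71 tok/s
batched = tokens_per_second(8, STEP_MS_BATCHED)  # 500 tok/s
print(f"batch=1: {serial:.0f} tok/s, batch=8: {batched:.0f} tok/s")
```

The point of the sketch is that throughput scales nearly linearly with batch size until the GPU becomes compute-bound, which is exactly what makes one server cheap to share across tenants.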



