Ah, we're running a medium amount of compute at zero margin. The point is not to go sell the Fortune 500, but to make sure a grad student can spend a $50k grant.
Right now, it's pretty easy to get a few A100s/H100s (Lambda is great for this), but very hard to get more than 24 at a reasonable price (~$2 an hour). You often need to put up a 6+ month commitment, even if all you actually want is an 8-hour training run.
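To put rough numbers on that gap (back-of-envelope only, using the ~$2/hr figure and a 24-GPU block from above):

    # rough math, using the ~$2/GPU-hour and 24-H100 numbers above
    rate, gpus = 2.0, 24
    one_run = gpus * 8 * rate            # a single 8-hour training run: ~$384
    six_months = gpus * 24 * 182 * rate  # a 6-month reservation: ~$210,000
    print(one_run, six_months)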
It's the right business decision for GPU brokers to do long-term reservations and so on, and we'd probably do the same if we were in their shoes. But we're not in their shoes, and we have a very different goal: arm the rebels! Let someone who isn't BigCorp train a model!
> but to make sure a grad student can spend a $50k grant.
As a graduate student, thank you. Thankfully my workloads aren't LLM-crazy, so I can get by on my old NVIDIA consumer hardware, but I have coworkers struggling to get reasonable prices and availability for larger-scale hardware.
So what happens when some big-bucks, VC-backed, closed-source LLM company buys all your compute inventory for the next 5 years? This is not that unlikely: a little while back, Lambda Labs was completely sold out of compute inventory.
Yeah we aren’t going to let anyone book the whole thing for years. If we ever have to make the choice, we’ll choose the startups over the big companies.
Very similar price, but from what I gather a very different model. One important difference might be if you regularly run short-ish training runs over many GPUs. Lambda Labs might not have 256 instances to give you right now. With OP you're basically buying the right to put jobs in the job queue for their 512-GPU cluster, so running a job that needs 256 GPUs isn't an issue (though you might wait behind someone running a 512-GPU job).
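If it helps to picture it, my mental model (just my reading of how OP described it, not their actual scheduler) is a FIFO queue over a fixed 512-GPU pool, roughly:

    # toy FIFO scheduler over a fixed 512-GPU pool -- purely illustrative,
    # not OP's actual implementation
    import heapq

    CLUSTER_GPUS = 512
    queue = [("job-a", 256, 8), ("job-b", 512, 4), ("job-c", 64, 2)]  # (name, gpus, hours)

    free, hour = CLUSTER_GPUS, 0
    running = []  # min-heap of (finish_hour, gpus, name)

    while queue or running:
        # return GPUs from anything that has finished by now
        while running and running[0][0] <= hour:
            _, gpus, _ = heapq.heappop(running)
            free += gpus
        # start queued jobs in order, as long as they fit in the free pool
        while queue and queue[0][1] <= free:
            name, gpus, hours = queue.pop(0)
            free -= gpus
            heapq.heappush(running, (hour + hours, gpus, name))
            print(f"hour {hour}: started {name} on {gpus} GPUs")
        hour += 1

So a 256-GPU job doesn't need 256 standing instances to exist anywhere; it just waits until that much of the pool frees up, and the parenthetical above is the head-of-line-blocking case.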
No idea what capacity at Lambda Labs actually looks like, though. Does anyone have insight into how easy it is to spin up more than 2-3 instances there?
Yeah, it's pretty hard to find a big block of GPUs that you can use for a short time, especially if you need InfiniBand for multi-node training. Lambda, I think, requires a minimum reservation of 6-12 months if you want IB.
My question too. At ~$2/hr per H100 this seems more flexible? But I haven't tried to get 10k GPU-hours on any of these services; maybe that's where the bottleneck is.
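For scale (back-of-envelope, and assuming the ~$2/hr figure actually holds at that volume):

    # what 10k GPU-hours looks like at an assumed flat ~$2/GPU-hour
    gpu_hours, rate = 10_000, 2.0
    print(gpu_hours * rate)   # ~$20,000 total
    print(gpu_hours / 256)    # ~39 hours of wall-clock on 256 GPUs
    print(gpu_hours / 512)    # ~20 hours on the full 512-GPU cluster

So the spend itself is modest; the question is whether you can actually get a few hundred GPUs at once for a day or two.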