"Semi analysis did some cost estimates, and I did some but you’re likely paying somewhere in the 12 million dollar range for the equipment to serve a single query using llama-70b. Compare that to a couple of gpus, and it’s easy to see why they are struggling to sell hardware, they can’t scale down.
Since they didn't use HBM, you need to stitch enough cards together to get the memory to hold your model. It takes a lot of 256 MB cards to get to 64 GB, and there isn't a good way to try the tech out, since a single rack really can't serve an LLM."
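For a rough sense of the memory math in that quote, here's a back-of-envelope sketch. Only the 256 MB per-card figure and the 70B parameter count come from the quote; the fp16/int8 byte sizes and the GPU comparison are my own assumptions, and real deployments also need room for KV cache and activations:

```python
import math

# Back-of-envelope: how many 256 MB SRAM cards does it take just to hold the weights?
# Illustrative only; ignores KV cache, activations, and replication across pipeline stages.

def cards_needed(params_billion: float, bytes_per_param: float, card_mem_gb: float) -> int:
    """Minimum number of cards whose combined memory fits the model weights."""
    weights_gb = params_billion * bytes_per_param  # ~1 GB per billion params per byte
    return math.ceil(weights_gb / card_mem_gb)

CARD_SRAM_GB = 0.256  # 256 MB of on-chip memory per card, no HBM

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0)]:
    weights_gb = 70 * bytes_per_param
    n = cards_needed(70, bytes_per_param, CARD_SRAM_GB)
    print(f"Llama-70B @ {label}: ~{weights_gb:.0f} GB of weights -> at least {n} cards")

# fp16: ~140 GB -> ~547 cards; int8: ~70 GB -> ~274 cards.
# Versus a couple of 80 GB HBM GPUs for the same model, which is the
# "can't scale down" point above.
```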
The problem is that their costs seem to be 1x to 2x what they're charging.