
I think it would be more interesting to do this with smaller models (33B-70B) and see if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the same level as 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this.


Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 t/s. High memory bandwidth is just too expensive. Strix Halo might give you >5 t/s once it comes out, but it will probably be significantly more than $1k for 64 GB RAM.
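For a rough sense of why bandwidth is the bottleneck, here's my own back-of-envelope (spec-sheet bandwidth figures, real throughput lands lower): every generated token has to stream the whole dense model through memory once, so bandwidth divided by model size gives an upper bound on t/s.

    # Rough upper bound: each generated token streams every weight of a dense model
    # through memory once, so tokens/sec <= bandwidth / model size in bytes.
    def max_tok_per_sec(bandwidth_gb_s, params_b, bytes_per_param=0.5):  # 0.5 ~= Q4
        return bandwidth_gb_s / (params_b * bytes_per_param)

    print(max_tok_per_sec(936, 70))  # RTX 3090 (~936 GB/s), if a 70B Q4 fit: ~27 t/s
    print(max_tok_per_sec(90, 70))   # dual-channel DDR5 (~90 GB/s): ~2.6 t/s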


With used GPUs do you have to be concerned that they're close to EOL due to high utilization in a Bitcoin or AI rig?


I guess it will be a bigger issue the longer it's been since they stopped making them, but most people I've heard from (including me) haven't had any issues. Crypto rigs don't necessarily break GPUs faster, because miners care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad it to keep the cooling under control.


awesome thanks!


GPUs were last used for Bitcoin mining in 2013, so you shouldn't be concerned unless you are buying a GTX 780.


M4 Mac with unified GPU RAM

Not very cheap though! But you get a quite usable personal computer with it...


Any that can run 70B at >5 t/s are >$2k as far as I know.


How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while


You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
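If anyone wants to try partial offload, here's a minimal sketch with llama-cpp-python (the GGUF filename is a placeholder and the layer count is just a guess you'd tune until your VRAM is full):

    from llama_cpp import Llama

    # n_gpu_layers controls partial offload: that many transformer layers go to the
    # GPU, the rest stay on CPU. -1 means offload all layers.
    llm = Llama(
        model_path="llama-70b-q4_k_m.gguf",  # placeholder, ~40 GB file
        n_gpu_layers=40,                     # roughly what a 24 GB 3090 holds
        n_ctx=4096,
    )
    out = llm("Explain partial offload in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])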


So the more GPUs we have, the faster it will be, and we don't have to run the model solely on CPU or GPU -- it can be combined. Very cool. I think that's how it's running now with my single 4090.


Umm, two 3090s? Additional cards scale as long as you have enough PCIe lanes.


I arbitrarily chose $1k as the "cheap" cut-off. Two 3090s are definitely the most bang for the buck if you can fit them.


Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64 GB and it can run up to 70B models at very useful speeds. Newer M series are going to be faster, and they offer more RAM now.

Are there any laptops other than the larger M series Macs that can run 30-70B LLMs at usable speeds, while also having useful battery life and not sounding like a jet taxiing to the runway?

For non-portables I bet a huge desktop or server CPU with fast RAM beats the Mac Mini and Studio for price/performance, but I'd be curious to see benchmarks comparing fast many-core CPU performance to a large M series GPU with unified RAM.


As a data point: you can get an RTX 3090 for ~$1.2k, and it runs deepseek-r1:32b perfectly fine via Ollama + Open WebUI at ~35 tok/s, in an OpenAI-like web app and basically as fast as 4o.
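If you want to script against that kind of setup rather than use the web UI, Ollama also exposes an OpenAI-compatible endpoint on port 11434; something like this should work, assuming the model tag is already pulled:

    # pip install openai -- pointed at Ollama's OpenAI-compatible API
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored
    resp = client.chat.completions.create(
        model="deepseek-r1:32b",
        messages=[{"role": "user", "content": "Why does quantization speed up local inference?"}],
    )
    print(resp.choices[0].message.content)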


You mean Qwen 32B fine-tuned on DeepSeek :)

There is only one model of DeepSeek (671B); all the others are fine-tunes of other models.


> you can get an RTX 3090 for ~$1.2k

If you're paying that much you're being ripped off. They're $800-900 on eBay and IMO are still overpriced.


It will be slower for a dense 70B model, since DeepSeek is an MoE that only activates 37B parameters at a time. That's what makes CPU inference remotely feasible here.
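Back-of-envelope for why the MoE part matters (my rough numbers; you still need enough RAM to hold all 671B weights, but each decoded token only reads the active ones):

    # Per token, a dense 70B reads ~70B weights; DeepSeek R1 (671B MoE) reads only
    # ~37B active weights. At ~Q4 (0.5 bytes/param) on ~90 GB/s dual-channel DDR5:
    bw = 90  # GB/s, approximate
    for name, active_b in [("dense 70B", 70), ("R1 MoE (37B active)", 37)]:
        print(f"{name}: ~{bw / (active_b * 0.5):.1f} t/s upper bound")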


Would it be something like this?

> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi

https://x.com/geerlingguy/status/1884994878477623485

I haven't tried it myself or verified the claims, but it seems exciting at least.


That's 1.2 t/s for the 14B Qwen finetune, not the real R1. You can add a GPU at extra cost, but hardly anyone besides Jeff Geerling is going to run a dedicated GPU on a Pi.


It's using a Raspberry Pi with a... USD $1k GPU, which kinda defeats the purpose of using the RPi in the first place, imo.

Or well, I guess you save a bit on power usage.


I suppose it makes sense, for extremely GPU-centric applications, that the Pi be used essentially as a controller for the 3090.


Oh, I was naive to think that the Pi was capable of some kind of magic (sweaty smile emoji goes here)


I put together a $350 build with a 3060 12GB and it's still my favorite build. I run Llama 3.2 11B Q4 on it, and it's a really efficient way to get started; the t/s is great.
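Rough numbers on why that fits comfortably in 12 GB (a 4-bit quant is roughly half a byte per parameter, plus some overhead for context cache):

    # Does an 11B model at Q4 fit in a 12 GB 3060? Rough estimate only.
    params_b = 11            # billions of parameters
    bytes_per_param = 0.56   # ~Q4_K_M, slightly above 0.5 due to mixed precision
    overhead_gb = 2          # context cache + buffers, rough guess
    needed_gb = params_b * bytes_per_param + overhead_gb
    print(f"~{needed_gb:.1f} GB needed vs. 12 GB VRAM")  # ~8.2 GB -> fits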


You can run smaller models on a MacBook Pro with Ollama at those speeds. Even with several $3k GPUs it won't come close to 4o level.



