I think it would be more interesting doing this with smaller models (33b-70b) an...

gliptic · 2025-02-01T12:42:09 1738413729

Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there's currently any cheap options for 70B that would give you >5. High memory bandwidth is just too expensive. Strix Halo might give you >5 once it comes out, but will probably be significantly more than $1k for 64 GB RAM.

ants_everywhere · 2025-02-01T13:44:52 1738417492

With used GPUs do you have to be concerned that they're close to EOL due to high utilization in a Bitcoin or AI rig?

gliptic · 2025-02-01T14:59:04 1738421944

I guess it will be a bigger issue the longer it's been since they stopped making them, but most I've heard (including me) haven't had any issue. Crypto rigs don't necessarily break GPUs faster because they care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad them to keep the cooling under control.

ants_everywhere · 2025-02-01T15:50:27 1738425027

awesome thanks!

EVa5I7bHFq9mnYK · 2025-02-11T21:19:49 1739308789

GPUs were last used for Bitcoin mining in 2013, so you shouldn't be concerned unless you are buying a GTX 780.

pmarreck · 2025-02-01T13:52:59 1738417979

M4 Mac with unified GPU RAM

Not very cheap though! But you get a quite usable personal computer with it...

gliptic · 2025-02-01T15:01:52 1738422112

Any that can run 70B at >5 t/s are >$2k as far as I know.

jjallen · 2025-02-01T13:14:23 1738415663

How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while

Gracana · 2025-02-01T13:44:24 1738417464

You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.

jjallen · 2025-02-01T14:15:21 1738419321

So the more GPUs we have the faster it will be and we don't have to have the model run solely CPU or GPU -- it can be combined. Very cool. Think that is how it's running now with my single 4090.

ynniv · 2025-02-01T14:33:02 1738420382

Umm, two 3090's? Additional cards scale as long as you have enough PCIe channels.

gliptic · 2025-02-01T14:51:12 1738421472

I arbitrarily chose $1k as the "cheap" cut-off. Two 3090 is definitely the most bang for the buck if you can fit them.

api · 2025-02-01T13:02:00 1738414920

Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64G and it can run up to 70B models at very useful speeds. Newer M series are going to be faster and they offer more RAM now.

Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?

For non-portables I bet a huge desktop or server CPU with fast RAM beats the Mac Mini and Studio for price performance, but I’d be curious to see benchmarks comparing fast many core CPU performance to a large M series GPU with unified RAM.

jenny91 · 2025-02-01T13:58:18 1738418298

As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + open webui at ~35 tok/s in an OpenAI-like web app and basically as fast as 4o.

kevinak · 2025-02-01T14:12:16 1738419136

You mean Qwen 32b fine-tuned on Deepseek :)

There is only one model of Deepseek (671b), all others are fine-tunes of other models

driverdan · 2025-02-01T14:18:37 1738419517

> you can get an RTX 3090 for ~$1.2k

If you're paying that much you're being ripped off. They're $800-900 on eBay and IMO are still overpriced.

bick_nyers · 2025-02-01T16:09:11 1738426151

It will be slower for a 70b model since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.

firtoz · 2025-02-01T12:37:50 1738413470

Would it be something like this?

> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi

https://x.com/geerlingguy/status/1884994878477623485

I haven't tried it myself or haven't verified the creds, but seems exciting at least

gliptic · 2025-02-01T12:46:35 1738413995

That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you go with the GPU with the extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.

etra0 · 2025-02-01T14:13:45 1738419225

it's using a Raspberry Pi with a.... USD$1k gpu, which kinda defeat the purpose of using the RPI in the first place imo.

or well, I guess you save a bit on power usage.

unethical_ban · 2025-02-06T18:34:51 1738866891

I suppose it makes sense, for extremely GPU centric applications, that the pi be used essentially as a controller for the 3090.

firtoz · 2025-02-01T15:29:46 1738423786

Oh, I was naive to think that the Pi was capable of some kind of magic (sweaty smile emoji goes here)

spaceport · 2025-02-01T18:48:36 1738435716

I put together a $350 build with a 3060 12GB and its still my favorite build. I run llama 3.2 11b q4 on it and its a really efficient way to get started and the tps is great.

Svoka · 2025-02-01T16:20:51 1738426851

You can run smaller models on MacbookPro with ollama with those speeds. Even with several 3k GPUs it won't come close to 4o level.