
I have an A6000; it's about the most affordable option for 48 GB of VRAM (you can find one for a little under $5k sometimes), which is roughly the minimum to run a quantized 70B (rough math below).

System RAM doesn’t really matter, but I have 128GB anyway as RAM is pretty cheap.
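For anyone who wants the rough math behind the 48 GB figure, a quick back-of-envelope sketch (illustrative numbers only, weights alone, ignoring KV cache and runtime overhead):

    # Back-of-envelope VRAM estimate for a quantized 70B model.
    # Illustrative assumption: 1 GB = 1e9 bytes, weights only.
    def weight_memory_gb(n_params_billion, bits_per_param):
        return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"{bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB for weights")

    # 16-bit: ~140 GB  (no chance on 48 GB)
    # 8-bit:  ~70 GB   (still doesn't fit)
    # 4-bit:  ~35 GB   (fits, leaving ~13 GB for KV cache and overhead)

So a 4-bit 70B is about the largest thing that comfortably fits on a single 48 GB card.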




Why not 2x 4090? They'd be cheaper than an A6000 if you can manage to find them at MSRP, and would perform a lot better.


My time is worth a lot of money and 2x 4090 is more work, so it’s net more expensive in real terms.


For both inference and training, I haven't seen any modern LLM stack take meaningfully more setup time for multiple GPUs / tensor parallelism.

I would take one RTX 6000 Ada, but if you mean the pre-Ada A6000, 2x 4090 is faster with minimal hassle for most common use cases.
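To make the "minimal hassle" point concrete, here's roughly what multi-GPU inference looks like in a modern stack. This is just a sketch assuming vLLM and an AWQ-quantized 70B checkpoint (the model name and settings are examples, not a recommendation); tensor parallelism across two cards is a single argument:

    # vLLM sketch: tensor parallelism across 2 GPUs is one argument.
    # Model name and quantization settings are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",  # example checkpoint
        quantization="awq",           # assumes an AWQ build of the weights
        tensor_parallel_size=2,       # shard the weights across both cards
        gpu_memory_utilization=0.90,
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    print(out[0].outputs[0].text)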


I mean the newest ones. I only do LLM inference, whereas my training load is all DistilBERT models and the A6000 is a beast at cranking those out.

Also, by “time” I mean my time setting up the machine and doing sysadmin work. A single card is less hassle.
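A minimal sketch of that kind of DistilBERT fine-tuning job, assuming Hugging Face Transformers (dataset, label count, and hyperparameters are placeholders):

    # Minimal DistilBERT fine-tuning sketch (Hugging Face Transformers).
    # Dataset, label count, and hyperparameters are placeholder assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    ds = load_dataset("imdb").map(tokenize, batched=True)  # example dataset

    args = TrainingArguments(
        output_dir="distilbert-out",
        per_device_train_batch_size=64,  # plenty of headroom on a 48 GB card
        num_train_epochs=3,
        fp16=True,
    )

    Trainer(model=model, args=args, train_dataset=ds["train"]).train()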


The A6000 predates Ada?

There is the RTX 6000 Ada (practically unrelated to the A6000), which has 4090-level performance; is that what you're referring to?



That's an Ampere A6000, one generation older than the RTX 6000 Ada. Nvidia decided that confusing model names are a good way to sell old products at a premium.


Running llama3.3:70b here on a pair of eBay Dell RTX 3090s in an old (2012!) i7-3770 workstation - ollama reports 16.67 tokens/sec.
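If anyone wants to reproduce that number, a rough sketch of pulling tokens/sec out of ollama's generate API (the prompt is arbitrary; assumes a local server with the model already pulled):

    # Rough tokens/sec from ollama's generate API (non-streaming).
    # Assumes a local ollama server with llama3.3:70b already pulled.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.3:70b",
              "prompt": "Summarize RAID levels in two sentences.",
              "stream": False},
    )
    data = r.json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    print(f"~{data['eval_count'] / (data['eval_duration'] / 1e9):.2f} tokens/sec")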



