My deep learning rig (2022) (nonint.com)
166 points by jacquesm on Aug 15, 2023 | 82 comments



I have pretty much the same setup, but with just 4 3080s in one machine. If you go to rent something from vast.ai, you can see the hardware specs of the most performant machines; I just copied those and pieced together the rest. Learning about PCIe versions and ports was very interesting, as I thought I could use another motherboard for a similar purpose, but just because a board has four PCIe 3.0 slots doesn't mean they're all usable at x16 AT THE SAME TIME!
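In case it helps anyone checking their own board, here's a rough sketch (assuming the pynvml Python bindings and NVIDIA cards; not something from the original post) that prints the negotiated PCIe generation and lane width per GPU. Note the link can downshift at idle, so check it under load:

    # Report current vs. maximum PCIe link for each GPU.
    # Assumes `pip install nvidia-ml-py` (pynvml) and an NVIDIA driver.
    import pynvml

    pynvml.nvmlInit()
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)  # may be bytes on older pynvml versions
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
        max_gen = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
        cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
        max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
        print(f"GPU {i} ({name}): PCIe gen {cur_gen}/{max_gen}, width x{cur_w}/x{max_w}")
    pynvml.nvmlShutdown()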

I bought 21 Alienware Aurora R12s and turned a side office in my garage into a hot-ass crypto farm with two swamp coolers, four 20A circuits, a 15A circuit, and a bunch of surge protectors. I was always afraid to move anything for fear I'd overload a circuit, given the effort of working out which circuit was powering which machine(s).

I gave away the 3080s to friends and family and kept the 6 3090s.


I have a dumb question - what do you do with it? And why is it better to have your own hardware for it?

Will you ever see positive return on investment from building this or is it feeding the nerd chute?


Oh yeah, some days I was making $600-800 profit a day! But most days it was ~$100-$200/day. The problem was that "the merge" to Ethereum 2 was fast approaching, but it was luckily delayed 1.5 years, so I was able to pull ~$80k profit during that time. It was awesome. But now I have a couple of holes in my garage office with big fans installed (one at the bottom for intake and one at the top for exhaust, essentially turning that room into a big computer tower).


> what do you do with it?

>> crypto


QuantFinance.

Money in the stock market (or the crypto market, which is similar) is probably the only way to justify tens of thousands of dollars worth of GPUs that become obsolete quickly.

The other thing I can think of is research scientists who use it for research. But the funded research scientists I know generally don't get hardware to take home; they usually get tons of cloud credits or time on a supercomputer to do their stuff.


> tens of thousands of dollars worth of GPUs that become obsolete quickly

I sold a 4yo gpu for more than I bought it for not long ago.

It's the only thing in a pc that doesn't depreciate much these days, besides maybe the case.


They wrote crypto in the original comment.


Ah okay, missed it. So not QF. I was hoping someone was doing QF at home :P


If my hardware can do it, point me to how I can do it! I'll give it a try and let you know how you changed my life. ;)


Do you have a direct phone line to your local power company, and do you have to ask them for permission, before turning your rig on? :-)


Well, I can say that at $0.0875/kWh, it was a steal. I don't want to draw any conclusions about why they recently raised electricity prices for everyone by 25%... But I'm suspicious that I and others like me may have contributed.


Swamp coolers for electronics?


Consumer electronics are designed for consumer living conditions. In many parts of the world, indoor humidity can sit over 60%, all year long. In a dry climate, you could get some significant cooling, and still be under what you'll find in my house.


Makes sense thanks


Phase change based cooling, especially when combined with a natural source of low temp air (such as a large basement) can be remarkably effective. Your typical DC has much too high a power density to be able to utilize this but for a single large rig in a home it could work very well.


In a dry climate the increased humidity still barely moves the needle and they’re extremely cheap to build and operate.


The humidity (in Utah) in that room was like 4%. Datacenters should be kept at somewhere between ~45% and 80% humidity, depending on which source you read, to prevent electrostatic discharge. It was way cheaper than installing and running a mini-split. Plus, I NEEDED the added humidity.


> One of my EPYC’s is a retail model, the other is a QS model.

What's a "QS" model?

Okay, I searched and it means "Qualification Sample", i.e. a grey market, non-production grade CPU.

Never encountered this initialism before, despite being a CPU aficionado. Hope this saves you some frustration!


ES (Engineering Sample) is another one you come across from time to time.


Looks like I can get a $3000 MSRP EPYC 4th gen 9334 for $750. I can see why people buy them even if it’s not strictly legal.


You're also taking a risk in assuming it'll get things like microcode updates, which matter given the constant churn of new speculative-execution attacks, and that it won't have other quirks that cause those impossible-to-track-down occasional instability issues.


but will it run crysis


> This is because without dropping serious $$$ on mellanox high-speed NICs and switches, inter-server communication bandwidth quickly becomes the bottleneck when training large models. I can’t afford fancy enterprise grade hardware, so I get around it by keeping my compute all on the same machine. This goal drives many of the choices I made in building out my servers, as you will see.

10gbe is very cheap now, but I guess that's not enough?


Yeah, you need 100GbE at minimum; 10GbE is too little. Even PCIe bandwidth can be a bottleneck, and a PCIe 3.0 x16 link (~16GB/s) is already roughly in 100GbE territory.

BTW, to echo the author: PSUs and U.S. 120V circuits are a major reason why I'm limiting myself to 4 GPUs. Also, the 3090 still has NVLink support, so I'm wondering why the author hasn't set that up. From what I've experienced, NVLink does help if you run data-parallel training.


You can do 100GbE for about $150 a port in switch cost (new); you sometimes see ConnectX-5 cards on eBay for about $100-150 a port (used). I’ve got a fairly good amount of 100GbE in my homelab. Pretty affordable in 2023.

200GbE and 400GbE are still totally unaffordable for anything remotely personal, IMO.


Do you have any examples of nvlink improving performance for 3090 with vanilla pytorch DDP? Or are you talking some other training impl?


It is model-dependent. I've seen that (NVLink benefits) when comparing against a PCIe 3.0 connection, with a small batch size and no gradient accumulation.

Once you have a larger batch size and gradient accumulation, I believe DDP won't be improved much by NVLink (the all-reduce traffic on gradients will be small compared to your computation time).
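For what it's worth, here is a minimal sketch of the pattern I mean (vanilla PyTorch DDP with gradient accumulation via no_sync(); my_model, train_loader, local_rank, and accum_steps are placeholders, and process-group setup is omitted):

    # DDP + gradient accumulation: no_sync() skips the gradient all-reduce
    # on the non-final micro-batches, so the interconnect only sees one
    # all-reduce per accumulation window instead of one per micro-batch.
    import contextlib
    import torch
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP

    accum_steps = 8  # made-up value
    model = DDP(my_model.cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step, (x, y) in enumerate(train_loader):
        sync_now = (step + 1) % accum_steps == 0
        ctx = contextlib.nullcontext() if sync_now else model.no_sync()
        with ctx:
            loss = F.cross_entropy(model(x.cuda(local_rank)), y.cuda(local_rank))
            (loss / accum_steps).backward()
        if sync_now:
            opt.step()
            opt.zero_grad()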


I've literally not seen any real improvement (i.e., more than what you'd get from buying another card) with NVLink in anything I've come across online over the past ~year.

I'd honestly be a bit skeptical.


Yeah, I am talking about ~10% performance differences for a specific model (I believe it was when benchmarking a vanilla seq2seq model from the "Attention Is All You Need" paper, with a small batch size and no gradient accumulation).

An NVLink bridge is about $100, so for these cases you shouldn't expect "more than buying another card" levels of improvement.


I have seen some benefit on A100s only, but am unfamiliar with how to get any NVLink gains on RTX cards. Monitoring vanilla PyTorch DDP in nvtop on RTX, I haven't seen PCIe bus transfer speeds approaching the theoretical max. The OP uses bifurcators in his 8-GPU box, so OP does not seem to be bus-limited.


I would additionally be skeptical that there's any consumer, or even fire-decal (gamer-brand) hardware, that actually delivers 10GbE consistently across several nodes.

Enthusiast / small business / entry level enterprise gear will get you there, but you’re looking at several hundred dollars per port.


10GbE can be pretty cheap on the PC side with used hardware ($30 for an X540), though it's mostly SFP and not multi-gig. Even generic PCIe cards can add an RJ45 10G multi-gig port for ~$50. The switch/router side is where I find it gets expensive.


Couldn't you use a 240V dryer socket for that purpose? That should get you 7200 Watts on a 30A circuit.


You might need a second 240V outlet for the air conditioner to evacuate that much heat.


True, but when you operate 7 GPUs on one board I reckon keeping a good eye on your thermals is where it starts. Otherwise you can kiss your GPUs goodbye well before you reach the payback moment.

Edit: and you'd have to factor in that cooling power as well into the running costs.


They might not be using NVLink because they're selling on vast.ai, where someone could rent a single GPU out of the four available. No idea if there is some cross-user security implication with NVLink in that scenario.


If these server boards support Thunderbolt AICs, and I believe they might since my Threadripper Pro board does, daisy-chaining them together could get you 40Gbps fairly easily, if that is sufficient.


100GbE Mellanox ConnectX cards are not actually that expensive, though.


You'd need a switch too, unless you're going point-to-point, but that will eat up PCIe slots that you probably would like to use for GPUs.


MikroTik makes some pretty nice hardware for rather cheap. They now offer a 4x100Gbps switch for $800, which is a darn good price.


Indeed, at $200/port that's really neat if that's all you need. Funny, I remember paying close to $1000/port for 100 Mbps not all that long ago :)


You really want GPU RDMA though. It's a bit of a pain to get set up, but it's worth it.


Not on the 3090 though, is it?


What's your opinion on the coming (or not) GPU crunch?

When looking at cloud GPU availability and current trends (no one except some enthusiasts and big tech is finetuning and serving at large scale yet, and results keep getting better and better), I fear we will run into a situation where GPUs are extremely expensive and hard to come by until supply catches up.

I ordered a high end PC with 4090 for the first time in years (normally would always prefer cloud even if more expensive) because I want to be on the safe side. What do you think, is this irrational and just a bubble thing?


It's a temporary crunch while TSMC expands capacity. They were reluctant to invest in capacity during the crypto crunch because they (rightly) understood that to be a temporary spike in demand. This time, they (also rightly) recognize AI as a longer-term shift in demand, and they're busy retooling to meet it.

The real question is whether NVIDIA will maintain its lock on the market, or if vendor-agnostic Torch will help commoditize the segment.


I am also wondering whether I should build a rig as a bet against expensive GPU prices.

My fear is that Nvidia is trying to anchor GPU prices at crypto-era levels whether or not there is a TSMC shortage.


"thanks Facebook" was not a term I have ever actually wanted to utter.


> vendor-agnostic Torch

Is this a competitor to CUDA?


CUDA runs at a lower layer than Torch. There is not much competition to CUDA if one wants performance, but if one just wants to run things there is the CPU (OpenBLAS etc. and Intel's MKL) and ONNX.

It's pretty astonishing to me that AMD has neither a proper math library (like MKL) nor a GPU compute library (like CUDA).


Without a fully connected NVLink network, the 3090s will be underutilized for models that distribute the layers across multiple GPUs.

If AMD were better supported, it would be most economical to use 4x MI60s for 128GB using an Infinity Fabric bridge. However, to make it to the end of such a journey, you would really have to know what you're doing.


The bifurcated risers mean that some of the cards are only running at PCIe x8 speed as well, and they mention they are only working with PCIe 3.0, not 4.0.

This would severely limit training using model parallelism.

For data parallel where the full model fits on each card and the batch size is just increased it wouldn't matter as much, and maybe that is the primary use for this.

I wonder how this is dealt with on vast.ai rentals, because there is a huge difference between needing 7x 3090s with all 168GB to load the weights of a single giant LLM vs. just wanting to run 4GB Stable Diffusion inference in parallel with a massive batch size...
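To make the contrast concrete, here is a toy sketch (not the OP's actual setup; the layer count and width are made up) of naive model parallelism in PyTorch. The intermediate activations cross the PCIe link on every forward and backward pass, which is exactly where x8 risers hurt, whereas data parallel only touches the bus for gradient syncs:

    # Toy model parallelism: half the layers on cuda:0, half on cuda:1.
    # Requires two CUDA devices; activations ship across the bus each pass.
    import torch
    import torch.nn as nn

    class TwoGPUNet(nn.Module):
        def __init__(self, dim=4096, layers=8):
            super().__init__()
            half = layers // 2
            self.part0 = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(half)]).to("cuda:0")
            self.part1 = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(half)]).to("cuda:1")

        def forward(self, x):
            x = self.part0(x.to("cuda:0"))
            x = self.part1(x.to("cuda:1"))  # activation transfer over the interconnect
            return x

    model = TwoGPUNet()
    out = model(torch.randn(32, 4096))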


I don't see much in terms of differentiation based on topology and interconnect, I also searched the FAQ. Maybe I'm missing something?


It completely depends on what size model you are training and on the topology; it may not make a difference for you, but for reference:

See here [0] about just the difference between having NVLink between cards or not: a 23% increase in training speed, and the note that the peak bandwidth between 2x 3090s with the link is 112.5 GB/sec.

[0]: https://huggingface.co/docs/transformers/v4.31.0/en/perf_har...

Now look at PCIe 3.0 speeds, which are what any two cards talking to each other would need to use through your risers: only 15.754 GB/s at x16 and only 7.877 GB/s if you are on an x8 riser.

For some non-ML things that I use GPUs for (CFD), the interconnect/memory-access bandwidth is the bottleneck, and the simulation time literally scales near-linearly with the PCIe bandwidth I have between the CPU lanes and the cards.
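If you want to sanity-check your own riser topology, a rough micro-benchmark like the one below (just timed PyTorch tensor copies between two cards; the 1 GiB buffer and repeat count are arbitrary) shows the GPU-to-GPU bandwidth you actually get:

    # Rough GPU-to-GPU copy bandwidth test; needs two CUDA devices.
    import time
    import torch

    n_bytes = 1 << 30  # 1 GiB payload
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

    for _ in range(3):  # warm-up copies
        dst.copy_(src)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")

    reps = 10
    t0 = time.perf_counter()
    for _ in range(reps):
        dst.copy_(src)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    elapsed = time.perf_counter() - t0
    print(f"{reps * n_bytes / elapsed / 1e9:.1f} GB/s")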


Hm, that makes me wonder whether vast.ai does some kind of testing before dropping particular workloads on a machine, so that this can be handled without configuration. But there doesn't seem to be a financial incentive for suppliers of GPU capacity to provide good interconnects.


vast.ai does denote GPUs as either PCIe or SXM (SXM for the A100s and now H100s up there). The SXM bandwidth between GPUs is the full n-way NVLink, the best you can get.

They also have a little stat that lists 'per-GPU' bandwidth in GB/s and what PCIe version and speed is being used, so they must run some tests beforehand to gauge this. When I look on there now, it varies from setup to setup: I see people running quad 4090s on PCIe 4.0 x16 with 24GB/s bandwidth between them, some people running x8, some on PCIe 3.0 with 11 GB/s, and even someone with quad 3090s all on PCIe 2.0 x1 slots with the bandwidth reading 0.3GB/s!!! (likely an old mining rig with those x1 slots)


Do they get more money per instance or is that simply a matter of seeing more utilization?


I've never rented on there, so I'm not sure. It seems like the prices are set by the sellers, but likely with some guidance from their performance tool?

They have an overall "DLPerf" deep learning performance score, and it seems like if you want to get users, you set your price per hour at something competitive with the market given that metric.

For instance, the guy with the quad 3090s on x1 slots actually has an hourly price set HIGHER than everyone else... even with a horrible DLPerf score. No one is using that, ever.


Interesting... I should try some workloads to see what it can do in practice. Beats having a whole bunch of fans in my room and it seems to be cheaper than the other cloud GPU providers.


It was primarily being used to train TTS models (see https://github.com/neonbjb/tortoise-tts), which largely fit into a single GPU's memory. So, for data parallelism, x8 PCIe isn't that much of a concern.


What kind of factor would that be?


I am super frustrated that AMD doesn't make an "ML Edition" 48GB 7900. They don't have much to lose, so why not throw down the gauntlet?

Doubly so for Intel. They literally have no pro market to lose with a 32GB A770, and everything to gain from momentum for their stack.


The Radeon Pro W7900 is 48 GB. https://www.amd.com/en/products/professional-graphics/amd-ra...

The W7900 is officially supported by ROCm on Windows. On Linux, the W7900 is enabled, though not officially supported.


For ~$4000.

That's not A/H100-terrible, but it's still out of reach for many nonprofessional ML users/tinkerers.


It's roughly on par with NVIDIA RTX A6000, which also has 48GB VRAM.


...Yes, and that's the problem. As a reminder, the A6000 is basically a double-RAM 3090 Ti, a 2020-era GPU, going for $4K.

Nvidia is Nvidia and needs to preserve its pro-tier stratification. But AMD has less to lose, and Intel has nothing to lose, by pricing them more like gaming cards.


Probably because they want people to buy the W/MI series cards instead.

Though looking around, you can get an MI60 on eBay for $500, which seems really good for 32GB and still seems to be supported by ROCm (it's basically an MI50 with more HBM). It looks to be the cheapest way of getting a GPU with that amount of memory, though things like FP16 speed and BF16 support suffer compared to later generations. From what I've seen, most "home" ML tasks hit memory limits well before ALU limitations kick in. And I have no idea whether a home user would be able to use the IF links to help with multi-GPU either.


Interesting, wonder what the actual income from vast.ai looked like.


Likewise. Based on the costs listed on their page, I'd say no more than $0.80/hour or so, assuming a 50% gross margin for vast.ai.

And that has to cover energy costs, so I assume the OP has a cheap source of power. Here in NL I could not do this profitably; even with solar power it would be more efficient to sell that power to the grid than to use it to drive a GPU rig.
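Back-of-the-envelope, with every number below a guess rather than the OP's actual figures (the $0.80/hour from above, ~350W per 3090, the $0.0875/kWh quoted upthread, and a rough 2023 NL rate):

    # Hosting profitability sketch; all inputs are assumptions.
    gpus = 7
    revenue_per_hour = 0.80            # USD for the whole rig, guessed above
    rig_power_kw = gpus * 0.35 + 0.5   # ~350W per 3090 plus the rest of the box
    rates = {"Utah": 0.0875, "NL (rough 2023)": 0.40}  # USD per kWh

    for place, per_kwh in rates.items():
        profit = revenue_per_hour - rig_power_kw * per_kwh
        print(f"{place}: ~${profit:+.2f}/hour at full utilization")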


The vast.ai FAQ claims they take 25% of the hourly rate: https://vast.ai/faq#Hosting-General


> PSUs: 3x EVGA 1600W G+

With the proper plumbing, you could hook up your water heater to it as well.


Hehe, spot the Dutchman ;)

Co-generation is in fact a pretty good idea when you start running large computers at home. But in summer...


> Hehe, spot the Dutchman ;)

Guilty as charged.


Copper wire was invented in the Hague in 1880, when two residents ended up fighting over a cent...


Anyone know what kind of performance, in terms of tokens per second, one could expect with such a system running llama-2-70b?


That system is for training; it will probably spew tokens like a Terminator's machine gun.


We barely got done with mining craziness and now we have DL/LLM craziness..


There is an actual difference. You can make actual, real-world impacts with today's LLMs in the existing business world. They can actually make a difference for customers.

That was never the case with a crypto currency.


> That was never the case with a crypto currency.

That's a flat out lie and you know it.


Gamers can breathe easy though, because LLMs require cards with maxed-out GPU memory, while gamers are fine with smaller-memory cards.


To a point. All those gaming GPUs with 8GB are starting to reach their limits.


Until games start bundling 7B models to drive NPC dialogue.


[flagged]


Not sure what your context is here, but I delete comments from time to time. It works fine as long as you get to it reasonably quickly and nobody has replied yet.

From the FAQ:

"What does [deleted] mean?

The author deleted the post outright, or asked us to. Unlike dead posts, these remain deleted even when showdead is turned on."

https://news.ycombinator.com/newsfaq.html

[edit: formatting]



