I have pretty much the same setup, but with just four 3080s in one machine. If you go to rent something from vast.ai, you can see the hardware specs of the most performant machines. I just copied those and pieced together the rest. Learning about PCIe versions and lane allocation was very interesting, as I thought I could use another motherboard for a similar purpose, but just because a board has four PCIe 3.0 slots, it doesn't mean they're all usable at x16 AT THE SAME TIME!
I bought 21 R12 Aurora Alienwares and turned a side office in my garage into a hot-ass crypto farm with two swamp coolers, four 20A circuits, a 15A circuit, and a bunch of surge protectors. I was always afraid to move anything for fear that I'd overload a circuit, given the effort it took to work out which power supply was feeding which machine(s).
I gave away the 3080s to friends and family and kept the 6 3090s.
Oh yeah, some days I was making $600-800 profit a day! But most days it was ~$100-$200/day. The problem was that "the Merge" to Ethereum 2.0 was fast approaching, but it was luckily delayed about 1.5 years, so I was able to pull ~$80k profit during that time. It was awesome. But now I have a couple of holes in my garage office with big fans installed (one at the bottom for intake and one at the top for exhaust), essentially turning that room into a big computer tower.
Making money in the stock market (or the crypto market, similarly) is probably the only way to justify tens of thousands of dollars' worth of GPUs that become obsolete quickly.
The other thing I can think of is research scientists. But the funded research scientists I know generally don't get hardware to take home. They usually get either tons of cloud credits or time on a supercomputer to do their stuff.
Well, I can say that at $0.0875/kWh, it was a steal. I don't want to draw any conclusions about why they recently raised electricity prices for everyone by 25%... but I'm suspicious that I and others like me may have contributed.
Consumer electronics are designed for consumer living conditions. In many parts of the world, indoor humidity sits above 60% all year long. In a dry climate, you could get significant evaporative cooling and still stay below what you'll find in my house.
Phase-change cooling, especially when combined with a natural source of low-temperature air (such as a large basement), can be remarkably effective. Your typical DC has much too high a power density to utilize this, but for a single large rig in a home it could work very well.
The humidity (in Utah) in that room was like 4%. Datacenters should be kept somewhere between roughly 45% and 60-80% relative humidity, depending on which source you read, to prevent electrostatic discharge. It was way cheaper than installing and running a mini-split. Plus, I NEEDED the added humidity.
You're also taking a risk in assuming it'll keep getting things like microcode updates, which matter given the constant churn of new speculative-execution attacks, or in hitting other problems that cause those impossible-to-track-down occasional instability issues.
> This is because without dropping serious $$$ on mellanox high-speed NICs and switches, inter-server communication bandwidth quickly becomes the bottleneck when training large models. I can’t afford fancy enterprise grade hardware, so I get around it by keeping my compute all on the same machine. This goal drives many of the choices I made in building out my servers, as you will see.
10GbE is very cheap now, but I guess that's not enough?
Yeah, you need 100GbE at a minimum. 10GbE is far too little; PCIe bandwidth itself can already be a bottleneck, and a PCIe 3.0 x16 link is only about 16 GB/s, which is roughly 100GbE territory.
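Rough back-of-envelope, as a quick Python sketch (nominal line rates only, ignoring protocol overhead, just to show the ratio):

    # Nominal figures: PCIe 3.0 x16 usable bandwidth vs Ethernet line rates
    pcie3_x16_gb_s = 15.75                       # GB/s
    for name, gbits in [("10GbE", 10), ("100GbE", 100)]:
        gb_s = gbits / 8                         # convert Gbit/s to GB/s
        print(f"{name}: {gb_s:.1f} GB/s, i.e. {gb_s / pcie3_x16_gb_s:.0%} "
              f"of a PCIe 3.0 x16 link ({pcie3_x16_gb_s} GB/s)")

So 10GbE is under a tenth of what a single x16 slot can move, while 100GbE is in the same ballpark.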
BTW, to echo the author: PSU limits, and 120V circuits here in the U.S., are a major reason why I'm limiting myself to 4 GPUs. Also, the 3090 still has NVLink support, so I'm wondering why the author hasn't set that up. From what I've experienced, NVLink does help if you run data-parallel training.
You can do 100GbE for about $150 a port in switch cost (new); you sometimes see ConnectX-5 cards on eBay for about $100-150 a port (used). I’ve got a fairly good amount of 100GbE in my homelab. Pretty affordable in 2023.
200GbE and 400GbE are still totally unaffordable for anything remotely personal, IMO.
It is model-dependent. I've seen that (NVLink benefits) when comparing against a PCIe 3.0 connection, with a small batch size and no gradient accumulation.
Once you have a larger batch size and gradient accumulation, I believe DDP won't be improved much by NVLink (the all-reduce traffic for the gradients becomes small compared to your computation time).
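For anyone curious what that looks like in practice, here's a minimal vanilla PyTorch DDP sketch (model, optimizer, loader, loss_fn, and local_rank are placeholders, and it assumes torch.distributed is already initialized, e.g. via torchrun): with no_sync() the gradient all-reduce only fires on the last micro-batch, so NVLink/PCIe only carries gradients once per accumulation window.

    import contextlib
    from torch.nn.parallel import DistributedDataParallel as DDP

    # model, optimizer, loader, loss_fn, local_rank are placeholders;
    # assumes torch.distributed is already initialized (e.g. via torchrun).
    ddp_model = DDP(model, device_ids=[local_rank])
    accum_steps = 8  # gradient accumulation factor

    for step, (x, y) in enumerate(loader):
        last = (step + 1) % accum_steps == 0
        # no_sync() skips the gradient all-reduce on all but the last micro-batch,
        # so the interconnect only sees gradient traffic once every accum_steps steps.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x.to(local_rank)), y.to(local_rank)) / accum_steps
            loss.backward()
        if last:
            optimizer.step()
            optimizer.zero_grad()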
Yeah, I'm talking about a ~10% performance difference for a specific model (I believe it was a vanilla seq2seq model from the Attention Is All You Need (AIAYN) paper, benchmarked with a small batch size and no gradient accumulation).
An NVLink bridge is only about $100, so in these cases, I guess you just shouldn't expect "as good as buying another card" levels of improvement from it.
I have seen some benefit on A100s only, but I'm unfamiliar with how to get any NVLink gains on RTX cards. Monitoring vanilla PyTorch DDP in nvtop on RTX, I haven't seen PCIe bus transfer speeds approaching the theoretical max. The OP uses bifurcators in his 8-GPU box, so the OP doesn't seem to be bus-limited either.
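FWIW, you can watch the same thing without nvtop using the NVML bindings (a rough sketch; NVML reports this counter in KB/s over a short sampling window, so treat the numbers as indicative rather than exact):

    import time
    import pynvml  # pip install nvidia-ml-py

    pynvml.nvmlInit()
    n = pynvml.nvmlDeviceGetCount()
    for _ in range(10):  # sample for ~10 seconds while a training job runs
        for i in range(n):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            # counters are KB/s; convert to GB/s for readability
            print(f"GPU{i}: tx {tx / 1e6:.2f} GB/s, rx {rx / 1e6:.2f} GB/s")
        time.sleep(1)
    pynvml.nvmlShutdown()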
I would additionally be skeptical that there's any consumer, or even fire-decal (gamer-brand), hardware that actually delivers 10GbE consistently across several nodes.
Enthusiast / small business / entry level enterprise gear will get you there, but you’re looking at several hundred dollars per port.
10GbE can be pretty cheap on the PC side with used hardware ($30 for an X540), though that's mostly SFP and not multi-gig. Even generic PCIe cards can add an RJ45 10G multi-gig port for ~$50.
The switch/router side is where I find it gets expensive.
True, but when you operate 7 GPUs on one board I reckon keeping a good eye on your thermals is where it starts. Otherwise you can kiss your GPUs goodbye well before you reach the payback moment.
Edit: and you'd have to factor in that cooling power as well into the running costs.
They might not be using NVLink because they sell on Vast, where someone could rent a single GPU out of the four available. No idea if there's some cross-user security implication with NVLink in that scenario.
If these server boards support Thunderbolt AICs (and I believe they might, since my Threadripper Pro board does), daisy-chaining them together could get you 40 Gbps fairly easily, if that's sufficient.
What's your opinion on the coming (or not) GPU crunch?
Looking at cloud GPU availability and current trends (no one except some enthusiasts and big tech is fine-tuning and serving at large scale yet, and results keep getting better and better), I fear we'll run into a situation where GPUs are extremely expensive and hard to come by until supply catches up.
I ordered a high-end PC with a 4090 for the first time in years (normally I'd always prefer cloud, even if it's more expensive) because I want to be on the safe side. What do you think: is this irrational, just a bubble thing?
It's a temporary crunch while TSMC expands capacity. They were reluctant to invest in capacity during the crypto crunch because they (rightly) understood that to be a temporary spike in demand. This time, they (also rightly) recognize AI as a longer-term shift in demand, and they're busy retooling to meet it.
The real question is whether NVIDIA will maintain its lock on the market, or if vendor-agnostic Torch will help commoditize the segment.
CUDA sits at a lower layer than Torch. There isn't much competition to CUDA if you want performance, but if you just want to run things there's the CPU (OpenBLAS etc., and Intel's MKL) and ONNX.
It's pretty astonishing to me that AMD has neither a proper math library (like MKL) nor a GPU compute library (like CUDA).
Without a fully connected NVLink network, the 3090s will be underutilized for models that distribute the layers across multiple GPUs.
If AMD were better supported, the most economical option would be 4x MI60s for 128GB, linked with an Infinity Fabric bridge. However, to get to the end of that journey, you'd have to really know what you're doing.
The bifurcated risers mean that some of the cards are only running at PCIe x8 speed as well, and they mention they're only working with PCIe 3.0, not 4.0.
This would severely limit training using model parallelism.
For data-parallel training, where the full model fits on each card and the batch size is just increased, it wouldn't matter as much, and maybe that's the primary use for this.
I wonder how this is dealt with on vast.ai rentals, because there's a huge difference between needing 7x 3090s where I need all 168GB to load the weights of a single giant LLM, vs. just wanting to run a 4GB Stable Diffusion model for parallel inference with a massive batch size...
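To make the contrast concrete, here's a hedged sketch of the two extremes (the model names, gpu_id, and the use of HF transformers + accelerate's device_map are my assumptions for illustration, not anything vast.ai specifies):

    import torch
    from transformers import AutoModelForCausalLM

    # Case 1: one model too big for any single card -> shard its layers across GPUs.
    # Every forward pass then ships activations between cards over PCIe/NVLink.
    big = AutoModelForCausalLM.from_pretrained(
        "some-70b-checkpoint",      # placeholder name
        device_map="auto",          # accelerate spreads the layers over all GPUs
        torch_dtype=torch.float16,
    )

    # Case 2: a small model (e.g. a ~4GB checkpoint) replicated once per GPU ->
    # each process runs independently, with essentially no inter-GPU traffic.
    gpu_id = 0                      # in practice, one process per GPU
    small = AutoModelForCausalLM.from_pretrained("some-small-checkpoint").to(f"cuda:{gpu_id}")

The first case lives or dies by the interconnect; the second barely notices it.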
It could depend entirely on what size model you're training and on the topology; it may not make a difference for you, but for reference:
See here [0] for the difference between having NVLink between cards or not: a 23% increase in training speed, and the note that the peak bandwidth between 2x 3090s with the link is 112.5 GB/s.
Now look at PCIe 3.0 speeds, which is what any two cards talking to each other through your risers would have to use: only 15.754 GB/s at x16, and only 7.877 GB/s if you're on an x8 riser.
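For reference, those figures fall straight out of the per-lane rate (a tiny sketch with nominal numbers, before protocol overhead):

    gt_per_s = 8.0                            # PCIe 3.0: 8 GT/s per lane
    encoding = 128 / 130                      # 128b/130b line coding
    per_lane_gb_s = gt_per_s * encoding / 8   # ~0.985 GB/s per lane
    for lanes in (16, 8):
        print(f"PCIe 3.0 x{lanes}: {per_lane_gb_s * lanes:.2f} GB/s")
    # vs. the ~112.5 GB/s quoted above for an NVLink-bridged pair of 3090s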
For some non-ML things I use GPUs for (CFD), the interconnect/memory-access bandwidth is the bottleneck, and the simulation time scales almost linearly with the PCIe bandwidth between the CPU and the cards.
Hm, that makes me wonder whether vast.ai does some kind of testing before dropping particular workloads on a machine, so that this can be handled without configuration. But there doesn't seem to be a financial incentive for suppliers of GPU capacity to provide good interconnects.
vast.ai does denote GPUs as either PCIe or SXM (SXM for the A100s and now the H100s up there). SXM bandwidth between GPUs is full n-way NVLink, the best you can get.
They also have a little stat that lists per-GPU bandwidth in GB/s and which PCIe version and speed is in use, so they must run some tests beforehand to gauge this. When I look now, it varies from setup to setup: I see people running quad 4090s on PCIe 4.0 x16 with 24 GB/s of bandwidth between them, some running x8, some on PCIe 3.0 with 11 GB/s, and even someone with quad 3090s all on PCIe 2.0 x1 slots with the bandwidth reading 0.3 GB/s!!! (likely an old mining rig with those x1 slots)
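If you want to sanity-check a rig's number yourself, a crude version is just timing device-to-device copies in PyTorch (a sketch assuming at least two visible GPUs; the CUDA samples' p2pBandwidthLatencyTest is the more rigorous tool):

    import time
    import torch

    assert torch.cuda.device_count() >= 2
    n_bytes = 1 << 30                               # move 1 GiB per copy
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

    for _ in range(3):                              # warm-up copies
        dst.copy_(src)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")

    iters = 10
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    dt = time.perf_counter() - t0
    print(f"GPU0 -> GPU1: ~{iters * n_bytes / dt / 1e9:.1f} GB/s")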
I've never rented on there so I'm not sure. It seems like the prices are set by the sellers but with some guidance from their performance tool, likely?
They have an overall "DLPerf" deep-learning performance score, and it seems like if you want to get users, you set your price per hour at something competitive with the market given that metric.
For instance, the guy with the quad 3090s on x1 slots actually has his hourly price set HIGHER than everyone else... even with a horrible DLPerf score. No one is ever going to use that.
Interesting... I should try some workloads to see what it can do in practice. Beats having a whole bunch of fans in my room and it seems to be cheaper than the other cloud GPU providers.
It was primarily being used to train TTS models (see https://github.com/neonbjb/tortoise-tts), which largely fit into a single GPU's memory. So, for data parallelism, x8 PCIe isn't much of a concern.
...Yes, and that's the problem. As a reminder, the A6000 is basically a double-RAM 3090 Ti, a 2020 GPU, going for $4K.
Nvidia is Nvidia and needs to preserve its pro-tier stratification. But AMD has less to lose, and Intel has nothing to lose, by pricing these more like gaming cards.
Probably because they want people to buy the W/MI series cards instead.
Though looking around, you can get an MI60 on eBay for $500, which seems really good for 32GB, and it still appears to be supported by ROCm (as it's essentially an MI50 with more HBM). It looks to be the cheapest way of getting a GPU with that much memory, though FP16 speed and BF16 support suffer compared to later generations. From what I've seen, most "home" ML tasks are hard memory-limited well before ALU limitations kick in. And I have no idea whether a home user would even be able to use the IF links to help with multi-GPU.
Likewise, based on the costs listed on their page, I'd say no more than $0.80/hour or so, assuming a 50% gross margin for vast.ai.
And that includes energy costs so I assume the OP has a cheap source of power. Here in NL I could not do this profitably, even off solar power it would be more efficient to sell that power to the grid than to use it to drive a GPU rig.
There is an actual difference. You can make actual, real-world impact with today's LLMs in the existing business world. They can genuinely make a difference for customers.
Not sure what your context is here, but I delete comments from time to time. It works fine as long as you get to it reasonably quickly and nobody has replied yet.
From the FAQ:
"What does [deleted] mean?
The author deleted the post outright, or asked us to. Unlike dead posts, these remain deleted even when showdead is turned on."