"TensorWave is a cloud provider specializing in AI workloads. Their platform leverages AMD’s Instinct™ MI300X accelerators, designed to deliver high performance for generative AI workloads and HPC applications."
The H100 SXM5 has 52% of the transistors the MI300X has and half the RAM, yet the MI300X achieves *ONLY* 33% higher throughput than the H100. The MI300X launched 6 months ago; the H100, 20 months ago.
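Rough arithmetic on what those ratios imply, using only the figures quoted above (taken as given, not re-verified here):

    # Back-of-envelope using the ratios quoted in the comment above
    # (transistor and throughput figures are the comment's, not re-verified).
    transistor_ratio_h100_vs_mi300x = 0.52   # H100 has ~52% of MI300X's transistors
    throughput_ratio_mi300x_vs_h100 = 1.33   # MI300X ~33% faster in this benchmark

    # Throughput per transistor, MI300X relative to H100:
    per_transistor = throughput_ratio_mi300x_vs_h100 * transistor_ratio_h100_vs_mi300x
    print(f"MI300X throughput per transistor: ~{per_transistor:.2f}x of the H100")  # ~0.69x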
Apple doesn't have any hardware SIMD technology that I'm aware of.
At best, Apple has the Metal API, which iOS video games use. I guess there's a level of SIMD-compute expertise there, but it'd take a lot of investment to turn that into a full-scale GPU that tangos with supercomputers. Software is a big piece of the puzzle for sure, but Metal isn't ready for prime time.
I'd say Apple is ahead of Intel (Intel keeps wasting its time and collapsing its own progress: Xeon Phi, Battlemage, etc. Intel can't keep investing in its own stuff long enough to reach critical mass). Intel does have oneAPI, but given how many times Intel collapses everything and starts over, I'm not sure how long oneAPI will last.
But Apple vs AMD? AMD 100% understands SIMD compute and has decades' worth of investment in it. The only problem with AMD is that they don't have the raw cash to extend that expertise to cover software, so AMD has to rely on Microsoft (DirectX), Vulkan, or whatever. ROCm may have its warts, but it represents over a decade of software development too (especially when you consider that ROCm started as "Boltzmann", which had several years of use before it shipped as ROCm).
-------
AMD ain't perfect. They had a little diversion into C++ AMP with Microsoft (and that served as the API for Boltzmann / early ROCm). But the overall path AMD is on at least makes sense, if a bit suboptimal compared to NVidia's huge investment in CUDA.
M3 Max's GPU is significantly more efficient in perf/watt than RDNA3, already has better ray tracing performance, and is even faster than a 7900XT desktop GPU in Blender.[0]
A couple of things: Blender uses HIP on AMD, which is nerfed on RDNA3 because of product segmentation, so this is really comparing against a 7900 XT that is deliberately mediocre at this workload.
The M3 Max is also, in a sense, a generation ahead of the 7900 XT in perf/watt, since it uses a newer manufacturing node.
I suppose it's also worth highlighting that if you enable OptiX in the comparison above, you can see Nvidia parts stomping all over both the AMD and Apple parts.
Why does AMD nerf RDNA3 when they're so far behind Nvidia and Apple in Blender performance? Do you have benchmarks for when AMD doesn't nerf Blender performance?
The M3 Max GPU uses at most 60-70 W. Meanwhile, the 7900 XT uses up to 412 W in burst mode.[0] TSMC N3 (M3 Max) uses 25-30% less power than TSMC N5 (7900 XT).[1] So if the 7900 XT were built on N3 and tuned for the same performance, it would burst to roughly 300 W, which is still roughly 4-5x the M3 Max. In other words, the perf/watt advantage of the M3 Max is mostly not about the node. It's the design.
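A quick sanity check on that node-scaling arithmetic, using only the figures quoted above:

    # Node-scaling estimate using the figures in the comment above.
    burst_7900xt_n5_w = 412.0         # reported 7900 XT burst power on N5
    n3_scaling = (0.70, 0.75)         # N3 said to use 25-30% less power than N5

    scaled = [burst_7900xt_n5_w * f for f in n3_scaling]
    print(f"hypothetical N3 7900 XT burst: {scaled[0]:.0f}-{scaled[1]:.0f} W")   # ~288-309 W

    m3_max_low_w, m3_max_high_w = 60.0, 70.0
    print(f"vs M3 Max: {scaled[0]/m3_max_high_w:.1f}x-{scaled[1]/m3_max_low_w:.1f}x")  # ~4-5x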
It's weird that you're choosing a nerfed part and sticking with it as a comparison point.
The article is about the MI300X, which is beating NVidia's H100.
> Do you have benchmarks for when AMD doesn't nerf Blender performance?
Go read the article above.
> Notably, our results show that MI300X running MK1 Flywheel outperforms H100 running vLLM for every batch size, with an increase in performance ranging from 1.22x to 2.94x.
-------
> Why does AMD nerf RDNA3 when they're so far behind Nvidia and Apple in Blender performance?
Nerf is a weird word.
AMD has focused on 32-bit and 64-bit FLOPs until now. AMD never put much effort into ray tracing. They reach acceptable levels on the Xbox / PS5, but it was NVidia that kept pushing ray tracing, not AMD.
Similarly: Blender's renderer is a ray tracer that uses those ray-tracing cores. So any chip with substantial on-chip ray-traversal / ray-marching / ray-intersection hardware will perform faster.
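To make that concrete, here's a plain-software sketch of the kind of per-ray test that RT cores implement in fixed-function hardware (a standard Möller-Trumbore ray/triangle intersection; nothing Blender- or vendor-specific, just an illustration of the work involved):

    # Software version of the per-ray work that dedicated RT hardware does in
    # fixed function: a Möller-Trumbore ray/triangle intersection test. A path
    # tracer runs enormous numbers of these (plus BVH box tests) per frame.
    def ray_hits_triangle(origin, dirn, v0, v1, v2, eps=1e-8):
        sub   = lambda a, b: (a[0]-b[0], a[1]-b[1], a[2]-b[2])
        cross = lambda a, b: (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
        dot   = lambda a, b: a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

        e1, e2 = sub(v1, v0), sub(v2, v0)
        h = cross(dirn, e2)
        a = dot(e1, h)
        if abs(a) < eps:          # ray is parallel to the triangle's plane
            return None
        f = 1.0 / a
        s = sub(origin, v0)
        u = f * dot(s, h)
        if u < 0.0 or u > 1.0:
            return None
        q = cross(s, e1)
        v = f * dot(dirn, q)
        if v < 0.0 or u + v > 1.0:
            return None
        t = f * dot(e2, q)        # distance along the ray to the hit point
        return t if t > eps else None

    # One ray against one triangle; a real render does billions of these.
    print(ray_hits_triangle((0, 0, -1), (0, 0, 1), (-1, -1, 0), (1, -1, 0), (0, 1, 0)))  # 1.0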
Blender isn't what most people do with GPUs. The #1 thing they do is play video games like Baldur's Gate 3.
-------
It'd be like me asking why Apple's M3 can't run Baldur's Gate 3. It's not a "nerf", it's a purposeful engineering decision.
Currently no, but the Xserve was a product for a decade. And they have built an internal ML cloud, presumably with rack-mountable hardware. The bigger issue for Apple, IMO, is that they ditched the server features of their OS, and they're not going to sell a hypothetical M4 Ultra Xserve running Linux.
Good point; if the Mx architecture does prove to be a viable competitor to Nvidia/AMD for training and/or inference, do you think Apple would enter the server market?
They continue to diversify on the consumer side, and I wonder if they have their eye on the business market; I'm not sure their general strategy of "prettier is better" would work well there, though.
Clearly they are building their own Apple Silicon powered servers for Private Cloud Compute, even if they aren't sold to outsiders the way the Xserve used to be.
AMD's deep learning libraries were very bad the last time I checked; nobody uses AMD in that space for that reason. Nvidia has a quasi-monopoly, and that's the main reason for the price difference, IMHO.
Nearly 95% of deep learning GitHub repos are "tested using a CUDA GPU - others, not so sure".
The only way to outrun Nvidia is to offer 3-10x better bang for the buck.
Or AMD could just provide a "DIY unlimited GPU RAM upgrade" kit -- a lot of people are buying the 128 GB Mac Studio because it offers more RAM for the buck than Nvidia GPUs.
I heard the Apple M4 Ultra will use 256 GB of HBM in the Studio and Pro, but I don't buy it. The 256 GB, maybe. But an HBM memory controller that would go unused on laptops doesn't pass the smell test.
I think their best option might be more/better prosumer options at the higher end of consumer pricing, getting more hobbyists into play purely on the value proposition.
Isn't SXM5 higher bandwidth? It's 900 GB/s of bidirectional bandwidth per GPU across 18 NVLink 4 links. The NVLs are on PCIe 5, and even with NVLink they only get to 600 GB/s across 3 NVLink bridges (and only between pairs of cards)?
I haven't done a head-to-head, and I suppose it depends on whether tensor parallelism actually scales linearly, but my understanding is that since the NVLs are just PCIe H100s paired over NVLink, you're not really getting much, if any, benefit on something like vLLM.
I think the more interesting critique might be the slightly odd choice of Mixtral 8x7B vs., say, a more standard Llama 2/3 70B (or just testing multiple models, including some big ones like 8x22B or DBRX).
Also, while I don't have a problem with vLLM, as TensorRT gets easier to set up it might become a factor in comparisons (since they punted on FP8/AMP in these tests). Inferless published a shootout a couple of months ago comparing a few different inference engines: https://www.inferless.com/learn/exploring-llms-speed-benchma...
Price/perf does tell a story, but I think it's one that's mostly about Nvidia's platform dominance and profit margins rather than intrinsic hardware advantages. On the spec sheet the MI300X has a memory bandwidth and even a raw FLOPS advantage, but so far it has lacked proper software optimization/support and wide availability (has anyone besides hyperscalers and select partners been able to get them?).
> but I think it's one that's mostly about Nvidia's platform dominance and profit margins more
Profit margins and dominance result from performance, not the other way around.
It doesn't matter whether Nvidia's tools are better when you're deploying a large number of chips for inference and the hardware does more FLOPs per watt or per second. It's a seller's market, and if AMD can't ask a high price, it means their chips don't perform.
-------
Question:
People here seem to think that Nvidia has absolutely no advantage in their microarchitecture design skills. It's all in software or monopoly.
> People here seem to think that Nvidia has absolutely no advantage in their microarchitecture design skills. It's all in software or monopoly.
That's an extrapolation. Microarchitecture design skill isn't just theoretical numbers you manage to put on a spec sheet. You cannot decouple the hardware from the software driving it - that's not a trivial problem.
> Microarchitecture design skills are not theoretical numbers you manage to put on a spec sheet
not only can you measure this, not only do they measure this, but it's literally the first component of the Rayleigh resolution equation and everyone is constantly optimizing for it all the time.
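For reference, the lithography resolution criterion being alluded to; the process- and design-dependent factor k1 is the term everyone works to drive down:

    % Rayleigh criterion for optical lithography: CD is the minimum printable
    % feature size, \lambda the exposure wavelength, NA the numerical aperture,
    % and k_1 the process/design-dependent factor that computational
    % lithography and design co-optimization push down.
    \[
      \mathrm{CD} \;=\; k_1 \, \frac{\lambda}{\mathrm{NA}}
    \]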
in the abstract, why does it surprise you that the semiconductor industry would have a way to quantify that?
like, realize that NVIDIA being on a tear with their designs has specifically coincided with the point when they decided to go all-in on AI (the 2014-2015 era). Maxwell was the first architecture that showed what a stripped-down architecture could do with neural nets, and it's pretty clear NVIDIA has been working on this ML-assisted computational lithography and computational design stuff since around then, I would say - though they've only been public about it for several years now (it might be longer, I'd have to look back).
Since that "mid 2010s" moment, it's been Pascal vs Vega, Turing (significant redesign and explicit focus on AI/tensor) vs RDNA1 (significant focus on crashing to desktop), Ampere vs RDNA2, etc. Since then, NVIDIA has almost continuously done more with less: beaten custom advanced tech like HBM with commodity products and small evolutions thereupon (like GDDR5X/6X), matched or beaten the efficiency of extremely expensive TSMC nodes with junk samsung crap they got for a song, etc. Quantitatively by any metric they have done much better than AMD. Like Vega is your example of AMD design? Or RDNA1, the architecture that never quite ran stable? RDNA3, the architecture that still doesn't idle right, and whose MCM still uses so much silicon it raises costs instead of lowering them? Literally the sole generation that's not been a total disaster from The Competition has been RDNA2, so yeah, solid wins and iteration is all it takes to say they are doing quantitatively better, especially considering NVIDIA was overcoming a node disadvantage for most of that. They were focused on bringing costs down, and frankly they were so successful despite that that AMD kinda gave up on trying to outprice them.
Contrast that with the POSCAP/MLCC problem in 2020: despite a lot of hype from the tech media that it was going to be a huge scandal and cost performance, NVIDIA patched it dead in a week with basically no perf cost. Gosh, do you think they might have run some GPGPU-accelerated simulations to figure out so quickly how the chip was going to boost and what the transient surges were going to be?
literally they do have better design skills, and some of it is their systems thinking, and some of it is their engineers (they pay better/have better QOL and working conditions, and get the cream of the crop), and some of it is their better design+computational lithography techniques that they have been dogfooding for 3-4 generations now.
people don't get it: startup mentality, founder-led, with a $3t market cap. Jensen is built different. Why wouldn’t they have been using this stuff internally? That’s an extremely Jensen move.
NVidia doesn't have a general advantage in hardware design skill, but they have been focused on AI workloads for quite a while, whereas AMD spent a long time focusing on HPC priorities like 64-bit floating-point performance.
Fair? The H100 NVL is two H100s sold as one product.. which probably costs as much as 2x H100 or more.
If so, OK, it's fair to compare 1 MI300X with 1 H100 NVL, but then price (and TCO) should be added to the conclusion's metrics. Also, the NVL is 2x PCIe 5.0 quad-slot cards, so not the same thing..
I'm not sure about system compatibility, or whether and how you can stack 8 of those in one system (like you can with the non-NVL and the MI300X..), so it's a bit of a different (and more niche) beast.
AMD is at a much higher P/E ratio. Is the market expecting AMD to up its game in the GPU sector? Or is the market expecting a pullback in GPU demand, due to the possibility of non-GPU AI solutions becoming the frontier or of AI investment slowing down?
I think the expectation is that NVIDIA is in a somewhat unreasonable position right now (and for the immediate future), getting about 80% gross margins on their datacenter GPUs. That's an extremely juicy target for competitors, and even if a competitor only manages to produce a product half as good as NVIDIA's, NVIDIA will have to cut prices to compete.
Well, that's the beauty of specifying exactly how you ran your benchmark: it's easy to reproduce and confirm or disprove (assuming you've got the hardware).
It looks like Runpod currently (checked right now) has "Low" availability of 8x MI300 SXM (8x$4.89/h), H100 NVL (8x$4.39/h), and H100 (8x$4.69/h) nodes for anyone w/ some time to kill that wants to give the shootout a try.
You're joking/trolling, right? There are literally tens of thousands of H100s available on gpulist right now; does that mean there's no cloud demand for Nvidia GPUs? (I notice from your comment history that you seem to be some sort of bizarre NVDA stan account, but come on, be serious.)
Also, stuff like this makes it hard to take the results seriously:
* To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
* All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
They did everything they could to make sure AMD came out faster.
You need 2 H100s to have enough VRAM for the model, whereas you need only 1 MI300X. Doubling the total throughput (across all completions) of 1 MI300X to simulate the numbers for a duplicated system is reasonable.
They should probably also show the per-completion throughput separately, since tensor parallelism is often used for that purpose in addition to doubling the available VRAM. A rough sketch of the normalization follows.
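To make the debate concrete, here is the normalization in code, with made-up throughput numbers that are not the report's measurements:

    # Hypothetical numbers (NOT the report's) to illustrate the normalization
    # being debated: one MI300X holds the FP16 model by itself, while the H100
    # run uses tensor parallelism across 2 GPUs.
    mi300x_tput_1gpu = 1000.0   # tokens/s on a single MI300X (made up)
    h100_tput_2gpu   = 1200.0   # tokens/s on 2x H100 with TP=2 (made up)

    # The report's normalization: treat two independent MI300X replicas
    # as one "system" to match the 2-GPU H100 setup.
    mi300x_tput_extrapolated = 2 * mi300x_tput_1gpu

    print("per-system:", mi300x_tput_extrapolated, "vs", h100_tput_2gpu)
    print("per-GPU   :", mi300x_tput_1gpu, "vs", h100_tput_2gpu / 2)
    # Per-completion speed is a separate question: TP=2 can accelerate a single
    # completion, while two replicas only raise aggregate throughput.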
I don't understand why they should have used TensorRT. vLLM is much more popular, and it was originally written for Nvidia hardware. It also supports AMD, so it's the appropriate tool for the comparison.
I see it as: they did everything they could to compare a specific code path. If your workload scales with FP16 but not with tensor cores, then this is the correct way to test. What do you need for LLM inference?
vLLM inference of Mixtral in FP16 is a real workload. I guess those details are there because of the different inference engines used. You need the compute tasks to be as similar as possible, but the compute kernels can't be identical, since in the end they have to run on different hardware. A minimal sketch of the workload is below.
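For context, here is roughly what that vLLM / Mixtral FP16 workload looks like; the same script runs on a CUDA or a ROCm build of vLLM, and the model name and settings are illustrative rather than the report's exact configuration:

    # Minimal vLLM offline-inference sketch (illustrative settings, not the
    # report's exact config). The same script works on a CUDA or ROCm build of
    # vLLM; only tensor_parallel_size changes (2x H100 vs 1x MI300X for a
    # 16-bit Mixtral checkpoint, per the discussion above).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        dtype="float16",          # FP16 compute path, as in the report
        tensor_parallel_size=2,   # set to 1 on a single 192 GB MI300X
    )

    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    for out in outputs:
        print(out.outputs[0].text)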
I suggest taking the report with a grain of salt.