"TensorWave is a cloud provider specializing in AI workloads. Their platform leverages AMD’s Instinct™ MI300X accelerators, designed to deliver high performance for generative AI workloads and HPC applications."
The H100 SXM5 has 52% of the transistors the MI300X has and half the RAM, yet the MI300X achieves *ONLY* 33% higher throughput than the H100. The MI300X launched 6 months ago; the H100, 20 months ago.
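Rough arithmetic on what those ratios imply, using only the figures quoted above (taken as given, not re-verified here):

    # Back-of-envelope using the ratios quoted in the comment above
    # (transistor and throughput figures are the comment's, not re-verified).
    transistor_ratio_h100_vs_mi300x = 0.52   # H100 has ~52% of MI300X's transistors
    throughput_ratio_mi300x_vs_h100 = 1.33   # MI300X ~33% faster in this benchmark

    # Throughput per transistor, MI300X relative to H100:
    per_transistor = throughput_ratio_mi300x_vs_h100 * transistor_ratio_h100_vs_mi300x
    print(f"MI300X throughput per transistor: ~{per_transistor:.2f}x of the H100")  # ~0.69x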
Apple doesn't have any hardware SIMD technology that I'm aware of.
At best, Apple has the Metal API, which iOS video games use. I guess there's a level of SIMD-compute expertise there, but it'd take a lot of investment to turn that into a full-scale GPU that tangos with supercomputers. Software is a big piece of the puzzle for sure, but Metal isn't ready for prime time.
I'd say Apple is ahead of Intel (Intel keeps wasting its time and collapsing its own progress: Xeon Phi, Battlemage, etc. Intel can't keep investing in its own stuff long enough to reach critical mass). Intel does have oneAPI, but given how many times Intel collapses everything and starts over, I'm not sure how long oneAPI will last.
But Apple vs AMD? AMD 100% understands SIMD compute and has decades' worth of investment in it. The only problem with AMD is that they don't have the raw cash to extend that expertise to cover software, so AMD has to rely on Microsoft (DirectX), Vulkan, or whatever. ROCm may have its warts, but it represents over a decade of software development too (especially when you consider that ROCm started as "Boltzmann", which had several years of use before it shipped as ROCm).
-------
AMD ain't perfect. They had a little diversion into C++ AMP with Microsoft (and that served as the API for Boltzmann / early ROCm). But the overall path AMD is on at least makes sense, if a bit suboptimal compared to NVidia's huge investment in CUDA.
M3 Max's GPU is significantly more efficient in perf/watt than RDNA3, already has better ray tracing performance, and is even faster than a 7900XT desktop GPU in Blender.[0]
A couple of things: Blender uses HIP on AMD, which is nerfed on RDNA3 because of product segmentation, so this is really comparing against a 7900 XT that is deliberately mediocre at this workload.
The M3 Max is also, in a sense, a generation ahead of the 7900 XT in perf/watt, since it uses a newer manufacturing node.
I suppose it's also worth highlighting that if you enable OptiX in the comparison above, you can see Nvidia parts stomping all over both the AMD and Apple parts.
Why does AMD nerf RDNA3 when they're so far behind Nvidia and Apple in Blender performance? Do you have benchmarks for when AMD doesn't nerf Blender performance?
The M3 Max GPU uses at most 60-70 W. Meanwhile, the 7900 XT uses up to 412 W in burst mode.[0] TSMC N3 (M3 Max) uses 25-30% less power than TSMC N5 (7900 XT).[1] So if the 7900 XT were built on N3 and tuned for the same performance, it would burst to roughly 300 W, which is still roughly 4-5x the M3 Max. In other words, the perf/watt advantage of the M3 Max is mostly not about the node. It's the design.
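A quick sanity check on that node-scaling arithmetic, using only the figures quoted above:

    # Node-scaling estimate using the figures in the comment above.
    burst_7900xt_n5_w = 412.0         # reported 7900 XT burst power on N5
    n3_scaling = (0.70, 0.75)         # N3 said to use 25-30% less power than N5

    scaled = [burst_7900xt_n5_w * f for f in n3_scaling]
    print(f"hypothetical N3 7900 XT burst: {scaled[0]:.0f}-{scaled[1]:.0f} W")   # ~288-309 W

    m3_max_low_w, m3_max_high_w = 60.0, 70.0
    print(f"vs M3 Max: {scaled[0]/m3_max_high_w:.1f}x-{scaled[1]/m3_max_low_w:.1f}x")  # ~4-5x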
It's weird that you're choosing a nerfed part and sticking with it as a comparison point.
The article is about the MI300X, which is beating NVidia's H100.
> Do you have benchmarks for when AMD doesn't nerf Blender performance?
Go read the article above.
> Notably, our results show that MI300X running MK1 Flywheel outperforms H100 running vLLM for every batch size, with an increase in performance ranging from 1.22x to 2.94x.
-------
> Why does AMD nerf RDNA3 when they're so far behind Nvidia and Apple in Blender performance?
Nerf is a weird word.
AMD has focused on 32-bit and 64-bit FLOPs until now. AMD never put much effort into ray tracing. They reach acceptable levels on the Xbox / PS5, but it was NVidia that kept pushing ray tracing, not AMD.
Similarly: Blender's renderer is a ray tracer that uses those ray-tracing cores. So any chip with substantial on-chip ray-traversal / ray-marching / ray-intersection hardware will perform faster.
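To make that concrete, here's a plain-software sketch of the kind of per-ray test that RT cores implement in fixed-function hardware (a standard Möller-Trumbore ray/triangle intersection; nothing Blender- or vendor-specific, just an illustration of the work involved):

    # Software version of the per-ray work that dedicated RT hardware does in
    # fixed function: a Möller-Trumbore ray/triangle intersection test. A path
    # tracer runs enormous numbers of these (plus BVH box tests) per frame.
    def ray_hits_triangle(origin, dirn, v0, v1, v2, eps=1e-8):
        sub   = lambda a, b: (a[0]-b[0], a[1]-b[1], a[2]-b[2])
        cross = lambda a, b: (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
        dot   = lambda a, b: a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

        e1, e2 = sub(v1, v0), sub(v2, v0)
        h = cross(dirn, e2)
        a = dot(e1, h)
        if abs(a) < eps:          # ray is parallel to the triangle's plane
            return None
        f = 1.0 / a
        s = sub(origin, v0)
        u = f * dot(s, h)
        if u < 0.0 or u > 1.0:
            return None
        q = cross(s, e1)
        v = f * dot(dirn, q)
        if v < 0.0 or u + v > 1.0:
            return None
        t = f * dot(e2, q)        # distance along the ray to the hit point
        return t if t > eps else None

    # One ray against one triangle; a real render does billions of these.
    print(ray_hits_triangle((0, 0, -1), (0, 0, 1), (-1, -1, 0), (1, -1, 0), (0, 1, 0)))  # 1.0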
Blender isn't what most people do with GPUs. The #1 thing they do is play video games like Baldur's Gate 3.
-------
It'd be like me asking why Apple's M3 can't run Baldur's Gate 3. It's not a "nerf", it's a purposeful engineering decision.
Currently no, but the Xserve was a product for a decade. And they have built an internal ML cloud, presumably with rack-mountable hardware. The bigger issue for Apple, IMO, is that they ditched the server features of their OS, and they're not going to sell a hypothetical M4 Ultra Xserve running Linux.
Good point; if the Mx architecture does prove to be a viable competitor to Nvidia/AMD for training and/or inference, do you think Apple would enter the server market?
They continue to diversify on the consumer side, and I wonder if they have their eye on the business market; I'm not sure their general strategy of "prettier is better" would work well there, though.
Clearly they are building their own Apple Silicon powered servers for Private Cloud Compute, even if they aren't sold to outsiders the way the Xserve used to be.
AMD's deep learning libraries were very bad the last time I checked; nobody uses AMD in that space for that reason. Nvidia has a quasi-monopoly, and that's the main reason for the price difference, IMHO.
Nearly 95% of deep learning GitHub repos are "tested using a CUDA GPU - others, not so sure".
The only way to outrun Nvidia is to offer 3-10x better bang for the buck.
Or AMD could just provide a "DIY unlimited GPU RAM upgrade" kit -- a lot of people are buying the 128 GB Mac Studio because it offers more RAM for the buck than Nvidia GPUs.
I heard the Apple M4 Ultra will use 256 GB of HBM in the Studio and Pro, but I don't buy it. The 256 GB, maybe. But an HBM memory controller that would go unused on laptops doesn't pass the smell test.
I think their best option might be more/better prosumer options at the higher end of consumer pricing, getting more hobbyists into play purely on the value proposition.
Isn't SXM5 higher bandwidth? It's 900 GB/s of bidirectional bandwidth per GPU across 18 NVLink 4 links. The NVLs are on PCIe 5, and even with NVLink they only get to 600 GB/s across 3 NVLink bridges (and only between pairs of cards)?
I haven't done a head-to-head, and I suppose it depends on whether tensor parallelism actually scales linearly, but my understanding is that since the NVLs are just PCIe H100s paired over NVLink, you're not really getting much, if any, benefit on something like vLLM.
I think the more interesting critique might be the slightly odd choice of Mixtral 8x7B vs., say, a more standard Llama 2/3 70B (or just testing multiple models, including some big ones like 8x22B or DBRX).
Also, while I don't have a problem with vLLM, as TensorRT gets easier to set up it might become a factor in comparisons (since they punted on FP8/AMP in these tests). Inferless published a shootout a couple of months ago comparing a few different inference engines: https://www.inferless.com/learn/exploring-llms-speed-benchma...
Price/perf does tell a story, but I think it's one that's mostly about Nvidia's platform dominance and profit margins rather than intrinsic hardware advantages. On the spec sheet the MI300X has a memory bandwidth and even a raw FLOPS advantage, but so far it has lacked proper software optimization/support and wide availability (has anyone besides hyperscalers and select partners been able to get them?).
> but I think it's one that's mostly about Nvidia's platform dominance and profit margins more
Profit margins and dominance result from performance, not the other way around.
It doesn't matter whether Nvidia's tools are better when you're deploying a large number of chips for inference and the hardware does more FLOPs per watt or per second. It's a seller's market, and if AMD can't ask a high price, it means their chips don't perform.
-------
Question:
People here seem to think that Nvidia has absolutely no advantage in their microarchitecture design skills. It's all in software or monopoly.
> People here seem to think that Nvidia has absolutely no advantage in their microarchitecture design skills. It's all in software or monopoly.
That's an extrapolation. Microarchitecture design skill isn't just theoretical numbers you manage to put on a spec sheet. You cannot decouple the hardware from the software driving it - that's not a trivial problem.
> Microarchitecture design skills are not theoretical numbers you manage to put on a spec sheet
not only can you measure this, not only do they measure this, but it's literally the first component of the Rayleigh resolution equation and everyone is constantly optimizing for it all the time.
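For reference, the lithography resolution criterion being alluded to; the process- and design-dependent factor k1 is the term everyone works to drive down:

    % Rayleigh criterion for optical lithography: CD is the minimum printable
    % feature size, \lambda the exposure wavelength, NA the numerical aperture,
    % and k_1 the process/design-dependent factor that computational
    % lithography and design co-optimization push down.
    \[
      \mathrm{CD} \;=\; k_1 \, \frac{\lambda}{\mathrm{NA}}
    \]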
in the abstract, why does it surprise you that the semiconductor industry would have a way to quantify that?
like, realize that NVIDIA being on a tear with their designs has specifically coincided with the point when they decided to go all-in on AI (the 2014-2015 era). Maxwell was the first architecture that showed what a stripped-down architecture could do with neural nets, and it's pretty clear NVIDIA has been working on this ML-assisted computational lithography and computational design stuff since around then, I would say - though they've only been public about it for several years now (it might be longer, I'd have to look back).
Since that "mid 2010s" moment, it's been Pascal vs Vega, Turing (significant redesign and explicit focus on AI/tensor) vs RDNA1 (significant focus on crashing to desktop), Ampere vs RDNA2, etc. Since then, NVIDIA has almost continuously done more with less: beaten custom advanced tech like HBM with commodity products and small evolutions thereupon (like GDDR5X/6X), matched or beaten the efficiency of extremely expensive TSMC nodes with junk samsung crap they got for a song, etc. Quantitatively by any metric they have done much better than AMD. Like Vega is your example of AMD design? Or RDNA1, the architecture that never quite ran stable? RDNA3, the architecture that still doesn't idle right, and whose MCM still uses so much silicon it raises costs instead of lowering them? Literally the sole generation that's not been a total disaster from The Competition has been RDNA2, so yeah, solid wins and iteration is all it takes to say they are doing quantitatively better, especially considering NVIDIA was overcoming a node disadvantage for most of that. They were focused on bringing costs down, and frankly they were so successful despite that that AMD kinda gave up on trying to outprice them.
Contrast that with the POSCAP/MLCC problem in 2020: despite a lot of hype from the tech media that it was going to be a huge scandal and cost performance, NVIDIA patched it dead in a week with basically no perf cost. Gosh, do you think they might have run some GPGPU-accelerated simulations to figure out so quickly how the chip was going to boost and what the transient surges were going to be?
literally they do have better design skills, and some of it is their systems thinking, and some of it is their engineers (they pay better/have better QOL and working conditions, and get the cream of the crop), and some of it is their better design+computational lithography techniques that they have been dogfooding for 3-4 generations now.
people don't get it: startup mentality, founder-led, with a $3t market cap. Jensen is built different. Why wouldn’t they have been using this stuff internally? That’s an extremely Jensen move.
NVidia doesn't have a general advantage in hardware design skill, but they have been focused on AI workloads for quite a while, whereas AMD spent a long time focusing on HPC priorities like 64-bit floating-point performance.
Fair? The H100 NVL is two H100s sold as one product.. which probably costs as much as 2x H100 or more.
If so, OK, it's fair to compare 1 MI300X with 1 H100 NVL, but then price (and TCO) should be added to the conclusion's metrics. Also, the NVL is 2x PCIe 5.0 quad-slot cards, so not the same thing..
I'm not sure about system compatibility, or whether and how you can stack 8 of those in one system (like you can with the non-NVL and the MI300X..), so it's a bit of a different (and more niche) beast.
AMD is at a much higher P/E ratio. Is the market expecting AMD to up its game in the GPU sector? Or is the market expecting a pullback in GPU demand, due to the possibility of non-GPU AI solutions becoming the frontier or of AI investment slowing down?
I think the expectation is that NVIDIA is in a somewhat unreasonable position right now (and for the immediate future), getting about 80% gross margins on their datacenter GPUs. That's an extremely juicy target for competitors, and even if a competitor only manages to produce a product half as good as NVIDIA's, NVIDIA will have to cut prices to compete.
Well, that's the beauty of specifying exactly how you ran your benchmark: it's easy to reproduce and confirm or disprove (assuming you've got the hardware).
It looks like Runpod currently (checked right now) has "Low" availability of 8x MI300 SXM (8x$4.89/h), H100 NVL (8x$4.39/h), and H100 (8x$4.69/h) nodes for anyone w/ some time to kill that wants to give the shootout a try.
You're joking/trolling, right? There are literally tens of thousands of H100s available on gpulist right now; does that mean there's no cloud demand for Nvidia GPUs? (I notice from your comment history that you seem to be some sort of bizarre NVDA stan account, but come on, be serious.)
Also, stuff like this makes it hard to take the results seriously:
* To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.
* All inference frameworks are configured to use FP16 compute paths. Enabling FP8 compute is left for future work.
They did everything they could to make sure AMD came out faster.
You need 2 H100s to have enough VRAM for the model, whereas you need only 1 MI300X. Doubling the total throughput (across all completions) of 1 MI300X to simulate the numbers for a duplicated system is reasonable.
They should probably also show the per-completion throughput separately, since tensor parallelism is often used for that purpose in addition to doubling the available VRAM. A rough sketch of the normalization follows.
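To make the debate concrete, here is the normalization in code, with made-up throughput numbers that are not the report's measurements:

    # Hypothetical numbers (NOT the report's) to illustrate the normalization
    # being debated: one MI300X holds the FP16 model by itself, while the H100
    # run uses tensor parallelism across 2 GPUs.
    mi300x_tput_1gpu = 1000.0   # tokens/s on a single MI300X (made up)
    h100_tput_2gpu   = 1200.0   # tokens/s on 2x H100 with TP=2 (made up)

    # The report's normalization: treat two independent MI300X replicas
    # as one "system" to match the 2-GPU H100 setup.
    mi300x_tput_extrapolated = 2 * mi300x_tput_1gpu

    print("per-system:", mi300x_tput_extrapolated, "vs", h100_tput_2gpu)
    print("per-GPU   :", mi300x_tput_1gpu, "vs", h100_tput_2gpu / 2)
    # Per-completion speed is a separate question: TP=2 can accelerate a single
    # completion, while two replicas only raise aggregate throughput.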
I don't understand why they should have used TensorRT. vLLM is much more popular, and it was originally written for Nvidia hardware. It also supports AMD, so it's the appropriate tool for the comparison.
I see it as: they did everything they could to compare a specific code path. If your workload scales with FP16 but not with tensor cores, then this is the correct way to test. What do you need for LLM inference?
vLLM inference of Mixtral in FP16 is a real workload. I guess those details are there because of the different inference engines used. You need the compute tasks to be as similar as possible, but the compute kernels can't be identical, since in the end they have to run on different hardware. A minimal sketch of the workload is below.
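For context, here is roughly what that vLLM / Mixtral FP16 workload looks like; the same script runs on a CUDA or a ROCm build of vLLM, and the model name and settings are illustrative rather than the report's exact configuration:

    # Minimal vLLM offline-inference sketch (illustrative settings, not the
    # report's exact config). The same script works on a CUDA or ROCm build of
    # vLLM; only tensor_parallel_size changes (2x H100 vs 1x MI300X for a
    # 16-bit Mixtral checkpoint, per the discussion above).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        dtype="float16",          # FP16 compute path, as in the report
        tensor_parallel_size=2,   # set to 1 on a single 192 GB MI300X
    )

    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    for out in outputs:
        print(out.outputs[0].text)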
I suggest taking the report with a grain of salt.