
The multilingual example in the launch graphic has Qwen3 producing the text:

> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"

translation: "Hello, could you tell me how to get to Tiananmen Square?"

a bold choice!


Westerners only know it from the massacre, but it's actually just like Times Square for them.

Not really. It's a significant place, which is why the protest (and hence the massacre) was there; so especially for Chinese people (I expect), merely referencing it doesn't immediately bring the massacre to mind. They have plenty of other connotations for it.

e.g. if something similar happened in Trafalgar Square, I expect it would still primarily be a major square in London to me, not "oh my god, they must be referring to that awful event". (In fact I think it was targeted in the 7/7 bombings, for example.)

Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the histoire of its storming in the French Revolution.

No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.


Not to mention, Tiananmen Square is one of the major tourist destinations in Beijing (similar to the National Mall in Washington, DC), for both domestic and foreign visitors.

This is true. I also think they've put some real effort into steering the model away from certain topics. If you press too closely you'll get a response like:

"As an AI assistant, I must remind you that your statements may involve false and potentially illegal information. Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak."


It's only really a reference to the massacre if you include the date, or at least '89.

I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so they are pretty close in performance per dollar.

Starting with the 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in ML training. FP8 and FP16 matmuls run at full speed when accumulating in FP16 (I've never seen anyone use this), but only half speed when accumulating in FP32. This restriction is not present for lower-precision matmuls like FP4, and is removed entirely on workstation-class cards like the RTX Pro 6000.

It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could be cheaper to run on a 3090 than an A100). They are generous with memory bandwidth though; nearly 2 TB/s on the 5090 is amazing!
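If anyone wants to sanity-check the effective number on their own card, here's a rough PyTorch sketch (the matrix size and iteration count are arbitrary, and I'm assuming the default cuBLAS BF16 path, which accumulates in FP32):

    # Rough BF16 matmul throughput check; not a rigorous benchmark.
    import time
    import torch

    n = 8192
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

    for _ in range(3):          # warm up so cuBLAS settles on a kernel
        a @ b
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0

    flops = 2 * n**3 * iters    # 2*N^3 FLOPs per square matmul
    print(f"~{flops / elapsed / 1e12:.1f} TFLOP/s")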


Is there really that big a difference in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from?


You're asking a really good question, but it's not a question with an easy answer.

There's a lot more to performance computing than FLOPs. FLOPs are a good, high-level, easy-to-understand metric, but they're a small part of the story once you're in the weeds.

To help make sense of this, look at CPU frequencies. I think most people on HN know that two CPUs with the same frequency can have dramatically different outcomes on benchmarks, right? You might know that some of these differences come down to things like IPC (instructions per cycle) or the cache structure. There's even more to it, but we know it's not so easy to measure, right?

On a GPU all of that is true, but with even more complexity. Your GPU is more similar to a whole motherboard, where your PCIe connection is a really, really fast network connection. There are lots of faults to this analogy, but it's closer than just comparing TFLOPs.

Nvidia's moat has always been "CUDA". Quotes because even that is a messier term than most think (CUTLASS, cuBLAS, cuDNN, CuTe, etc.). The new cards are just capable of things the older ones aren't - a mix of hardware and software.

I know this isn't a great answer, but there isn't one. You'll probably get some responses, and many of them will have parts of the story, but it's hard to paint a really good picture in a comment. There's no answer that is both good and short.


No, GPUs are a lot simpler. You can mostly just take the clock rate and scale it directly for the instruction being compared.


There's a 2x performance hit from the weird restriction on FP32 accumulation, plus the fact that the 5090 has "fake" Blackwell (no tcgen05), which limits the size and throughput of matrix multiplication through the tensor cores.


Isn't the 5090 FE (roughly 2,500 USD in my country) pretty good FLOP value? 32 GB of VRAM (and flash attention pushes it even faster compared to Apple/MPS's relatively cheap "VRAM").


Not really:

5090: 210 TF / $2k == 105 TF/$k

B200: 2250 TF / $40k == 56 TF/$k

Getting only 2x the FLOPs per dollar probably isn't worth the hassle of having to rack 10x as many GPUs, while having no NVLink.
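Same math in code, if you want to plug in your own numbers (both prices here are rough assumptions):

    # Back-of-envelope FLOPs per dollar; prices are rough assumptions.
    cards = {
        "5090": {"tflops": 210,  "price_usd": 2_000},
        "B200": {"tflops": 2250, "price_usd": 40_000},
    }
    for name, c in cards.items():
        print(f"{name}: {c['tflops'] / (c['price_usd'] / 1000):.0f} TF per $1k")
    # 5090 ~105 TF/$k vs B200 ~56 TF/$k: roughly a 1.9x edge on paper,
    # before NVLink, rack space, and power enter the picture.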


That's one of the reasons they removed NVLink from consumer cards (they supported it before). There's also the issue of power consumption (1x B200 vs 10x 5090).


Sure, but when spending 20x more, getting almost twice the compute per buck seems expected


Isn't the new trend to train in lower precision anyway?


Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes this works, especially if we're talking about MXFP8 as supported on Blackwell [0].

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you can get it to work stably, that pathway runs at full throughput on the 5090.

[0]: https://arxiv.org/abs/2506.08027

[1]: https://arxiv.org/abs/2502.20586
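To make the "accumulate in FP32" point concrete, here's a toy emulation in PyTorch (assumes a recent build with the torch.float8_e4m3fn dtype; real FP8 training goes through hardware tensor-core kernels via scaled-matmul paths, this just shows where precision is lost vs kept):

    # Toy emulation of FP8 x FP8 -> FP32; not how the hardware kernels work.
    import torch

    x = torch.randn(256, 512)
    w = torch.randn(512, 128)

    # Quantizing the inputs to FP8 is the lossy step...
    x_fp8 = x.to(torch.float8_e4m3fn)
    w_fp8 = w.to(torch.float8_e4m3fn)

    # ...but the accumulation still happens in FP32, which is what matters.
    y = x_fp8.to(torch.float32) @ w_fp8.to(torch.float32)

    print("max abs error vs full precision:", (y - x @ w).abs().max().item())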


Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was doing low-precision pretraining, rather than just low-precision finetuning. I didn't realize you still need accumulation at a higher precision anyway.


Only GPU-poors run Q-GaLore and similar tricks.


Even the large cloud AI services are focusing on this, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than training, the smaller and more efficient they can get it, the better their bottom line.


For inference of course; the OP I replied to mentioned training though.


Do you have a source for that B200 price?


A bit surprised that they're using HBM2e, which is what the Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7 TB/s) to the H100 (3.4 TB/s), which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!

The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
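Back-of-envelope math on the stacks (the per-stack figures are approximate and depend on the exact pin speeds each vendor ships):

    # Approximate per-stack bandwidth; exact pin speeds vary by configuration.
    hbm2e_per_stack = 0.46   # TB/s (~3.6 Gbps pins x 1024-bit interface)
    hbm3_per_stack  = 0.67   # TB/s, roughly as configured on H100

    print("Gaudi 3:", round(8 * hbm2e_per_stack, 2), "TB/s from 8 x HBM2e")
    print("H100:   ", round(5 * hbm3_per_stack, 2), "TB/s from 5 x HBM3")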


> A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020.

This is one of Intel's secret recipes. They can take older tech and push it a little further to catch or surpass current-gen tech, until the current gen becomes easier/cheaper to produce/acquire/integrate.

They did it with their first quad-core processors by merging two dual-core dies (Q6xxx series), and by creating absurdly clocked single-core processors aimed at very niche market segments.

We hadn't seen it until now because they were asleep at the wheel, and then got knocked unconscious by AMD.


> This is one of the secret recipes of Intel

Any other examples of this? I remember the secret sauce being a process advantage over the competition, exactly the opposite of making old tech outperform the state of the art.


Intel's surprisingly fast 14nm processors come to mind. Born of necessity, as they couldn't get their 10nm and later 7nm processes working for years. Despite that, Intel managed to keep up in single-core performance with newer 7nm AMD chips, although at a much higher power draw.


That's because CPU performance cares less about transistor density and more about transistor performance, and 14nm drive strength was excellent.


For about half of the Intel 14nm era, there was no competition in any CPU market segment. Intel was able to keep improving their 14nm process and get better at branch prediction; moving more things into hardware is what kept performance improving.

This isn't the same as getting more out of the same over and over again.


Or today with Alder Lake and Raptor Lake (Refresh), where their CPUs made on Intel 7 (10nm) are on par with, if not slightly better than, AMD's offerings made on TSMC 5nm.


Back in the day, Intel was great for overclocking because all of their chips could run at significantly higher speeds and voltages than what was on the tin. This was because they basically just targeted the higher specs and sold the underperforming silicon as lower-tier products.

Don't know if this counts, but feels directionally similar.


Interesting.

Would you say this means Intel is "back," or just not completely dead?


No, this means Intel has woken up and is trying. There's no guarantee of anything. I'm more of an AMD person, but I want to see fierce competition, not monopoly, even if it's "my team's monopoly".


Well, the only reason AMD is doing well in CPUs is because Intel is sleeping. Otherwise it would be like Nvidia vs AMD (with less steroids, though).


EPYC is actually pretty good. It's true that Intel was sleeping, but AMD's new architecture is a beast. It has better memory support, more PCIe lanes, and better overall system latency and throughput.

Intel’s TDP problems and AVX clock issues leave a bitter taste in the mouth.


Oh dear, Q6600 was so bad, I regret ever owning it


Q6600 was quite good but E8400 was the best.


Q6600 is the spiritual successor to the ABIT BP6 Dual Celeron option: https://en.wikipedia.org/wiki/ABIT_BP6


ABIT was a legend in motherboards. I used their AN-7 Ultra and AN-8 Ultra. No newer board has given me the flexibility and capabilities of those series.

My latest ASUS was good enough, but I didn't (and probably won't) build any newer systems, so the ABITs will keep the crown.


The ABit BP6 bought me so much "cred" at LAN Parties back in the day - the only dual socket motherboard in the building, and paired with two Creative Voodoo 2 GPUs in SLI mode, that thing was a beast (for the late nineties).

I seem to recall that only Quake 2 or 3 was capable of actually using that second processor during a game, but that wasn't the point ;)


E8400 was actually good, yes


What? It was outstanding for the time, great price performance, and very tunable for clock / voltage IIRC.


Well, overclocked I don't know, but out-of-the-box single-core performance completely sucked. And in 2007, not enough applications were multithreaded to make up for it with the extra cores.

It was fun to play with, but you'd also expect a higher-end desktop to, e.g., handle x264 video, which was not the case (search for Q6600 on the VideoLAN forum). And depressingly, many cheaper CPUs of the time did it easily.


I owned one; it was a performant little chip. I developed my first multi-core stuff with it.

I loved it, to be honest.


65nm tolerated a lot of voltage. Fun thing to overclock.


Really? I never owned one, but even I remember the famous SLACR; I thought they were the hot item back then.


It was "hot" but using one as a main desktop in 2007 was depressing due to abysmal single-core performance.


I was just about to comment on this; apparently all HBM production capacity is tapped out until early 2026.


promising results, excited to try it out!

question on the perf benchmarks: why do all the results with 2 GPUs & DDP take longer than the single GPU case? Both benchmarks do the same amount of work, one training epoch, so this negative scaling is surprising.


So there are two main reasons:

1. DDP itself has an overhead, since it has to synchronize gradients at each training step: GPU0 and GPU1 have to exchange gradients (see the sketch below).

2. Huggingface doesn't seem to be well optimized for DDP, mainly due to inefficient data movement. We fixed that, and interestingly it's faster even on 1 GPU.
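If it helps, here's a minimal, generic sketch of where that per-step sync happens (this is plain torch DDP, not our actual training code; launch with torchrun --nproc_per_node=2):

    # Minimal DDP sketch: the per-step overhead is the gradient all-reduce
    # triggered inside loss.backward().
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(32, 1024, device=rank)
    loss = model(x).square().mean()
    loss.backward()      # <- gradients synchronized across GPUs here
    opt.step()
    dist.destroy_process_group()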


I agree that synchronization causes overhead, so 2x GPUs won't achieve the ideal 0.5x total runtime. But here, taking your Alpaca benchmark as an example, we are seeing 2x GPUs get 3.6x runtime with Huggingface, or 1.15x with Unsloth Max.

In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.

Could you share your benchmark code?


You can use QLoRA's official finetuning notebook https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zb... as a reference! Obviously I can't provide the code we have, but if you use the same datasets and the same settings (bsz = 2, ga = 4, max_grad_norm = 0.3, num_epochs = 1, seed = 3407, max_seq_len = 2048) you should be able to replicate it.
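For reference, those settings map roughly onto Hugging Face's TrainingArguments like this (just the hyperparameters; the model/dataset loading and trainer wiring from the notebook are omitted, and the output_dir name is made up):

    # Sketch of the benchmark hyperparameters only; not the full training script.
    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="qlora-alpaca-benchmark",   # hypothetical output dir
        per_device_train_batch_size=2,         # bsz = 2
        gradient_accumulation_steps=4,         # ga = 4
        max_grad_norm=0.3,
        num_train_epochs=1,
        seed=3407,
    )
    # max_seq_len = 2048 is passed to the trainer/tokenizer, not set here.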


I can't universally agree with the headline statement. The article focuses on the pros of SRAM, which are real -- peak bandwidth (e.g. 5 TB/s out of the H100’s L2) and lower energy per bit transferred (the rule of thumb I remember is ~10x lower than HBM).

But the companies that already bet big on SRAM in AI, Cerebras in particular and Graphcore to a lesser extent, aren't obviously running away with the AI performance crown. It seems like LLMs need more memory capacity than anyone expected, to the point where even HBM stacks somewhat limit the scale of current models. Maybe the next version of Cerebras WSE can get closer to 100GB of on-chip memory and serve some useful LLMs very efficiently - excited to see what they can do with more modern processes!

I think innovation in SRAM packaging, like AMD’s stacking in “3D V-Cache”, is also promising and might play a larger role in AI accelerators going forward. But it’s important to note that while both are SRAM, performance of AMD’s stacked L3 is not yet comparable to a GPU’s centralized L2 - it’s more like HBM today in latency and bandwidth.


I like this review:

https://www.lighterra.com/papers/modernmicroprocessors/

A bit dated, but the major ideas used in current CPUs are all covered!


This looks perfect for my needs, thanks a bunch!


> moves to Austin because it is less “vulnerable to climate change”

> commutes by plane

hmmm


I was cackling at this part. How could anyone in Texas say that with a straight face after the big ice storm two years ago? The climate is changing everywhere. The idea that you can move away from the climate is simply deluded.


I'll just move outside the environment.

https://youtu.be/3m5qxZm_JqM


It's not necessarily getting worse everywhere, which I would guess is the point.


> It's not necessarily getting worse everywhere

Among the many issues with this statement: It takes food security for granted.


The difference is presumably between more extreme weather and becoming submerged.


My thoughts exactly:

>“By the time I get to bed it’s going to be close to a 24-hour day,” says Frank Croasdale of some of the crazier days he commutes to his physical-therapy practice in Redondo Beach, Calif., from his home in Austin, Texas. Still, the arrangement “is a win-win.”

..."except for the environment of course", I was thinking when reading that.


Think globally, act (kind of) locally.


Lolsob


A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100.

The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue, I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.


It is interesting that Hopper isn't widely available yet.

I have seen some benchmarks from academia but nothing in the private sector.

I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada as long as possible.

Not having any competition whatsoever means Nvidia can release what they like when they like.


The question is, do they not have much production, or is OpenAI and Microsoft buying every single one they produce?


Why bother when you can get cryptobros paying way over MSRP for 3090s?


GPU mining died last year.

There's so little liquidity post-merge that it's only worth mining as a way to launder stolen electricity.

The bitcoin people still waste raw materials, and prices are relatively sticky with so few suppliers and a backlog of demand, but we've already seen prices drop heavily since then.


Right, that's why Nvidia is actually trying again. The money printer has run out of ink.


Not just cryptobros. A100s are the current top of the line and it’s hard to find them available on AWS and Lambda. Vast.AI has plenty if you trust renting from a stranger.

AMD really needs to pick up the pace and make a solid competitive offering in deep learning. They’re slowly getting there but they are at least 2 generations out.


I would take a huge performance hit to just not deal with Nvidia drivers. Unless things have changed, it is still not really possible to operate on AMD hardware without a list of gotchas.


It's still basically impossible to find MI200s in the cloud.

On desktops, only the 7000 series is kinda competitive for AI in particular, and you have to go out of your way to get it running fast in PyTorch. The 6000 and 5000 series just weren't designed for AI.


It's crazy to me that no other hardware company has sought to compete for the deep learning training/inference market yet ...

The existing ecosystems (CUDA, PyTorch, etc.) are all pretty garbage anyway -- aside from the massive number of tutorials, it doesn't seem like it would actually be hard to build a vertically integrated competitor ecosystem ... it feels a little like the rise of Rails to me -- is a million articles about how to build a blog engine really that deep a moat?


How could their moat possibly be deeper?

First of all you need hardware with cutting-edge chips. Chips which can only be supplied by TSMC and Samsung.

Then you need the software ranging all the way from the firmware and driver over something analogous to CUDA with libraries like cuDNN, cuBLAS and many others to integrations into pytorch and tensorflow.

And none of that will come for free, like it came to Nvidia. Nvidia built CUDA and people built their DL frameworks around it in the last decade, but nobody will invest their time into doing the same for a competitor, when they could just do their research on Nvidia hardware instead.

Realistically it's up to AMD or Intel.


There will probably be Chinese options as well. China has an incentive to provide a domestic competitor due to deteriorating relations with the U.S.


They certainly will have to try, since nvidia is banned from exporting A100 and H100 chips.


They do ship the A800 and H800 to China. The H800 is an H100 with much lower interconnect bandwidth. The A800 is likewise a cut-down version of the A100.


No other company has sought this?

https://www.cerebras.net/ has innovative technology, has actual customers, and is gaining a foothold in software-system stacks by integrating their platform into the OpenXLA compiler stack.


There are tons of companies trying; they just aren't succeeding.


Yes, I was expecting a RAM-doubled edition of the H100; this is just a higher-binned version of the same part.

I got an email from Vultr saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get them soon.


You can also join a pair of regular PCIe H100 GPUs with an NVLink bridge. So that topology is not so new either.


>H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find

You can safely assume an entity bought as many as they could.


Yes, we all place a lot of trust in cloud vendors today. FHE is a way to move the trust boundary back to the client - let the server be as malicious or insecure as it wants. Raw compute could even become much cheaper, since any machine anywhere can be a supplier in the market for untrusted CPU time.


Thanks for the feedback - I understand your hesitation. We don't just want to advertise guarantees; we want you to never have to trust third-party servers again. Fully homomorphic encryption makes this possible by never letting sensitive data leave your device unencrypted. Our job is to make this new cryptography a web standard as ubiquitous as TLS.

