More

trsohmers · 2025-05-18T21:46:14 1747604774

Only with the oscillation overthruster flag enabled.

trsohmers · 2025-04-01T04:32:46 1743481966

I was put on it in 2015 after an acquaintance of mine that was previously on the list recommended me… I only heard from Forbes a few days before the list came out, they asked me for a photo and asked if I approved the 2 sentence blurb they prepared, and that was it. For years afterwards they would try to get me to come to their events, but I never had any interest, and I assume that was how they made money… but I never paid anything to be on the list or had real interest in being on it, and I don’t think it led to anything other than my technically illiterate parents thinking that it was impressive.

trsohmers · 2024-11-19T06:21:42 1731997302

Based on their S1 filing and public statements, the average cost per WSE system for their (~90% of their total revenue) largest customer is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and due to power and additional support equipment take up an entire rack.

The other thing people don’t seem to be getting in this thread that just to hold the weights for 405B at FP16 requires 19 of their systems since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B

danpalmer · 2024-11-19T06:25:24 1731997524

Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23kw, I'd expect the same level of compute in GPUs to be quite a lot more than that, or at least higher power density.

YetAnotherNick · 2024-11-19T06:42:08 1731998528

You get ~400TFLOP/s in H100 for 350W. You need (2 * token/s * param count) FLOP/s. For 405b, 969tok/s you just need 784 TFLOP/s which is just 2 H100s.

The limiting factor with GPU for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or 200 H100s.

latchkey · 2024-11-19T11:49:01 1732016941

Memory bandwidth and memory size. Along with power/cooling density.

Hence why you see AMD's MI325x coming out with 256GB HBM3e, but it is the same FLOPs as a 300x. 6TB/s too, which outperforms H200's, by a lot.

You can see the direction AMD is going with this...

https://www.amd.com/en/products/accelerators/instinct/mi300/...

Const-me · 2024-11-19T12:35:53 1732019753

> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.

ryao · 2024-11-20T04:58:42 1732078722

They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846

ryao · 2024-11-20T05:01:41 1732078901

Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.

YetAnotherNick · 2024-11-20T12:07:09 1732104429

8 H100 could achieve lot more than 1000 token/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

ryao · 2024-11-20T19:50:07 1732132207

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

YetAnotherNick · 2024-11-21T04:01:24 1732161684

This two sources[1][2] shows 1500-2500 token/per second on 8*H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...

meowface · 2024-11-19T08:52:33 1732006353

Thank you for the breakdown. Bit of an emotional journey.

"$500 in the future...? Oh, $30 million now, so that might be a while..."

jamalaramala · 2024-11-19T09:46:39 1732009599

It took 30 years for computers go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.

I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...

stefs · 2024-11-19T12:10:17 1732018217

well, we can AI inference on our desktops for $500 today, just with smaller models and far slower.

ryao · 2024-11-20T20:00:38 1732132838

There is no need to use smaller models. You can run the biggest models such as llama 3.1 405B on a fairly low end desktop today:

https://github.com/lyogavin/airllm

However, it will be far slower as you said.

ryao · 2024-11-20T05:03:58 1732079038

From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U. That said, you would need at least 460 kW power to run the setup you described.

As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.

sumedh · 2024-11-19T10:24:41 1732011881

> Based on their S1 filing and public statements

Is it a good stock to buy :)

petra · 2024-11-19T10:29:49 1732012189

Given those details they seem not much better on cost per token than nvidia based systems.

trsohmers · 2024-10-15T19:56:48 1729022208

Do you think that the 16k GPUs get used once and then are thrown away? Llama 405B was trained over 56 days on the 16k GPUs; if I round that up to 60 days and assume the current mainstream hourly rate of $2/H100/hour from the Neoclouds (which are obviously making margin), that comes out to a total cost of ~$47M. Obviously Meta is training a lot of models using their GPU equipment, and would expect it to be in service for at least 3 years, and their cost is obviously less than what the public pricing on clouds is.

lossolo · 2024-10-15T21:35:59 1729028159

And Meta is using a lot of GPUs for offline ML and online ML features on Instagram, FB etc. So nothing is "wasted".

trsohmers · 2024-08-30T01:48:12 1724982492

+1 this commenter. I just visited the UK for the first time at the beginning of this month and had a fantastic ~3 hours at Bletchley Park, but felt I had to cram TNMOC and the amazing Colossus live demonstration (where I asked a million questions) and everything else in the museum in the 90 minutes I was there. If I assume other HN readers are like me, I would dedicate at least 2.5-3 hours for TNMOC to actually get a chance to actually see and play around with their extensive collection of vintage machines.

trsohmers · on July 16, 2024

They meant that there is no support for Codestral Mamba for llama.cpp yet.

trsohmers · on May 30, 2024

Founder of REX Computing here; I highly recommend checking out my interview on the Microarch Club podcast linked elsewhere on the thread; will also answer questions on this thread if anyone has them.

crest · on May 30, 2024

The teaser reminds me a lot of other *failed* high performance/high efficiency architecture redesigns that failed because of the unreasonable effort required to squeeze out a useful fraction of the promised gains e.g. Transputer and Cell. Can you link to written documentation of how existing code can be ported? I doubt you can just recompile ffmpeg or libx264, but level of toolchain support can early adopters expect? Does it require manually partitioning code+data and mapping it to the on-chip network topology?

trsohmers · on May 30, 2024

We had a basic LLVM backend that supported a slightly modified clang frontend and a basic ABI. We tried to make it drastically easier for both the programmer and compiler to handle memory by having all memory (code+data) be part of a global flat address space across the chip, with guarantees being made to the compiler by the NoC on the latency of all memory accesses across one or multiple chips. We tested this with very small programs that could fit in the local memory of up to two chips (128KB of memory), but in theory it could have scaled up to the 64 bit address space limit. Compilation time for programs was long, but fully automated, specifically to improve upon problems faced by Cell and other scratchpad memory architectures… some of our original funding in 2015 from DARPA was actually for automated scratchpad memory management techniques on Texas Instruments DSPs and Cell (our paper: https://dl.acm.org/doi/pdf/10.1145/2818950.2818966)

This was all designed a decade ago, and REX has been in effectively hibernation since the end of 2017 after successfully taping out our 16 core test chip back in 2016, but being unable to raise additional funding to continue. I have continued to work on architectures that have leveraged scratchpad memories in different ways, including on cryptocurrency and machine learning ASICs, including at my current startup, Positron AI (https://positron.ai)

suranyami · on May 30, 2024

This is inspiring stuff.

Was extremely interested in the Inmos Transputer in the 80s. Seems like an idea way ahead of its time… a bit like REX.

I find the parallels in design with the Actor concurrency model of Erlang, Elixir and Transputer/REX are very compelling.

Really hope something happens with this project or some spinoff from it.

The current interest in RISC-V is testament to the fact that it may still be viable.

I wish you great success. Wish there was a way to sponsor or crowd-source fund this.

CalChris · on May 30, 2024

VLIW has low latency. Why is that important for an inference engine?

trsohmers · on May 9, 2024

This is a lesson that like all good Hitchhikers, you should always carry a towel.

justahuman74 · on May 9, 2024

OP should have posted this on Towel Day

trsohmers · on March 12, 2024

Significantly more than that; MFN pricing for NVIDIA DGX H100 (which has been getting priority supply allocation, so many have been suckered into buying them in order to get fast delivery) is ~$309k, while a basically equivalent HGX H100 system is ~$250k, coming to a price per GPU at the full server level being ~$31.5k. With Meta’s custom OCP systems integrating the SXM baseboards from NVIDIA, my guess is that their cost per GPU would be in the ~$23-$25k range.

fuddle · on March 12, 2024

350,000 NVIDIA H100 x $23k = $8b :0

verticalscaler · on March 12, 2024

Wait till you find out how much they spent on VR.

It is a real loophole in the economy. If you're a trillion dollar company the market will insist you set such sums on fire just to be in the race for $current-hype. If they do it drives their market cap higher still and if they don't they risk being considered un-innovative and therefore doomed to irrelevancy and the market cap will spiral downwards.

Sort of reminds me of The Producers.

oblio · on March 12, 2024

The thing is, this could be considered basic research, right? Basic research IS setting money on fire until (and if) that basic research turns into TCP/IP, Ethernet and the Internet.

verticalscaler · on March 12, 2024

I wish.

Funnily enough Arpanet and all that Xerox stuff were like <$50 million (inflation adjusted!) total. Some real forward thinkers were able to work the system by breaking off a tiny pittance of a much larger budget.

Where as I think this more appropriately can be considered the meta PR budget. They simply can't not spend it, would look bad for Wall Street. Have to keep up with the herd.

infecto · on March 13, 2024

Funny you pick a company that has very little to answer to the markets, out of all the large tech companies, META is the rare one that does not need to answer because Zuckerberg controls the company.

throwaway2037 · on March 13, 2024

    > Funnily enough Arpanet and all that Xerox stuff were like <$50 million (inflation adjusted!) total.

That doesn't say much. The industry was in utter infancy. How much do you think it cost to move Ethernet from 100Mbit/sec to 1GBbit/sec to 10GB to 100GB to 400GB to 800GB? At least one or two orders of magnitude.

How about the cost to build a fab for the Intel 8088 versus a fab that produces 5nm chips running @ 5GHz. Again, at least one or two orders of magnitude.

edmundsauto · on March 13, 2024

This suffers from hindsight bias, at the time it was impossible to know if Arpanet or flying cars was the path forward. A better comparison would be the total sum of investment : payoff ratio, and is not something we can see from where we are now. Only in the future does it make sense to evaluate the success of something. Unfortunately, comparison between eras is difficult to do fairly because conditions are so different between now and Xerox.

lotsofpulp · on March 12, 2024

> If you're a trillion dollar company the market will insist you set such sums on fire just to be in the race for $current-hype. If they do it drives their market cap higher still and if they don't they risk being considered un-innovative and therefore doomed to irrelevancy and the market cap will spiral downwards.

You don’t think earning increasing amounts of tens of billions of dollars in net income per year at some of the highest profit margins in the world at that size for 10+ years has anything to do with market cap?

verticalscaler · on March 13, 2024

$1T Market Cap lets it be known it will invest $10B a year into $current-hype that will change everything. P/E loosens speculatively on sudden new unbounded potential, Market Cap $1.1T. Hype funded. PR as innovator cemented.

throwaway2037 · on March 13, 2024

If you look at the R&D expenditure of Apple, it is mindboggling.

https://www.macrotrends.net/stocks/charts/AAPL/apple/researc...

Roughly 30B USD per year. And what are we getting? Slightly slimmer phones and 3500USD AR/VR headsets?

throwaway2037 · on March 13, 2024

> Market Cap $1.1T. Hype funded.

I'm confused. How does your stock price, which determines market cat, affect your cashflow to fund R&D? It does not.

niels_bom · on March 15, 2024

Capitalism is so weird sometimes.

bigcat12345678 · on March 13, 2024

Would you kindly provide sources to the numbers? What is MFN?

Thanks! (Your number is consistent with what I hear of, but I never managed to get solid sources to back them up)

trsohmers · on March 11, 2024

The quote from the linked press release is that they do training on TPUv4, while inference is running on GPUs. I have also heard this separately from people associated with Midjourney recently, and that they solely do training on TPUs.