I believe pricing was mid 6 figures per machine. They're also like 8U and water ...

trsohmers · 2024-11-19T06:21:42 1731997302

Based on their S1 filing and public statements, the average cost per WSE system for their largest customer (~90% of their total revenue) is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and, due to power and additional support equipment, take up an entire rack.

The other thing people don’t seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
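A rough sketch of that arithmetic, for anyone who wants to check it. The per-system SRAM figure (44 GB) is an assumption that is not stated in this thread; the power and price figures are the ones quoted above, and the variable names are only illustrative.

```python
import math

# Figures quoted in the comment above.
PARAMS = 405e9            # Llama 3.1 405B parameter count
BYTES_PER_PARAM = 2       # FP16
RACK_POWER_KW = 23        # full rack, including support equipment
PRICE_LOW_USD = 1.36e6    # average price paid by the largest customer
PRICE_HIGH_USD = 2.5e6    # rumored "retail" price

# Assumption (not from the thread): 44 GB of on-chip SRAM per system.
SRAM_PER_SYSTEM_BYTES = 44e9


def systems_for_weights(params, bytes_per_param, sram_bytes):
    """Minimum number of systems whose combined SRAM holds the weights."""
    return math.ceil(params * bytes_per_param / sram_bytes)


weights_only = systems_for_weights(PARAMS, BYTES_PER_PARAM, SRAM_PER_SYSTEM_BYTES)
deployed = 20  # rounded up, as in the comment, for program code + KV cache

power_kw = deployed * RACK_POWER_KW
cost_low = deployed * PRICE_LOW_USD
cost_high = deployed * PRICE_HIGH_USD

print(f"systems for weights only: {weights_only}")
print(f"power for {deployed} racks: {power_kw} kW")
print(f"cost range: ${cost_low / 1e6:.1f}M to ${cost_high / 1e6:.1f}M")
```

Under these assumptions the weights alone need 19 systems, and 20 racks land at 460 kW and somewhere between the two price points, which brackets the ~$30M estimate.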

danpalmer · 2024-11-19T06:25:24 1731997524

Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23kw, I'd expect the same level of compute in GPUs to be quite a lot more than that, or at least higher power density.

YetAnotherNick · 2024-11-19T06:42:08 1731998528

You get ~400 TFLOP/s from an H100 for 350W. You need (2 * token/s * param count) FLOP/s. For 405B at 969 tok/s you need only ~785 TFLOP/s, which is just 2 H100s.

The limiting factor with GPUs for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s of memory bandwidth, or about 200 H100s.
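The two estimates above can be reproduced with a few lines. This is only a sketch: the 400 TFLOP/s figure is the one quoted in the comment, the 2 TB/s per-GPU bandwidth is an assumption (it is roughly what "about 200 H100s" implies), and the bandwidth formula assumes every weight is read once per generated token, i.e. batch size 1.

```python
import math

PARAMS = 405e9        # Llama 3.1 405B
TOKENS_PER_S = 969    # throughput discussed in the thread
H100_FLOPS = 400e12   # ~400 TFLOP/s, as quoted in the comment
H100_BW = 2e12        # assumption: ~2 TB/s memory bandwidth per H100


def required_flops(params, tokens_per_s):
    """Roughly 2 FLOPs (multiply + add) per parameter per token."""
    return 2 * params * tokens_per_s


def required_bandwidth(params, tokens_per_s, bytes_per_param=1):
    """Bytes/s if all weights are streamed once per token (batch size 1)."""
    return params * bytes_per_param * tokens_per_s


flops = required_flops(PARAMS, TOKENS_PER_S)
bandwidth = required_bandwidth(PARAMS, TOKENS_PER_S)  # int8: 1 byte/param

gpus_for_compute = math.ceil(flops / H100_FLOPS)
gpus_for_bandwidth = math.ceil(bandwidth / H100_BW)

print(f"compute:   {flops / 1e12:.0f} TFLOP/s -> {gpus_for_compute} GPUs")
print(f"bandwidth: {bandwidth / 1e12:.0f} TB/s -> {gpus_for_bandwidth} GPUs")
```

The gap between the two GPU counts is the point of the comment: at batch size 1, bandwidth rather than compute sets the hardware requirement.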

latchkey · 2024-11-19T11:49:01 1732016941

Memory bandwidth and memory size. Along with power/cooling density.

Hence why you see AMD's MI325X coming out with 256GB of HBM3e while keeping the same FLOPs as the MI300X. 6TB/s too, which outperforms the H200's by a lot.

You can see the direction AMD is going with this...

https://www.amd.com/en/products/accelerators/instinct/mi300/...

Const-me · 2024-11-19T12:35:53 1732019753

> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.
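A minimal roofline-style sketch of that batching argument. It is a simplification under stated assumptions: the device figures (400 TFLOP/s, 2 TB/s) are illustrative, int8 weights are read once per decode step and shared by the whole batch, and KV-cache traffic, memory capacity and interconnect are ignored.

```python
def decode_step_time(batch, params, bytes_per_param, flops, bandwidth):
    """Time for one decode step that produces one token per sequence in the batch."""
    t_memory = params * bytes_per_param / bandwidth  # weights read once per step
    t_compute = 2 * params * batch / flops           # 2 FLOPs per param per token
    return max(t_memory, t_compute)


def tokens_per_second(batch, params, bytes_per_param, flops, bandwidth):
    step = decode_step_time(batch, params, bytes_per_param, flops, bandwidth)
    return batch / step


def crossover_batch(bytes_per_param, flops, bandwidth):
    """Batch size at which the step stops being memory bound."""
    return flops * bytes_per_param / (2 * bandwidth)


# Illustrative device figures (assumptions, see the lead-in).
PARAMS, BYTES, FLOPS, BW = 405e9, 1, 400e12, 2e12

print("crossover batch:", crossover_batch(BYTES, FLOPS, BW))
for b in (1, 8, 20, 100, 200):
    print(b, round(tokens_per_second(b, PARAMS, BYTES, FLOPS, BW), 1))
```

Below the crossover, aggregate tokens/s grows linearly with batch size for the same memory traffic; above it, throughput flattens because compute becomes the bottleneck.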

ryao · 2024-11-20T04:58:42 1732078722

They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846

ryao · 2024-11-20T05:01:41 1732078901

Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.

YetAnotherNick · 2024-11-20T12:07:09 1732104429

8 H100s could achieve a lot more than 1000 tokens/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

ryao · 2024-11-20T19:50:07 1732132207

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

YetAnotherNick · 2024-11-21T04:01:24 1732161684

These two sources [1][2] show 1500-2500 tokens per second on 8x H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...

meowface · 2024-11-19T08:52:33 1732006353

Thank you for the breakdown. Bit of an emotional journey.

"$500 in the future...? Oh, $30 million now, so that might be a while..."

jamalaramala · 2024-11-19T09:46:39 1732009599

It took 30 years for computers to go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.

I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...

stefs · 2024-11-19T12:10:17 1732018217

well, we can run AI inference on our desktops for $500 today, just with smaller models and far slower.

ryao · 2024-11-20T20:00:38 1732132838

There is no need to use smaller models. You can run the biggest models such as llama 3.1 405B on a fairly low end desktop today:

https://github.com/lyogavin/airllm

However, it will be far slower as you said.

ryao · 2024-11-20T05:03:58 1732079038

From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U chassis. That said, you would need at least 460 kW of power to run the setup you described.

As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.

sumedh · 2024-11-19T10:24:41 1732011881

> Based on their S1 filing and public statements

Is it a good stock to buy :)

petra · 2024-11-19T10:29:49 1732012189

Given those details, they don't seem much better on cost per token than Nvidia-based systems.

bboygravity · 2024-11-19T06:10:01 1731996601

That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.

schoen · 2024-11-19T07:28:50 1732001330

How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?

My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that the power and corresponding cooling requirements for the CPUs have tended to go up as well. I don't really know what power per operation has looked like there. (I guess it's clearly improved, though, because it seems like the power consumption of a desktop PC has only increased by a single order of magnitude, while the computational capacity has increased by more than that.)

A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?

chaxor · 2024-11-19T08:00:42 1732003242

Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing Macbooks with an astounding 8GB of RAM. Data center compute is arguable, as it tends to be catered to some niche, making it confusing (Cerebras as an example vs GPU datacenters vs more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.

saagarjha · 2024-11-19T08:59:33 1732006773

Apple doesn’t sell those anymore.

chaxor · 2024-11-19T15:41:23 1732030883

Aw man, are they selling only 4GB ones now?

More seriously, even 16GB was essentially the 'norm' in consumer PCs about 15 years ago.

dgfl · 2024-11-19T09:38:48 1732009128

Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.

Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) reducing transistor size (by following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to improve transistor performance further. These techniques added more steps during manufacturing, but they were inexpensive enough that $/transistor still went down. In the last few years, we've had to pull off ever more expensive tricks, which stopped the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.

In any case, higher performance transistors mean that you can get the same functionality for less power and a smaller area, meaning that iso-functionality chips are cheaper to build in bulk. This is especially true for older nodes, e.g. look at the absurdly low price of most microcontrollers.

On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking). Cerebras' innovation was in making a wafer-scale chip possible, which is conventionally hard due to unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale like any other circuit produced so far.

It may for sure drop in price in the future, especially once it gets obsolete. But I don't expect it to ever reach consumer level prices.

adrian_b · 2024-11-19T11:57:56 1732017476

Wafer-scale chips have been attempted for many decades, but none of the previous attempts before Cerebras has resulted in a successful commercial product.

The main reason why Cerebras has succeeded and the previous attempts have failed is not technical, but the existence of market demand.

Before ML/AI training and inference, there has been no application where wafer-scale chips could provide enough additional performance to make their high cost worthwhile.

ryao · 2024-11-20T05:06:23 1732079183

Cerebras has a patent on the technique used to etch across scribe lines. Is there any prior work that would invalidate that patent?

By the way, I am a software developer, so you will not see me challenging their patent. I am just curious.

dheera · 2024-11-19T07:59:43 1732003183

It will also mean 405B models will be uninteresting in 3 to 5 years if we follow the curve we've been on for the past decades.

int_19h · 2024-11-19T09:13:49 1732007629

I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but many of the more practical applications of AI that we see today don't run on today's cutting-edge models, either. We're always going to have a certain compute budget, and if a smaller model does the job fine, why wouldn't you use it, and use the rest for something else (or use all of it to run the smaller model faster)?

initplus · 2024-11-19T06:01:11 1731996071

Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...

Thing is nearly all cooling. And look at the diameter on the water cooling pipes. Airflow guides on the fans are solid steel. Apparently the chip itself measures 21.5 cm on a side (roughly 462 cm^2). Insane.

szundi · 2024-11-19T08:41:45 1732005705

Parent wishes 70b not 405b though

wkat4242 · 2024-11-19T09:06:04 1732007164

Yeah but what is in a 4090 is also comparable to a whole rack of servers a decade ago. The tech will get smaller.