> But that isn't GPU memory right? On the Mac it is.

They call it that but it's really LPDDR5, i.e. normal DRAM, using a wide memory bus. Which is the same thing servers do.

The base M3, with "GPU memory", has 100GB/s, which is less than even a cheap desktop PC with dual channel DDR5-6400. The M3 Pro has 150GB/s. By comparison a five year old Epyc system has 8 channels of DDR4-3200 with more than 200GB/s per socket. The M3 Max has 300-400GB/s. Current generation servers have 12 channels of DDR5-4800 with 460GB/s per socket, and support multi-socket systems.
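
For reference, these figures follow directly from channel count and transfer rate. A rough Python sketch (assuming 64-bit channels and quoting theoretical peaks, not sustained throughput):

    # Peak DRAM bandwidth ~= channels * 8 bytes per transfer * MT/s.
    # These are theoretical peaks; sustained bandwidth is lower in practice.
    def peak_gbs(channels, mts, bytes_per_transfer=8):
        return channels * bytes_per_transfer * mts / 1000  # GB/s

    print(peak_gbs(2, 6400))   # dual channel DDR5-6400 desktop -> 102.4
    print(peak_gbs(8, 3200))   # 8-channel DDR4-3200 Epyc       -> 204.8
    print(peak_gbs(12, 4800))  # 12-channel DDR5-4800 server    -> 460.8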

The Mac Studio has 800GB/s, which is almost as much as a modern dual-socket system (for about the same price), but it's not obvious it has enough compute to actually use that bandwidth.




But if you need two consumer GPUs, it seems to me the reason is not compute capability but being able to fit the model into the combined VRAM of both. So how exactly does having lots of memory on a server help with this, when it isn't memory the GPU can use, unlike on Apple Silicon computers?


The problem with LLMs is that the models are large but the entire model has to be read for each token. If the model is 40GB and you have 80GB/s of memory bandwidth, you can't get more than two tokens per second. That's about what you get from running it on the CPU of a normal desktop PC with dual channel DDR5-5200. You can run arbitrarily large models by just adding memory but it's not very fast.
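
Put differently, memory bandwidth sets a hard ceiling on decode speed. A back-of-the-envelope sketch (assuming all weights are streamed once per generated token, ignoring KV-cache traffic and other overhead):

    # Upper bound on tokens/second when decoding is memory-bound.
    def max_tok_per_s(model_gb, bandwidth_gbs):
        return bandwidth_gbs / model_gb

    print(max_tok_per_s(40, 80))   # 40GB model, ~80GB/s desktop RAM -> 2.0 tok/s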

GPUs have a lot of memory bandwidth. For example, the RTX-4090 has just over 1000GB/s, so a 40GB model could get up to 25 tokens/second. Except that the RTX-4090 only has 24GB of memory, so a 40GB model doesn't fit in one and then you need two of them. For a 128GB model you'd need six of them. But they're each $2000, so that sucks.
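
The card count is just the weights divided by VRAM, rounded up, and the same rough bandwidth ceiling from the sketch above applies:

    import math

    # Minimum number of 24GB cards needed just to hold the weights.
    def cards_needed(model_gb, vram_gb=24):
        return math.ceil(model_gb / vram_gb)

    print(cards_needed(40))    # -> 2
    print(cards_needed(128))   # -> 6
    print(1000 / 40)           # RTX-4090 ceiling for a 40GB model -> 25 tok/s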

Servers with a lot of memory channels have a decent amount of memory bandwidth, not as much as high-end GPUs but still several times more than desktop PCs, so the performance is kind of medium. Meanwhile they support copious amounts of cheap commodity RAM. There is no GPU, you just run it on a CPU with a lot of cores and memory channels.


Got it, thanks!


It's fast enough to do realtime (7 tok/s) chat with 120b models.

And yes, of course it's not magic, and in principle there's no reason why a dedicated LLM-box with heaps of fast DDR5 couldn't cost less. But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself. The beauty of Mac Studio is that you just plug it in, and it works.


> It's fast enough to do realtime (7 tok/s) chat with 120b models.

Please list quantization for benchmarks. I'm assuming that's not the full model because that would need 256GB and I don't see a Studio model with that much memory, but q8 doubles performance and q4 quadruples it (with corresponding loss of quality).
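
To make the scaling concrete, a rough sketch (assuming ~2 bytes per parameter at fp16, ~1 at q8, ~0.5 at q4, and ignoring the KV cache and quantization block overhead):

    # Approximate weight footprint and bandwidth-bound ceiling for a 120B model,
    # using the 800GB/s Studio figure quoted elsewhere in the thread.
    bandwidth_gbs = 800
    for name, bytes_per_param in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
        size_gb = 120 * bytes_per_param
        print(name, size_gb, "GB ->", round(bandwidth_gbs / size_gb, 1), "tok/s max")

    # fp16: ~240GB -> ~3.3 tok/s max (doesn't fit in any Studio configuration)
    # q8:   ~120GB -> ~6.7 tok/s max
    # q4:    ~60GB -> ~13.3 tok/s max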

> But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself.

You can just buy a complete server from a vendor or eBay, but this costs more because they'll try to constrain you to a particular configuration that includes things you don't need, or overcharge for RAM etc. Which is basically the same thing Apple does.

Whereas you can buy the barebones machine and then put components in it, which takes like fifteen minutes but can save you a thousand bucks.


6.3 tok/s has been demonstrated on q4_0 Falcon 180B on the 192GB Mac Studio: https://x.com/ggerganov/status/1699791226780975439?s=46&t=ru...


A lot of people are reporting tokens per second at small context sizes without discussing how slow it gets at larger ones. In some cases they're also not mentioning time to first token. If you dig around r/LocalLLaMA you can find some posts with better benchmarking.


That is with 4-bit quantization. For practical purposes I don't see the point of running anything higher than that for inference.


That's interesting though, because it implies the machine is compute-bound. A 4-bit 120B model is ~60GB, so you should get ~13 tokens/second out of 800GB/s if it were memory-bound. 7 tok/s implies you're getting ~420GB/s effective.

And the Max has half as many cores as the Ultra, implying it would be compute-bound too.
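
The implied figure falls straight out of the same bandwidth arithmetic; a quick check (assuming ~60GB of weights read per token):

    # Back out the effective memory bandwidth from the observed decode speed.
    model_gb = 60            # ~4-bit 120B weights
    observed_tok_s = 7
    print(800 / model_gb)              # memory-bound ceiling  -> ~13.3 tok/s
    print(observed_tok_s * model_gb)   # implied effective bandwidth -> 420 GB/s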



