
But that isn't GPU memory right? On the Mac it is.



> But that isn't GPU memory right? On the Mac it is.

They call it that but it's really LPDDR5, i.e. normal DRAM, using a wide memory bus. Which is the same thing servers do.

The base M3, with "GPU memory", has 100GB/s, which is less than even a cheap desktop PC with dual channel DDR5-6400. The M3 Pro has 150GB/s. By comparison a five year old Epyc system has 8 channels of DDR4-3200 with more than 200GB/s per socket. The M3 Max has 300-400GB/s. Current generation servers have 12 channels of DDR5-4800 with 460GB/s per socket, and support multi-socket systems.

The Studio has 800GB/s, which is almost as much as a modern dual-socket system (for about the same price), but it's not obvious it has enough compute resources to actually use that.
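(For reference, those peak numbers all fall out of the same back-of-the-envelope formula: channels × transfer rate × 8 bytes per 64-bit channel. A rough sketch, just to show where the figures come from:)

    # Peak DRAM bandwidth: channels * MT/s * 8 bytes per 64-bit channel
    def peak_gb_s(channels, mt_per_s):
        return channels * mt_per_s * 8 / 1000

    print(peak_gb_s(2, 6400))   # desktop, dual-channel DDR5-6400: ~102 GB/s
    print(peak_gb_s(8, 3200))   # older Epyc, 8x DDR4-3200:        ~205 GB/s
    print(peak_gb_s(12, 4800))  # current server, 12x DDR5-4800:   ~461 GB/s
    print(peak_gb_s(16, 6400))  # Studio Ultra, LPDDR5-6400 on a ~1024-bit
                                # bus, i.e. roughly 16 such channels: ~819 GB/s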


But if you need to get 2x consumer GPUs, it seems to me the reason is not the compute capability but rather being able to fit the model in the VRAM of both. So how exactly does having lots of memory on a server help with this, when it's not memory the GPU can use, unlike on Apple Silicon computers?


The problem with LLMs is that the models are large but the entire model has to be read for each token. If the model is 40GB and you have 80GB/s of memory bandwidth, you can't get more than two tokens per second. That's about what you get from running it on the CPU of a normal desktop PC with dual channel DDR5-5200. You can run arbitrarily large models by just adding memory but it's not very fast.
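Put as a formula, the upper bound is tokens/sec ≈ memory bandwidth / model size, since the weights have to be streamed in once per generated token. A rough sketch (real throughput is lower once you add the KV cache and other overhead):

    # Bandwidth-bound ceiling on decode speed: weights read once per token.
    def max_tok_per_s(bandwidth_gb_s, model_gb):
        return bandwidth_gb_s / model_gb

    print(max_tok_per_s(80, 40))   # dual-channel desktop DDR5: ~2 tok/s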

GPUs have a lot of memory bandwidth. For example, the RTX-4090 has just over 1000GB/s, so a 40GB model could get up to 25 tokens/second. Except that the RTX-4090 only has 24GB of memory, so a 40GB model doesn't fit in one and then you need two of them. For a 128GB model you'd need six of them. But they're each $2000, so that sucks.
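The card count is just the model size divided by VRAM per card, rounded up (a sketch, ignoring the extra room the KV cache needs):

    import math

    # Number of 24GB cards needed just to hold the weights.
    def cards_needed(model_gb, vram_gb=24):
        return math.ceil(model_gb / vram_gb)

    print(cards_needed(40))          # 2 cards
    print(cards_needed(128))         # 6 cards
    print(cards_needed(128) * 2000)  # ~$12,000 of RTX 4090s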

Servers with a lot of memory channels have a decent amount of memory bandwidth, not as much as high-end GPUs but still several times more than desktop PCs, so the performance is kind of medium. Meanwhile they support copious amounts of cheap commodity RAM. There is no GPU, you just run it on a CPU with a lot of cores and memory channels.


Got it, thanks!


It's fast enough to do realtime (7 tok/s) chat with 120b models.

And yes, of course it's not magic, and in principle there's no reason why a dedicated LLM-box with heaps of fast DDR5 couldn't cost less. But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself. The beauty of Mac Studio is that you just plug it in, and it works.


> It's fast enough to do realtime (7 tok/s) chat with 120b models.

Please list quantization for benchmarks. I'm assuming that's not the full model because that would need 256GB and I don't see a Studio model with that much memory, but q8 doubles performance and q4 quadruples it (with corresponding loss of quality).
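For the size arithmetic: weight footprint ≈ parameters × bits per weight / 8, and since decode speed scales with the inverse of that, halving the size roughly doubles the bandwidth-bound ceiling. A sketch (real GGUF quant formats carry a little per-block overhead, so files come out slightly larger):

    # Approximate weight footprint in GB for a model with params_b billion params.
    def weights_gb(params_b, bits_per_weight):
        return params_b * bits_per_weight / 8

    print(weights_gb(120, 16))  # fp16: ~240 GB (hence "would need 256GB")
    print(weights_gb(120, 8))   # q8:   ~120 GB
    print(weights_gb(120, 4))   # q4:   ~60 GB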

> But in practice, I'm not aware of any actual offerings in this space for comparable money that do not involve having to mess around with building things yourself.

You can just buy a complete server from a vendor or eBay, but this costs more because they'll try to constrain you to a particular configuration that includes things you don't need, or overcharge for RAM etc. Which is basically the same thing Apple does.

Whereas you can buy the barebones machine and then put components in it, which takes like fifteen minutes but can save you a thousand bucks.


6.3 tok/s has been demonstrated with q4_0 Falcon 180B on the 192GB Mac Studio: https://x.com/ggerganov/status/1699791226780975439?s=46&t=ru...


A lot of people are reporting tokens-per-second numbers at small context sizes without discussing how slow it gets at bigger ones. In some cases they're also not mentioning time to first token. If you dig around /LocalLLaMA you can find some posts with better benchmarking.


That is with 4-bit quantization. For practical purposes I don't see the point of running anything higher than that for inference.


That's interesting though, because it implies the machine is compute-bound. A 4-bit 120B model is ~60GB, so you should get ~13 tokens/second out of 800GB/s if it was memory-bound. 7/s implies you're getting ~420GB/s.
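Spelled out with the same numbers (a sketch, assuming decode is purely bandwidth-bound):

    # 4-bit 120B model: ~60 GB of weights to stream per token.
    weights_gb = 120 * 4 / 8
    print(800 / weights_gb)   # memory-bound ceiling at 800 GB/s: ~13.3 tok/s
    print(7 * weights_gb)     # observed 7 tok/s implies ~420 GB/s effective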

And the Max has half as many cores as the Ultra, implying it would be compute-bound too.


The issue here isn't specifically about the classification of memory, be it "unified memory," RAM, or VRAM. The primary concern is ensuring there's enough memory capacity for the models required for inference. The real question at hand is the Mac's value proposition in terms of inference speed, particularly for models as large as 70 billion parameters. Utilizing a 4090 GPU can facilitate real-time inference, which is the desired outcome for most users. In contrast, a Mac Studio offers close to real-time inference speeds, which might be disappointing for users expecting a real-time experience. Then, there's the option of CPU + RAM-based inference, which suits scenarios where immediate responses aren't crucial, allowing for batch processing of prompts and subsequent retrieval of responses. Considering the price points of both the Mac Studio and high-end GPUs are relatively comparable, it begs the question of the practicality and value of near real-time inference in specific use cases.


Considering that the topic is approachability and energy efficiency, that Mac Studio will do reasonably fast inference while consuming <200W at full load.

The speed is certainly not comparable to dedicated GPUs, but the power efficiency is ridiculous for a very usable speed and no hardware setup.


This, and then you get to have a Mac Studio.

I have one, where I selected an M1 Ultra and 128G RAM to facilitate just this sort of thing. But in practice, I'm spending much more time using it to edit 4K video, and as a recording studio/to develop audio plugins on, and to livestream while doing these things.

Turns out it's good at these things, and since I have the LLAMA 70b language model at home and can run it directly unquantized (not at blinding speed, of course, but it'll run just fine), I'm naturally interested in learning how to fine tune it :)


Yep, I also got mine specifically for LLMs and ended up using it as a second desktop for other things; actually strongly considering making it my primary at this point.

I still wouldn't recommend it to someone just looking for a powerful desktop, just because $3K is way overpriced for what you get (a non-replaceable 1TB SSD is so Apple!). But it's certainly great if you already have it...


"Real-time" is a very vague descriptor. I get 7-8 tok/s for 70b model inference on my M1 Mac - that's pretty real-time to me. Even Professor-155b runs "good enough" (~3 tok/s) for what I'd consider real-time chat in English.


[flagged]


I'm not a gpt. But now you could say that this is exactly how a gpt would answer and we get stuck in a loop and there's no obvious way to prove that I'm not a gpt.


Interesting. Let's review the comment.

> The issue here isn't specifically about the classification of memory, be it "unified memory," RAM, or VRAM. The primary concern is ensuring there's enough memory capacity for the models required for inference.

The comment chain is about training, not inference.

> The real question at hand is the Mac's value proposition in terms of inference speed, particularly for models as large as 70 billion parameters.

Again, wrong topic.

> Utilizing a 4090 GPU can facilitate real-time inference, which is the desired outcome for most users.

Generic statement. Semantically empty. Typical LLM style.

> In contrast, a Mac Studio offers close to real-time inference speeds, which might be disappointing for users expecting a real-time experience.

Tautological generic statement. Semantically empty. Typical LLM style.

> Then, there's the option of CPU + RAM-based inference, which suits scenarios where immediate responses aren't crucial, allowing for batch processing of prompts and subsequent retrieval of responses.

Contradicts the first sentence's claim that "classification of memory" isn't important. Fails to recognize this is the same category as the previous statement. Subtle shift from the first sentence, which declared the "primary concern is ... memory capacity", to focusing purely on performance. This kind of incoherent shift is common in LLM output.

> Considering the price points of both the Mac Studio and high-end GPUs are relatively comparable, it begs the question of the practicality and value of near real-time inference in specific use cases.

Completes shift from memory capacity to performance. Compares not really comparable things. "Specific use cases" is a tell-tale LLM marker. Semantically empty.


I feel the need to point out that people who spend many hours writing with an LLM will eventually start writing like the LLM.


I'm definitely guilty of this. Especially for non-native speakers, who might more easily lean towards adopting phrases from others (including GPTs) because they're not sure how to phrase things correctly.


antinf, congratulations, I think you have proven, beyond any doubt, that I'm a gpt.

(This is a semantically empty tautological generic statement.)


'Write me something profane?' That probably weeds out commercially available GPTs?


No, this is a common fallacy. You can tell ChatGPT, one of the most infamously hobbled GPTs, in custom instructions that you do want profanity, and it will oblige. This is not a jailbreak, this is supported behavior.


I do think that there is a certain amount of gpt bot activity present on HN. But I don't think it makes sense to call people out and say they're a gpt based on just one comment.


I'm sorry, but I can't fulfill that request. Is there anything else I might assist you with? ;)




