Do you mean the people ITT saying "all"? Whether it's two people or a dozen, if there are people ITT using the same word, then they're all using the word.
We’re already past that point! MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B. These models run at very comfortable speeds on Apple Silicon. And they are distinctly more capable and less prone to hallucination than GPT-3.5 was.
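If you want to try this yourself, here's a minimal sketch using the mlx-lm Python package on Apple Silicon; the model repo name is just an example of a 4-bit community conversion, swap in whichever model you prefer:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Example 4-bit MLX conversion from the mlx-community org on Hugging Face;
# any 8B-class model like this fits comfortably in a MacBook's unified memory.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

prompt = "Explain the difference between RAM and unified memory in two sentences."
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```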
Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.
OMM, Llama 3.3 70B runs at ~7 text-generation tokens per second on a MacBook Pro Max with 128GB, while generating GPT-4-feeling text with more in-depth responses and fewer bullet lists. Llama 3.3 70B also doesn't fight the system prompt; it leans in.
Consider e.g. LM Studio (0.3.5 or newer) for a Metal/MLX-centered UI, and include MLX in your search term when downloading models.
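LM Studio can also expose whatever model you've loaded through an OpenAI-compatible local server (port 1234 by default), so a rough sketch of talking to it from Python looks like this; the model identifier below is an example, use whatever name LM Studio shows for your loaded model:

```python
# pip install openai -- the client just points at LM Studio's local server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

resp = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Summarize this thread in one sentence."}],
)
print(resp.choices[0].message.content)
```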
Also, do not scrimp on the storage. At 60-100GB per model, a day of experimentation can easily fill 2.5TB of model cache. And remember to exclude that path from your Time Machine backups.
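For a rough sense of where those numbers come from, on-disk size is basically parameter count times bits per weight (plus a bit of metadata). A quick back-of-the-envelope:

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size: parameters * bits per weight, ignoring small metadata overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(approx_model_size_gb(70, 16))   # ~140 GB: unquantized fp16/bf16 70B
print(approx_model_size_gb(70, 8))    # ~70 GB: 8-bit quant
print(approx_model_size_gb(70, 4.5))  # ~39 GB: a typical 4-bit GGUF/MLX quant
```

A handful of 70B downloads at a few different quantizations is all it takes to burn through terabytes.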
It's all memory-bandwidth related -- what's slow is loading these models into memory, basically. The last die from Apple with all the memory channels was the M2 Ultra, and I bet that's what tops those leaderboards. The M4 hasn't had an Ultra release yet; when it does (and it seems likely it will), that will be the one to get.
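A back-of-the-envelope way to see why bandwidth dominates: each generated token has to stream roughly the whole set of weights through memory once, so bandwidth divided by model size gives a ceiling on tokens per second. The bandwidth figures below are approximate published specs, not measurements:

```python
def tok_per_sec_ceiling(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    return mem_bandwidth_gb_s / model_size_gb

# ~40 GB is roughly a 4-bit 70B quant.
print(tok_per_sec_ceiling(800, 40))  # M2 Ultra-class bandwidth -> ~20 tok/s ceiling
print(tok_per_sec_ceiling(400, 40))  # Max-class bandwidth -> ~10 tok/s, in line with the ~7-10 tok/s reported above
```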
You could definitely run an 8B model on that, and some of those are getting very capable now.
The problem is that often you can't run anything else. I've had trouble running larger models in 64GB when I've had a bunch of Firefox and VS Code tabs open at the same time.
I have an M2 Air with 24GB, and have successfully run some 12B models such as mistral-nemo. I had other stuff going as well, but it's best to give it as much of the machine as possible.
I recently upgraded to exactly this machine for exactly this reason, but I haven't taken the leap and installed anything yet. What's your favorite model to run on it?
I bought an old used desktop computer, a used 3090, and upgraded the power supply, all for around 900€. I haven't assembled it all yet, but it will be able to comfortably run 30B-parameter models at 30-40 T/s. The M4 Max can do ~10 T/s, which is not great once you really want to rely on it for your productivity.
Yes, it is not "local" in the sense that I will have to use the internet when not at home. But it will also not drain the battery very quickly when I'm using it, which I suspect would happen to a MacBook Pro running such models. Also, 70B models are out of reach of my setup, but I think they are painfully slow on Mac hardware anyway.
I'm returning my 96GB M2 Max. It can run unquantized Llama 3.3 70B, but tokens per second are slow as molasses, and I still couldn't find any use for it; I just kept going back to Perplexity when I actually needed to find an answer to something.
Llama 8B is multilingual on paper, but the quality is very bad compared to English. It generally understands grammar, and you can understand what it's trying to say, but the choice of words is off most of the time, often complete gibberish. If you can imagine the output of an undertrained model, this is it. Meanwhile, GPT-3.5 had far better output that you could use in production.
> gpt-3.5-turbo is generally considered to be about 20B params. An 8B model does not exceed it.
The industry has moved on from the old Chinchilla scaling regime, and with it the conviction that LLM capability is mainly dictated by parameter count. OpenAI didn't disclose how much pretraining they did for 3.5-Turbo, but GPT-3 was trained on 300 billion tokens of text data. In contrast, Llama 3.1 was trained on 15 trillion tokens of data.
Objectively, Llama 3.1 8B and other small models have exceeded GPT-3.5-Turbo in benchmarks and human preference scores.
> Is a $8000 MBP regular consumer hardware?
As user `bloomingkales` notes down below, a $499 Mac Mini can run 8B parameter models. An $8,000 expenditure is not required.
>> Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4
> I'm skeptical; the llama 3.1 405B model is the only comparable model I've used, and it's significantly larger than the 70B models you can run locally.
Every new Llama generation has managed to beat the previous generation's larger models with smaller ones.
And benchmarks are very misleading in this regard. We've seen no shortage of even 8B models claiming that they beat GPT-4 and Claude in benchmarks. Every time this happens, once you start actually using the model, it's clear that it's not actually on par.
Here's a config for around the same price. All brand-new parts for 573. You can spend the difference improving any part you wish, or maybe get a used 3060 and go AM5 instead (Ryzen 8400F). Both paths are upgradeable.
Double the LLM performance. Half the desktop performance. But you can use both at the same time. Your computer will not slow down when running inference.
Then you can plug a GPU into it. It should have decent load times: better than an eGPU, worse than the AM4 desktop build, and fast enough to beat the M4 (once the data is in the GPU, it doesn't matter).
It makes for a very portable setup. I haven't built it, but I think it's a reasonable LLM choice comparable to the M4 in speed and portability while still being upgradable.
Edit: and you'll need an external power supply of at least 400W :)
I only use local, open source LLMs because I don’t trust cloud-based LLM hosts with my data. I also don’t want to build a dependence on proprietary technology.
We're there: Llama 3.1 8B beats Gemini Advanced, the $20/month offering. Telosnex with Llama 3.1 8B GGUF from bartowski. https://telosnex.com/compare/ (How!? tl;dr: I assume Google is sandbagging and hasn't updated the underlying Gemini.)
Saying these models are at GPT-4 level is setting anyone who doesn't place special value on the local aspect up for disappointment.
Some people do place value on running locally, and I'm not against them for it, but realistically no 70B-class model has the general knowledge or grasp of nuance of any recent GPT-4 checkpoint.
That being said, these models are still very strong compared to what we had a year ago and capable of useful work.
I have to disagree. I understand it's very expensive, but it's still a consumer product available to anyone with a credit card.
The comparison is between something you can buy off the shelf like a powerful Mac, vs something powered by a Grace Hopper CPU from Nvidia, which would require both lots of money and a business relationship.
Honestly, people pay $4k for nice TVs, refrigerators and even couches, and those are not professional tools by any stretch. If LLMs needed a $50k Mac Pro with maxed out everything, that might be different. But anything that's a laptop is definitely regular consumer hardware.
There have definitely been plenty of sources of hardware capable of running LLMs out there for a while, Mac or not. A couple of 4090s or P40s will run 3.1 70B. Or, since price isn't a limit, there are other easier & more powerful options like a [tinybox](https://tinygrad.org/#tinybox:~:text=won%27t%20be%20consider...).
Yeah, a computer which starts at $3900 is really stretching that classification. Plus if you're that serious about local LLMs then you'd probably want the even bigger RAM option, which adds another $800...
An optioned-up minivan is also expensive, but it doesn't cost as much as a firetruck. It's expensive but still very much consumer hardware. A 3x4090 rig is more expensive and still consumer hardware. An H100 is not; you can buy something like 7 of these optioned-up MBPs for a single H100.
In my experience, people use the term in two separate ways.
If I'm running a software business selling software that runs on 'consumer hardware' the more people can run my software, the more people can pay me. For me, the term means the hardware used by a typical-ish consumer. I'll check the Steam hardware survey, find the 75th-percentile gamer has 8 cores, 32GB RAM, 12GB VRAM - and I'd better make sure my software works on a machine like that.
On the other hand, 'consumer hardware' could also be used to simply mean hardware available off-the-shelf from retailers who sell to consumers. By this definition, 128GB of RAM is 'consumer hardware' even if it only counts as 0.5% in Steam's hardware survey.
On the Steam Hardware Survey, the average gamer uses a computer with a 1080p display too. That doesn't somehow make any gaming laptop with a 2K screen sold in the last half decade a non-consumer product. For that matter, the average gaming PC on Steam is itself above average relative to the average computer: the typical office computer or school Chromebook is likely several generations older and doesn't have an NPU or discrete GPU at all.
For AI and LLMs, I'm not aware of any company even selling the model assets directly to consumers; they're either completely unavailable (OpenAI) or freely licensed, so the companies training them aren't really dependent on what the average person has for commercial success.
That's basically a new standard. I like it, but getting all JSON parsers on board will be tricky if not impossible. Maybe it deserves its own name. How about JSONc (JAYSON-SEE)?
Edit: I read the fine print on the page after posting this... Turns out that I'm not the first to come up with jsonc. The author just isn't a huge fan and wants JSON parsers to accept comments. Good luck with that!
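For what it's worth, you don't need parser buy-in to use comments today; stripping them before handing the text to a standard parser is only a few lines. A rough sketch that handles // and /* */ outside of strings:

```python
import json

def strip_json_comments(text: str) -> str:
    """Remove // line comments and /* block */ comments that appear outside strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        c = text[i]
        if in_string:
            out.append(c)
            if c == "\\" and i + 1 < n:   # keep escape sequences intact, including \"
                out.append(text[i + 1])
                i += 1
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
            out.append(c)
        elif text.startswith("//", i):
            nl = text.find("\n", i)       # drop everything up to the newline
            if nl == -1:
                break
            i = nl
            continue
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)  # drop everything up to the closing */
            i = n if end == -1 else end + 2
            continue
        else:
            out.append(c)
        i += 1
    return "".join(out)

doc = '{ "name": "demo", /* block */ "port": 1234 // trailing note\n}'
print(json.loads(strip_json_comments(doc)))  # {'name': 'demo', 'port': 1234}
```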