If you have an M1-class machine or newer with sufficient RAM, the medium-sized models, roughly 30GB on disk, perform well enough on many tasks to be quite useful without leaking your data.
I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.
I concur; in my experience Mixtral is one of the best ~30GB models (likely the best pro-laptop-sized model currently) and Gemma is quite good compared to other sub-8GB models.
Use llamafile [1]; it can be as simple as downloading a file (for Mixtral, [2]), making it executable, and running it (sketch below). The repo README has all the info; it's simple, and downloading the model is what takes the most time.
In my case I hit the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.
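For reference, the whole flow is roughly this (just a sketch; the filename is illustrative, use whatever file link [2] actually gives you):

  # after downloading the .llamafile from [2]
  chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile
  ./mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile
  # if it trips the runtime-detection gotcha, run the assimilate tool from [3] on the file first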
Either https://lmstudio.ai (desktop app with a nice GUI) or https://ollama.com (a command-line tool, more like a Docker container, which you can also hook up to a web UI via https://openwebui.com) should be super straightforward to get running.
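The ollama route is roughly this (model tag from memory; check their library page for exact names and sizes):

  ollama pull mixtral   # fetches the weights, tens of GB depending on the quant
  ollama run mixtral    # interactive chat in the terminal
  # Open WebUI then just needs to be pointed at the local ollama API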
I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you try it, let me know what you think.
Looks great. Can you recommend what GPU to get to just play with the models for a bit? (I want it to be fast, otherwise I lose interest too quickly.)
Are consumer GPUs like the RTX 4080 Super sufficient, or do I need anything else?
Why is this both free and closed source? Ideally, when you advertise privacy-first, I’d like to see a GitHub link with real source code. Or I’d rather pay for it to ensure you have a financial incentive to not sell my data.
There’s incredible competition in this space already - I’d highly recommend outright stating your future pricing plans, instead of a bait-and-switch later.
Check out PrivateGPT on GitHub. Pretty much just works out of the box. I got Mistral 7B running on a GTX 970 in about 30 minutes flat first try. Yep, that's the triple-digit GTX 970.
30GB+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.
To a first-order approximation, LLMs are bandwidth-constrained. We can estimate single-batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).
An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some; let's ignore that). The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput of around 4.2 tokens per second.
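That 4.2 is just the formula above plugged in, using the figures as quoted (bc truncates to one decimal):

  # tokens/s ~= bandwidth / (active params * bytes per param)
  echo "scale=1; 300 / (70 * 1.0)" | bc   # 300 GiB/s over 70 GiB of weights -> 4.2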
Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM onto the GPU die for each parameter.
Mixture-of-Experts models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.
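Rough numbers for both effects with the same formula (the ~13B active parameters for Mixtral 8x7B is my assumption based on its 2-of-8 expert routing; all of its ~47B parameters still have to sit in memory):

  # 4-bit Llama 2 70B: half a byte per parameter, so roughly double the throughput
  echo "scale=1; 300 / (70 * 0.5)" | bc   # -> 8.5 tokens/s
  # Mixtral 8x7B at 8-bit: only ~13B parameters touched per token
  echo "scale=1; 300 / (13 * 1.0)" | bc   # -> 23.0 tokens/s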
It’s an awful thing for many to accept, but just downloading and setting up an LLM that doesn’t connect to the web doesn’t mean your conversations with it won’t be a severely interesting piece of telemetry that Microsoft (and likely Apple) would swipe to help deliver a ‘better service’ to you.