If you have an M1-class machine or newer with sufficient RAM, the medium-sized models, roughly 30GB on disk, perform well enough on many tasks to be quite useful without leaking your data.
I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.
I concur; in my experience Mixtral is one of the best ~30GB models (likely the best pro-laptop-sized model currently) and Gemma is quite good compared to other sub-8GB models.
Use llamafile [1]; it can be as simple as downloading a file (for Mixtral, [2]), making it executable, and running it (sketch below). The repo README has all the info; it's simple, and downloading the model is what takes the most time.
In my case I hit the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.
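For reference, the whole flow is roughly this (just a sketch; the filename is illustrative, use whatever file link [2] actually gives you):

  # after downloading the .llamafile from [2]
  chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile
  ./mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile
  # if it trips the runtime-detection gotcha, run the assimilate tool from [3] on the file first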
Either https://lmstudio.ai (desktop app with a nice GUI) or https://ollama.com (a command-line tool, more like a Docker container, which you can also hook up to a web UI via https://openwebui.com) should be super straightforward to get running.
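The ollama route is roughly this (model tag from memory; check their library page for exact names and sizes):

  ollama pull mixtral   # fetches the weights, tens of GB depending on the quant
  ollama run mixtral    # interactive chat in the terminal
  # Open WebUI then just needs to be pointed at the local ollama API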
I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you try it, let me know what you think.
Looks great. Can you recommend what GPU to get to just play with the models for a bit? (I want it to be fast, otherwise I lose interest too quickly.)
Are consumer GPUs like the RTX 4080 Super sufficient, or do I need anything else?
Why is this both free and closed source? Ideally, when you advertise privacy-first, I’d like to see a GitHub link with real source code. Or I’d rather pay for it to ensure you have a financial incentive to not sell my data.
There’s incredible competition in this space already - I’d highly recommend outright stating your future pricing plans, instead of a bait-and-switch later.
Check out PrivateGPT on GitHub. Pretty much just works out of the box. I got Mistral 7B running on a GTX 970 in about 30 minutes flat first try. Yep, that's the triple-digit GTX 970.
30GB+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.
To a first-order approximation, LLMs are bandwidth-constrained. We can estimate single-batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).
An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some; let's ignore that). The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput of around 4.2 tokens per second.
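That 4.2 is just the formula above plugged in, using the figures as quoted (bc truncates to one decimal):

  # tokens/s ~= bandwidth / (active params * bytes per param)
  echo "scale=1; 300 / (70 * 1.0)" | bc   # 300 GiB/s over 70 GiB of weights -> 4.2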
Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM onto the GPU die for each parameter.
Mixture-of-Experts models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.
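Rough numbers for both effects with the same formula (the ~13B active parameters for Mixtral 8x7B is my assumption based on its 2-of-8 expert routing; all of its ~47B parameters still have to sit in memory):

  # 4-bit Llama 2 70B: half a byte per parameter, so roughly double the throughput
  echo "scale=1; 300 / (70 * 0.5)" | bc   # -> 8.5 tokens/s
  # Mixtral 8x7B at 8-bit: only ~13B parameters touched per token
  echo "scale=1; 300 / (13 * 1.0)" | bc   # -> 23.0 tokens/s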
It’s an awful thing for many to accept, but just downloading and setting up an LLM that doesn’t connect to the web doesn’t mean your conversations with it won’t be a severely interesting piece of telemetry that Microsoft (and likely Apple) would swipe to help deliver a ‘better service’ to you.