One thing about LLMs is that they are 6GB+ (and much larger for "smart" ones) just sitting in the background. They suck power and produce heat like nothing else, and they are finicky, especially at smaller sizes.
Running one as a background desktop assistant is a whole different animal than calling a Microsoft API.
At least with a GPU that can do power save, that's not the case. I have a box with some 3090s in it; each card idles below 50W when it's not doing inference, even with the weights loaded into VRAM. Only when I ask it to do inference does it spin up and start consuming 300-400W.
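You can watch that from the shell with nvidia-smi; the query fields below are standard, and the 300W cap is just an example (it needs root):

    # per-card power draw and VRAM, refreshed every second
    watch -n 1 nvidia-smi --query-gpu=index,power.draw,power.limit,memory.used --format=csv

    # optionally cap a card below the 3090's stock 350W limit
    sudo nvidia-smi -i 0 -pl 300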
> One thing about LLMs is that they are 6GB+ (and much larger for "smart" ones) just sitting in the background. They suck power and produce heat like nothing else, and
Huh? That's not at all true. It's only using processing power (CPU) while it actually generates text; otherwise it just sits and waits. Yes, it occupies memory (RAM or VRAM) if you don't unload it, but you can configure it to start up when you need it and shut down when you don't.
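For example, here's a rough sketch using llama.cpp's llama-server (the model path and port are placeholders): keep it running only while you actually want completions, and the weights leave RAM/VRAM the moment you stop it.

    # start the server only when you need it
    ./llama-server -m ./models/mistral-7b-instruct.Q4_K_M.gguf --port 8080 &
    SERVER_PID=$!

    # ... send prompts to http://localhost:8080 while you work ...

    # done for now: stop it and the memory is released
    kill "$SERVER_PID"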
If one uses llama.cpp in CPU mode, the models are mmapped, so they don't really occupy memory when not in use (they are just reclaimable page cache).
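A rough way to see this with the llama-cli binary (the model path is a placeholder; --no-mmap and --mlock are the relevant llama.cpp switches):

    free -h                                  # note buff/cache before loading
    ./llama-cli -m model.gguf -p "hi" -n 16  # default: weights are mmapped
    free -h                                  # most of the model shows up as cache, not "used"

    # for contrast: read the weights into anonymous memory, or pin them so they can't be reclaimed
    ./llama-cli -m model.gguf -p "hi" -n 16 --no-mmap
    ./llama-cli -m model.gguf -p "hi" -n 16 --mlock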
Does anyone actually use the CPU for anything besides testing? Last time I tried it, it was so horribly slow compared to a GPU that there wasn't really any point for me, besides getting access to more memory.
On a Mac Studio running NixOS-based Asahi Linux with 128GB of RAM, mixtral 8x7b uses 49GB of RAM. At the same time I run Airflow tasks that deal with worldwide datasets (using ~60GB across 16 parallel streams on the performance cores); the data is Parquet and also mmapped.
The computer still has the 8 efficiency cores and the whole GPU free for visualizing the maps with lonboard, browsing, etc.
The computer uses 8-10W when idle, ~100W when running jobs or actively using the LLM, and around 200W when really pushing the GPU.
This makes it very energy-efficient in my book compared to keeping a beastly modern CPU and Nvidia GPU powered on at idle. My electricity bill is unaffected.
./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"
I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval.
The same hardware can run things much faster on OSX, or with more aggressive quantization, but I prefer to run things at Q8 or f16 even if they are slow. In the future I hope to use the GPU, the ANE, and the crazy 1.58 or 0.68 bit quantization, but for now this does the trick handsomely.
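For comparison, the same flags with a more aggressive quant should be noticeably faster at some quality cost (the Q4_K_M filename below is hypothetical; use whichever build you actually downloaded):

    ./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"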
Some of us like to experiment with new technology but don't physically own the kind of hardware that is ideal for it. So yes, I've actually gotten passable results running on CPU (on a 2019 laptop, at that).
True, but as engineers we should sometimes make sacrifices and live in the future a bit to maximize opportunity; hardware will adapt to software, as it always has.
I'd love this to be true, and it might be for some specific, well-tested situations with a narrow set of data that you can be confident about. But that's a bit wishful, isn't it?
That is an example of a good, narrow task area that a small model could be good at with current tech, as opposed to general AI assistants like GPT-4. Using mixture of experts with task-specific fine-tuning, I can see it being possible, but I was mainly saying Phi 2 ain't it. It may be a good starting place! Also, a code completion model could totally end up easily installed from a major Linux distro's default package manager soon, if not already.