One thing about LLMs is that they are 6GB+ (and much larger for "smart" ones) just sitting in the background. They suck power and produce heat like nothing else, and they are finicky, especially at smaller sizes.
Running one as a background desktop assistant is a whole different animal than calling a Microsoft API.
At least with a GPU that can do power save, that's not the case. I have a box with some 3090s in it; each card idles below 50W when it's not doing inference, even with the weights loaded into VRAM. Only when I ask it to do inference does it spin up and start consuming 300-400W.
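You can watch that from the shell with nvidia-smi; the query fields below are standard, and the 300W cap is just an example (it needs root):

    # per-card power draw and VRAM, refreshed every second
    watch -n 1 nvidia-smi --query-gpu=index,power.draw,power.limit,memory.used --format=csv

    # optionally cap a card below the 3090's stock 350W limit
    sudo nvidia-smi -i 0 -pl 300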
> One thing about LLMs is that they are 6GB+ (and much larger for "smart" ones) just sitting in the background. They suck power and produce heat like nothing else, and
Huh? That's not at all true. It's only using processing power (CPU) while it actually generates text; otherwise it just sits and waits. Yes, it occupies memory (RAM or VRAM) if you don't unload it, but you can configure it to start up when you need it and shut down when you don't.
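For example, here's a rough sketch using llama.cpp's llama-server (the model path and port are placeholders): keep it running only while you actually want completions, and the weights leave RAM/VRAM the moment you stop it.

    # start the server only when you need it
    ./llama-server -m ./models/mistral-7b-instruct.Q4_K_M.gguf --port 8080 &
    SERVER_PID=$!

    # ... send prompts to http://localhost:8080 while you work ...

    # done for now: stop it and the memory is released
    kill "$SERVER_PID"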
If one uses llama.cpp in CPU mode, the models are mmapped, so they don't really occupy memory when not in use (they are just reclaimable page cache).
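A rough way to see this with the llama-cli binary (the model path is a placeholder; --no-mmap and --mlock are the relevant llama.cpp switches):

    free -h                                  # note buff/cache before loading
    ./llama-cli -m model.gguf -p "hi" -n 16  # default: weights are mmapped
    free -h                                  # most of the model shows up as cache, not "used"

    # for contrast: read the weights into anonymous memory, or pin them so they can't be reclaimed
    ./llama-cli -m model.gguf -p "hi" -n 16 --no-mmap
    ./llama-cli -m model.gguf -p "hi" -n 16 --mlock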
Does anyone actually use the CPU for anything besides testing? Last time I tried it, it was so horribly slow compared to a GPU that there wasn't really any point for me, besides getting access to more memory.
On a Mac Studio running NixOS-based Asahi Linux with 128GB of RAM, mixtral 8x7b uses 49GB of RAM. At the same time I run Airflow tasks that deal with worldwide datasets (using ~60GB across 16 parallel streams on the performance cores); the data is Parquet and also mmapped.
The computer still has the 8 efficiency cores and the whole GPU free for visualizing the maps with lonboard, browsing, etc.
The computer uses 8-10W when idle, ~100W when running jobs or actively using the LLM, and around 200W when really pushing the GPU.
This makes it very energy-efficient in my book compared to keeping a beastly modern CPU and Nvidia GPU powered on at idle. My electricity bill is unaffected.
./mixtral-8x7b-instruct-v0.1.Q8_0.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"
I got 15 tokens per second for prompt evaluation and 8 tokens per second for regular eval.
The same hardware can run things much faster on OSX, or with more aggressive quantization, but I prefer to run things at Q8 or f16 even if they are slow. In the future I hope to use the GPU, the ANE, and the crazy 1.58 or 0.68 bit quantization, but for now this does the trick handsomely.
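For comparison, the same flags with a more aggressive quant should be noticeably faster at some quality cost (the Q4_K_M filename below is hypothetical; use whichever build you actually downloaded):

    ./mixtral-8x7b-instruct-v0.1.Q4_K_M.llamafile --cli -t 16 -n 200 -p "In terms of Lasso"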
Some of us like to experiment with new technology but don't physically own the kind of hardware that is ideal for it. So yes, I've actually gotten passable results running on CPU (on a 2019 laptop, at that).
True, but as engineers we should sometimes make sacrifices and live in the future a bit to maximize opportunity; hardware will adapt to software, as it always has.
I'd love this to be true, and it might be for some specific, well-tested situations with a narrow set of data that you can be confident about. But that's a bit wishful, isn't it?
That is an example of a good, narrow task area that a small model could be good at with current tech, as opposed to general AI assistants like GPT-4. Using mixture of experts with task-specific fine-tuning, I can see it being possible, but I was mainly saying Phi 2 ain't it. It may be a good starting place! Also, a code completion model could totally end up easily installed from a major Linux distro's default package manager soon, if not already.