
Dumb question: why is having just 3MB so important, given that any good text model requires tens of GB of disk and RAM?


One possible reason, which I'm not sure applies here, is that a server that small can fit inside the CPU cache and thereby give very low-latency responses (which also increases concurrency).

Obviously only relevant for non-inference API calls.


By the time this is production ready it will no longer fit in that cache. There is a reason it is tiny. Notice how the majority of the features are planned. It wouldn't take much for an experienced engineer to simply deploy llama.cpp or one of the other inference backends directly themselves. llama.cpp already includes an OpenAI-compatible API:

https://github.com/ggerganov/llama.cpp/blob/master/examples/...
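For the sake of illustration, here's a minimal sketch of what "OpenAI compatible" means in practice, assuming a llama-server instance is already running on the default http://localhost:8080 (the endpoint path and response shape follow the OpenAI chat-completions convention; the model name is just a placeholder):

    # Call a local llama.cpp server through its OpenAI-compatible endpoint,
    # using only the Python standard library (no extra client dependencies).
    import json
    import urllib.request

    payload = {
        "model": "local-model",  # placeholder; llama.cpp serves whatever model it was started with
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }

    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        print(body["choices"][0]["message"]["content"])

Point being, any existing OpenAI client code can be pointed at the local server by swapping the base URL.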


Oh, never mind, you mean non-inference calls.

...But still, these are not very common API calls? Generally an OpenAI endpoint is mostly inference calls, right? And llama.cpp's slowness is going to blow that advantage away.


I think the size is not very important by itself, but it can be a good measure of the dependencies. If something needs PyTorch and other Python stuff, it's a more complex install than something that's standalone. That said, llama.cpp (on which this is built) still needs the CUDA toolkit (a 4 GB install?) to run on a GPU. On a Mac I'm not so sure. So it's a bit of misdirection, unless we're only talking about CPU.

(Someone correct me if I'm wrong; I'm only familiar with building llama.cpp. Can it run from just the binary without CUDA?)


The original llama.cpp did not have CUDA support. It was a pure CPU binary using vector instructions for acceleration. IIRC it uses Apple's "Accelerate" framework for faster computation on the M-series CPUs.


Also, they include llama.cpp; I guess they just download it instead of bundling it with the binary like Ollama does.



