One possible reason, though I'm not sure it applies here, is that a server that small can fit inside the CPU cache and thereby serve very low-latency responses (which also increases concurrency).
Obviously only relevant for non-inference API calls.
By the time this is production-ready it will no longer fit in that cache. There is a reason it is tiny: notice how the majority of the features are still only planned. It wouldn't take much for an experienced engineer to simply deploy llama.cpp or one of the other inference backends directly themselves; llama.cpp already includes an OpenAI-compatible API.
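For example, here's a minimal sketch of hitting llama.cpp's built-in server through the standard OpenAI Python client, assuming a llama-server instance is running locally on its default port 8080 (the model name is just a placeholder):

    # Point the OpenAI client at a locally running llama-server
    # (started with e.g. `llama-server -m model.gguf`); the API key is unused.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    response = client.chat.completions.create(
        model="local-model",  # placeholder; llama-server accepts an arbitrary name
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)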
...But still, aren't these fairly uncommon API calls? Generally an OpenAI-compatible endpoint is serving mostly inference calls, right? And llama.cpp's slower inference is going to blow that latency advantage away.