We just had a horrendous experience trying to deploy the example llama.cpp server for production. On top of the bugs, instability, and general fragility, even when it did work it was slow as molasses on an A100. And we put a lot of effort into sticking with it because of the grammar feature.
...So I would not recommend a batched llama.cpp server to anyone, TBH, when there's a grab bag of splendid OpenAI-compatible endpoints like TabbyAPI, Aphrodite, LiteLLM, vLLM...
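For what it's worth, the grammar lock-in is weaker than it looks: several of those OpenAI-compatible servers now support some form of constrained decoding. Here's a minimal sketch of what that might look like against a vLLM endpoint, assuming its `guided_json` `extra_body` extension (the parameter names and exact behavior vary by server and version, so check the docs for whichever backend you pick):

```python
# Minimal sketch: constrained output via an OpenAI-compatible endpoint (vLLM here),
# as a stand-in for llama.cpp's GBNF grammar feature. The `guided_json` field is a
# vLLM-specific extension passed through `extra_body`; other servers expose similar
# but differently named options, so treat this as an assumption, not a spec.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # wherever the OpenAI-compatible server runs
    api_key="not-needed-for-local",       # local servers usually ignore the key
)

# Constrain the completion to a small JSON schema instead of a GBNF grammar.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "number"},
    },
    "required": ["sentiment", "score"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Classify: 'The deploy went fine today.'"}],
    extra_body={"guided_json": schema},  # vLLM extension; other servers differ
)
print(resp.choices[0].message.content)
```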