
We just had a horrendous experience trying to deploy the example llama.cpp server for production. In addition to bugs, instability, and general fragility, when it did work it was slow as molasses on an A100. And we spent a lot of effort sticking with it because of the grammar feature.

...So I would not recommend a batched llama.cpp server to anyone, TBH, when there is a grab bag of splendid OpenAI-compatible endpoints like TabbyAPI, Aphrodite, LiteLLM, vLLM...
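For reference, the grammar feature here is llama.cpp's GBNF constrained sampling, which the example server accepts as a "grammar" field on its /completion endpoint. A minimal sketch in Python, assuming a server running on localhost:8080 (the default port):

  import requests

  # GBNF grammar constraining the model's output to a literal "yes" or "no".
  grammar = 'root ::= "yes" | "no"'

  resp = requests.post(
      "http://localhost:8080/completion",
      json={
          "prompt": "Is the sky blue? Answer yes or no:",
          "n_predict": 4,       # cap the number of generated tokens
          "grammar": grammar,   # GBNF text, as accepted by the server
      },
  )
  print(resp.json()["content"])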




Wasn't it written for CPU only? An A100 would be better served by an optimized PyTorch or TensorFlow implementation. Just trying to understand the logic here.


Originally CPU-only, but it can do GPU too.


Does anything else support grammar?
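vLLM's OpenAI-compatible server does, via its guided-decoding extensions (guided_choice, guided_json, guided_grammar, passed as extra request fields). A rough sketch, assuming a vLLM server on localhost:8000 and a placeholder model name; these fields are vLLM extensions, not part of the official OpenAI API, and names may vary by version:

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

  resp = client.chat.completions.create(
      model="meta-llama/Llama-3-8B-Instruct",  # placeholder model name
      messages=[{"role": "user",
                 "content": "Reply with yes or no: is 7 prime?"}],
      # guided_choice is a vLLM extension restricting output to one of the
      # listed strings; guided_grammar similarly accepts a grammar definition.
      extra_body={"guided_choice": ["yes", "no"]},
  )
  print(resp.choices[0].message.content)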



