
We just had a horrendous experience trying to deploy the example llama.cpp server for production. In addition to bugs, instability, and general fragility, when it did work it was slow as molasses on an A100. And we spent a lot of effort sticking with it because of the grammar feature.

...So I would not recommend a batched llama.cpp server to anyone, TBH, when there is a grab bag of splendid OpenAI-compatible endpoints like TabbyAPI, Aphrodite, LiteLLM, vLLM...
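For reference, the grammar feature here is llama.cpp's GBNF constrained sampling, which the example server accepts as a "grammar" field on its /completion endpoint. A minimal sketch in Python, assuming a server running on localhost:8080 (the default port):

  import requests

  # GBNF grammar constraining the model's output to a literal "yes" or "no".
  grammar = 'root ::= "yes" | "no"'

  resp = requests.post(
      "http://localhost:8080/completion",
      json={
          "prompt": "Is the sky blue? Answer yes or no:",
          "n_predict": 4,       # cap the number of generated tokens
          "grammar": grammar,   # GBNF text, as accepted by the server
      },
  )
  print(resp.json()["content"])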




Wasn't it written for CPU only? An A100 would be better served by an optimized PyTorch or TensorFlow implementation. Just trying to understand the logic here.


Originally CPU-only, but it can do GPU too.


Does anything else support grammar?
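vLLM's OpenAI-compatible server does, via its guided-decoding extensions (guided_choice, guided_json, guided_grammar, passed as extra request fields). A rough sketch, assuming a vLLM server on localhost:8000 and a placeholder model name; these fields are vLLM extensions, not part of the official OpenAI API, and names may vary by version:

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

  resp = client.chat.completions.create(
      model="meta-llama/Llama-3-8B-Instruct",  # placeholder model name
      messages=[{"role": "user",
                 "content": "Reply with yes or no: is 7 prime?"}],
      # guided_choice is a vLLM extension restricting output to one of the
      # listed strings; guided_grammar similarly accepts a grammar definition.
      extra_body={"guided_choice": ["yes", "no"]},
  )
  print(resp.choices[0].message.content)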



