Nitro: A fast, lightweight inference server with OpenAI-Compatible API (jan.ai)
149 points by Lwrless on Jan 6, 2024 | 78 comments



These folks just came on the scene two weeks ago spamming reddit's LocalLLaMA with yet another clone of a llama.cpp wrapper like it's a new thing. Nothing in the feature set here is unique. An experienced developer could replicate this in hours in any language. More time was spent on the different permutations of their landing page than on the actual app. Just go to /r/localllama and search for jan.ai to see for yourself.


Looks like they raised funding though. They're hiring.

Side-note: I really don't get the C++ in AI thing at all. It's becoming a meme. I wonder why they didn't go with Rust instead. It'll be easier to deploy to the edge as WASM, too.


Although Rust is an amazing language with rapid adoption, it still has a relatively small user base in low-level programming, where llama.cpp operates. As a result, the pool of talent that can contribute to such projects is more limited. Almost all low-level programmers can write C++ if needed; for C programmers, it's essentially like writing C with classes. Rust programmers, especially those who care about low-level details like throughput and latency, almost always have a background in C or C++. If llama.cpp were written in Rust, there would likely be far fewer contributors. Considering that one needs to be at least interested in deep learning to contribute, the fact that it currently has 476 contributors is impressive. [1] I think this is one of the most important reasons the project can move so fast and be such an essential project in the LLM scene.

[1]: https://github.com/ggerganov/llama.cpp/graphs/contributors


Llama.cpp is a lot of C SIMD. I think using C++ elsewhere in the project just made sense in its infancy, when it was CPU only.

Rust is really interesting for stuff besides running the actual llm though. Python can be a huge performance/debugging pain when your frontend and such get huge.


> Rust is really interesting for stuff besides running the actual llm though.

That's the reason why llama.cpp is so attractive though. It just runs the LLM.


That's not giving the project enough credit, though; it implements a ton of very useful features beyond the LLM itself.


My understanding is the proliferation of “XYZ-cpp” AI frameworks is due to the c++ support in Apple’s gpu library ‘Metal’, and the popularity of apple silicon for inference (and there are a few technical reasons for this): https://developer.apple.com/metal/cpp/


Because they didn't write llama.cpp.


What does Rust offer here? More people know C++, Rust's safety guarantees get in the way of rapid exploratory development, and arguably C++ templates (I don't think they are used here, so a moot point, I know) are far superior to Rust's.


Seriously? 'But rust'.


This is just a wrapper wrapped in marketing. If you are looking for a proper inference server, check out:

easiest: ollama

best for serving: vLLM

OpenAI compatible: LocalAI


> best for serving: vLLM

Thanks, I was looking exactly for this!


How does this compare to llamafile? I recently started experimenting with pointing my OpenAI web apps at a local llamafile and it's pretty straightforward: http://blog.pamelafox.org/2024/01/using-llamafile-for-local-...

Llamafile doesn't support an OpenAI-compatible embedding endpoint yet, but I filed an issue and they seem interested in adding it.
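
For reference, a minimal sketch of that setup with the openai Python client, assuming llamafile is serving its OpenAI-compatible API locally (the port and model name below are placeholders, not llamafile defaults I can vouch for):

    from openai import OpenAI

    # Point the standard OpenAI client at the local llamafile server
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",  # placeholder; llamafile serves whatever model it was started with
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)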


I think drop-in replacements for OpenAI's API are an interesting idea but should be avoided in practice. I believe writing your own wrapper around llama.cpp, using its special flags and good types for your functions (something the IDE can use to give suggestions), is the better approach, because it reminds the user that they can't simply change the model name and expect everything to work the same. There is all manner of model-specific "stuff" the programmer needs to know about: the prompt template, the max context window, whether the model is loaded in llama.cpp with grammar support or not, and what sort of log-probs output one can expect from the endpoint...

To this end, I'm working on a simple library (that sadly doesn't have a shiny website like this post) and will make it public soon.
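
As an illustration of the idea (not the commenter's library; the names and fields here are hypothetical), a typed wrapper can surface the model-specific details right in the IDE:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ModelConfig:
        # Model-specific facts the caller has to know about; none of this
        # is communicated by an OpenAI-style "model" string.
        model_path: str
        prompt_template: str                 # e.g. "[INST] {user} [/INST]"
        max_context_tokens: int
        grammar_path: Optional[str] = None   # GBNF grammar file, if the model is run with one

        def format_prompt(self, user_message: str) -> str:
            # The wrapper, not the server, is responsible for the prompt format
            return self.prompt_template.format(user=user_message)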


Refreshingly good take. As much as I like role/content, actually knowing how things get tokenized and processed is invaluable for mastering LLMs as an applied ML practitioner.


I disagree, for 2 reasons:

- OpenAI-style APIs expose the extra features (like llama.cpp's grammar) anyway, as extra parameters (see the sketch below).

- There's a huge productivity benefit to making your backend swappable, without having to redo all the API calls.

Prompt formatting in particular is a huge sticking point with OpenAI endpoints, though. Something does need to be done about that.
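
On the first point above, the openai Python client lets you pass backend-specific parameters alongside the standard ones; whether a local server actually honors a field like "grammar" is an assumption that depends on the backend (the URL and model name below are placeholders):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": "Is the sky blue?"}],
        # extra_body fields are forwarded verbatim; a backend that doesn't
        # understand "grammar" will typically just ignore it
        extra_body={"grammar": 'root ::= "yes" | "no"'},
    )
    print(resp.choices[0].message.content)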


This sounds quite nice and like something I'd be interested in. C/C++?


Thanks! I'm using Python but with heavy use of types (miss my C++ era).

I wanted to be able to call llama.cpp from Python, but I didn't want to use the llama-cpp-python wrapper because it automatically downloads and builds llama.cpp. I like the simplicity of llama.cpp and its Unix philosophy of doing one thing well, so I prefer to build llama.cpp myself and then call it from Python.
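
For what it's worth, the "build it yourself, call it from Python" approach can be as small as a subprocess call; the binary name and flags below match the llama.cpp example CLI as of early 2024 and may well have changed since:

    import subprocess

    def run_llama(prompt: str, model_path: str, n_predict: int = 128) -> str:
        # Invoke a separately built llama.cpp binary; adjust the path and flags
        # to whatever your checkout actually produces.
        result = subprocess.run(
            ["./main", "-m", model_path, "-p", prompt, "-n", str(n_predict)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    print(run_llama("Q: What is the capital of France? A:", "models/model.gguf"))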


> I didn't want to use the llama-cpp-python wrapper because it automatically downloads and builds llama.cpp

As in, after you've installed the package, it will pause your application at runtime to download and build llama.cpp? That's not how bindings are supposed to work! That sounds seriously cursed.


I meant that when you install llama-cpp-python, it downloads and builds llama.cpp _for_ you, which I don't like because then you're stuck with llama.cpp commits that are kinda old.


Roger. I'll keep an eye out, do post it on HN :-)


Copy that. Will post it on HN in the next few days!


Very cool! Do you have a less shiny website or a github link :)


Thanks for showing interest! I'm going to finish up some final touches, and soon I will set up a less shiny website and make the GitHub repo public.


If anyone is interested in avoiding bloat, llama.cpp already includes an OpenAI compatible API:

https://github.com/ggerganov/llama.cpp/blob/master/examples/...
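
For example, once the server example is running (it defaults to port 8080, if memory serves), the OpenAI-style chat endpoint can be hit with nothing but an HTTP request; the model field is just a label here, since the server runs whatever model it was started with:

    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",  # label only; the loaded model is fixed at server start
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])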


There's also a native C++ implementation now.

...Unfortunately, they both have issues: the C++ version straight-up ignores parameters like temperature, and the Python implementation does not support batching.


Dumb question: why is having just 3 MB so important, given that any good text model requires tens of GB of disk and RAM?


One possible reason, which I’m not sure applies here, is that a server that small can fit inside cpu cache, and thereby give very low latency responses (which also increases concurrency).

Obviously only relevant for non-inference API calls.


By the time this is production ready it will no longer fit in that cache. There is a reason it is tiny: notice how the majority of the features are planned. It wouldn't take much for an experienced engineer to simply deploy llama.cpp or one of the other inference backends directly themselves. llama.cpp already includes an OpenAI-compatible API:

https://github.com/ggerganov/llama.cpp/blob/master/examples/...


Oh nvm, you mean non inference calls.

...But still, these are not very common API calls? Generally an OpenAI endpoint is mostly inference calls, right? And llama.cpp's slowness is going to blow that advantage away.


I think the size is not very important, but it can be a good measure of the dependencies. If something needs PyTorch and other Python stuff, it's a more complex install than something that's standalone. That said, llama.cpp (which this is based on) still needs the CUDA toolkit (a 4 GB install?) to run on a GPU. On a Mac I'm not so sure. So it's a bit of misdirection, unless we're only talking about CPU.

(Someone correct me if I'm wrong, I'm only familiar with building llama.cpp, can it run from just the binary without cuda?)


The original llama.cpp did not have CUDA support. It was a pure CPU binary using vector instructions for acceleration. IIRC it uses Apple's "Accelerate" framework for faster computation on the M-series CPUs.


Also, they include llama.cpp; I guess they just download it instead of bundling it with the binary like Ollama does.


I recommend using https://ollama.ai/ if you don't care about OpenAI compatibility.


In case this thread was posted by the author (or they are reading this comment), I'd strongly suggest adding the compensation and the region you expect people to be in.

It's fully possible the post was done in a hurry and you didn't think about it beforehand, but it looks hand-wavy, which is not a good signal to the experienced candidates you're trying to attract.


Just FYI, you have whisper.cpp in the pipeline, but be advised that for some reason it barely works. The tiny model using transformers.js running via WASM outperforms any whisper.cpp model. fast-whisper is much better here and just as fast.


AGPL is fully copyleft, even for SaaS. Which is fine if they want to go that route, but I think they should make the cost of a commercial license obvious on their website.

Because if you want to comply with AGPL then you can't build a product around this unless you negotiate a different license. Unless you want to release all of your own code. Which might be possible also.

But I would go with llama.cpp, ollama, candle (Rust), or whatever. None of those are copyleft.


We just had a horrendous experience trying to deploy the example llama.cpp server for production. In addition to bugs, stability issues, and general fragility, when it did work it was slow as molasses on an A100. And we spent a lot of effort sticking with it because of the grammar feature.

...So I would not recommend a batched llama.cpp server to anyone, TBH, when there is a grab bag of splendid OpenAI endpoints like TabbyAPI, Aphrodite, LiteLLM, vLLM...


Wasn’t it written for CPU only? A100 would better take an optimized PyTorch or Tensorflow implementation. Just trying to understand the logic here.


Originally CPU only, but it can do GPU too.


Does anything else support grammar?


How does this specifically compare to llama.cpp, vLLM, and oobabooga? All of them already have an OpenAI-compatible API, with some caveats.


Ooba is a monstrosity with gigabytes of dependencies. I love it and use it daily, but if all you need is a compatible API, it's massive overkill.


Not an author but the answer seems to involve their espoused focus on ease-of-use and “it-just-works” levels of modularity. See their list of plans, which include a few LLMs and a variety of adjacent tools/features.

It’s a basic value prop and I’m way behind the times on local inference (so barely a few months lol), but it seems different enough that I’m excited to try it. Other tools have tried this path but I’ve seen none with a demo page quite this convincing…


I cannot really wrap my head around what this is. Is this the whole model? Is this some sort of middleware between the user and OpenAI?


It's someone taking llama.cpp, strapping it to a HTTP server lib and implementing the OpenAI REST API, then putting some effort into a shiny website and docs, because AI-anything is a reputation and/or investment magnet right now.

Not that it isn't a nice idea -- making it easy to test existing OpenAI-based apps against competing open models is a pretty good thing.


Or… it’s just someone who wanted to build cool stuff.

I see the value in being able to locally run apps designed to talk to OpenAI apis.


There are dozens of other endpoints that will do this, and possibly do it better.

I'm all for cool innovations, but... I'm not really seeing the advantage here beyond the shiny website to attract VCs.


"Nitro is a drop-in replacement for OpenAI's REST API"

It's not a model, it puts an API on local models.


So I still need to download and host models myself.

I found that to be incredibly impractical. I tried to do it for my project AIMD, but the cost and quality just made absolutely no sense even with the top models.


Well, the market for local inference is already quite large, to say the least. “It didn’t pencil out in my business's favor” doesn't seem like a fair criticism, especially for an app clearly focused on the hobbyist-to-SMB market, where compute costs are dwarfed by the costs of wages and increased mental load.

I definitely see your specific point though, and have found the same for high-level use cases. Local models become really useful when you need smaller models for ensemble systems, to give one class of use case you might want to try out: e.g. proofreading, simple summarization, tone detection, etc.


Or you could use something like omnitool (https://github.com/omnitool-ai/omnitool) and interface with both cloud and local AI, not limited to LLMs.


Not to be confused with https://nitro.unjs.io the server tech behind Nuxt and SolidStart


Nor AWS's lightweight hypervisor for advanced security, Nitro:

https://aws.amazon.com/ec2/nitro/


I love this! I will test it on a bunch of embedded robot servers, and I'd be more interested in the vision part later.


Look... I appreciate a cool project, but this is probably not a good idea.

> Built on top of the cutting-edge inference library llama.cpp, modified to be production ready.

It's not. It's literally just llama.cpp -> https://github.com/janhq/nitro/blob/main/.gitmodules

Llama.cpp makes no pretense at being a robust, safe, network-ready library; it's a high-performance library.

You've made no changes to llama.cpp here; you're just calling the llama.cpp API directly from your drogon app.

Hm.

...

Look... that's interesting, but honestly, I know there's this wave of "C++ is back!" stuff going on, but building network applications in C++ is very tricky to do right, and while this is cool, I'm not sure 'llama.cpp is in C++ because it needs to be fast' is a good reason to go 'so let's build a network server in C++ too!'.

I mean, I guess you could argue that since llama.cpp is a C++ application, it's fair for them to offer their own server example with an openai compatible API (which you can read about here: https://github.com/ggerganov/llama.cpp/issues/4216, https://github.com/ggerganov/llama.cpp/blob/master/examples/...).

...but a production ready application?

I wrote a Rust binding to llama.cpp and my conclusion was that llama.cpp is pretty bleeding-edge software, and bluntly, you should process-isolate it from anything you really care about if you want to avoid undefined behavior after long-running inference sequences, because it updates very often, and often breaks. Those breaks are usually UB. It does not have a 'stable' version.

Furthermore, when you run large models and run out of memory, C++ applications are notoriously unreliable in their 'handle OOM' behaviour.

Soo.... I know there's something fun here, but really... unless you had a really, really compelling reason to write your server software in C++ (and I see no compelling reason here), I'm curious why you would?

It seems enormously risky.

The quality of this code is 'fun', not 'production ready'.


What is the deal with open-source projects advertising features that don't even exist? It's free software bro like chill with the grift.


These pages and even the github readme are more targeted at their investors first, users second.


Well it is a huge red flag to refer to items on the roadmap as if they are features, whether it is free software or has commercial intentions.


This looks cool, but I paused when I saw that according to the curl examples, Nitro will accept a request specifying "gpt-3.5-turbo" as the model and then presumably just roll with it, using whatever local LLM has been loaded instead.

I hope this is a typo/mistake in the docs, because if not, that's a terrible idea. Nitro cannot serve GPT-3.5, so it should absolutely not pretend to. "Drop-in replacement" doesn't mean lying about capabilities. If that model is specifically requested, Nitro should return an error, not silently do something else instead.


The reason is that most software that integrates with OpenAI will automatically choose that model; this is meant to snatch those requests up and serve an alternative. Most of the time, that software doesn't let you choose the model you want (but maybe lets you set the inference server).

But... I do agree, this should be feature-gated behavior
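
A rough sketch of what that feature gate could look like on the server side; the names here are hypothetical, not Nitro's actual behavior:

    KNOWN_MODELS = {"my-local-llama"}
    ALIAS_OPENAI_MODELS = False  # opt-in gate for snatching "gpt-*" requests

    def resolve_model(requested: str) -> str:
        if requested in KNOWN_MODELS:
            return requested
        if ALIAS_OPENAI_MODELS and requested.startswith("gpt-"):
            # Explicitly opted in: map OpenAI model names to the local model
            return "my-local-llama"
        # Default: refuse to pretend we can serve a model we don't have
        raise ValueError(f"unknown model: {requested}")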


That's one problem with these drop-in replacements (DiRs). Since the model needs to be loaded using llama.cpp (and llama.cpp uses a model path), the DiR either needs to accept a model path as the model name (a type mismatch), or just assume that your models are in a certain path and then take the model file name as the model name (better). But that means the DiR needs to control the model loader (and in most cases they actually download and build it automatically after install). I'd rather build my own llama.cpp and just have a nice interface to it that stays out of my way.


That's true, but not my point. My point is that if the request specifies GPT-3.5, Nitro knows that it cannot possibly serve that model, so anything other than returning an error is simply lying to the client, which is a really bad idea.


> which is a really bad idea.

Why?


Because if the client specifically requests GPT-3.5, but is silently being served something else instead, the client will rely on having GPT-3.5 capabilities without them actually being available, which is a recipe for breakage.


You do understand that the client will be written by the same people setting up the inference server?


Because it's lying to the client?


And why is that bad?

Your mindset would mean that Windows would have next to no backwards compatibility, for instance.


The size of the framework is not the most important factor - the model weights are usually 10x+ the size of the framework.

The most important factor is inference speed. For something called Nitro, I really expected speed benchmarks. I'd be interested in CPU, CUDA, and MPS at different batch sizes.


AFAICT, Nitro is just a wrapper around llama.cpp. Therefore, you can simply look at llama.cpp benchmarks, of which there are plenty.


Oobabooga and other front ends and similar projects have, in my testing, had upwards of a 50% difference in inference speed on the same model and settings, so benchmarks are still useful.


Ooba is an outlier, and has tons of overhead over llama.cpp and llama-cpp-python for some reason.

Most llama.cpp openai servers are pretty close to vanilla llama.cpp, albeit without the batching support.


Going to go against the naysaying here. I understand that it might not follow best practices, but I can see a lot of benefit to this.

Imagine I want to test something fully offline, but my app was built using OpenAI's APIs; ideally I can do a quick swap now and have an offline-compatible model.

Additionally, let's say I want to offer a free version of AI to my non-paying users, but for my paying users I can afford to let them use OpenAI. This allows for an easier swap without having to rewrite everything.
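
A minimal sketch of that kind of swap with the openai Python client, assuming an OpenAI-compatible server is running locally (the local URL and model names are placeholders):

    from openai import OpenAI

    def client_and_model(user_is_paying: bool) -> tuple[OpenAI, str]:
        if user_is_paying:
            # Real OpenAI; reads OPENAI_API_KEY from the environment
            return OpenAI(), "gpt-3.5-turbo"
        # Free tier: same client code, pointed at a local OpenAI-compatible server
        return OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed"), "local-model"

    client, model = client_and_model(user_is_paying=False)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize this in one line: ..."}],
    )
    print(resp.choices[0].message.content)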


Exactly this.

There are many of us who are running solo operations, and the only alternative, if you want quick A/B testing in the broader scheme of your entire application flow, is to either write your own wrappers or use some humongous, horrendous abstraction layer like langchain, which is going to cost you 2x the time in the future.

I’ve added support for multiple APIs in my app, but it's definitely a non-trivial task to test new ones, especially with so many on the market.


There are dozens of local openai endpoints already. The best one depends on your hardware and available time.


Did the designer from Linear who created this theme get a humongous bonus? I've seen so many copy-cats it's crazy! Like the Twitter Bootstrap 3 theme back in the day.

@Linear, you should give that designer a cool 50 g's bonus.


The CEO and co-founder, Karri Saarinen, is a designer, and was designing it in the early days. I don't know about now, but the attention to detail is still there.


Linear is a great example but hardly the first. This look is EVERYWHERE.





