These folks just came on the scene two weeks ago, spamming reddit's LocalLLaMA with the next clone of a llama.cpp wrapper like it's a new thing. Nothing in the feature set here is unique. An experienced developer could replicate this in hours in any language. More time was spent on the different permutations of their landing page than on the actual app. Just go to /r/localllama and search for jan.ai and see for yourself.
Looks like they raised funding though. They're hiring.
Side-note: I really don't get the C++ in AI thing at all. It's becoming a meme. I wonder why they didn't go with Rust instead. It'll be easier to deploy to the edge as WASM, too.
Although Rust is an amazing language with rapid adoption, it still has a relatively small user base in low-level programming, which is where llama.cpp operates. As a result, the pool of talent that can contribute to such projects is more limited. Almost all low-level programmers can write C++ if needed; for C programmers, it's essentially writing C with classes. Rust programmers, especially those who care about low-level details like throughput and latency, almost always have a background in C or C++. If llama.cpp were written in Rust, there would likely be far fewer contributors. Considering that one needs to be at least interested in deep learning to contribute, the fact that it currently has 476 contributors is impressive. [1] I think this is one of the most important reasons the project can move so fast and be such an essential project in the LLM scene.
Llama.cpp is a lot of C SIMD. I think using C++ elsewhere in the project just made sense in its infancy, when it was CPU only.
Rust is really interesting for stuff besides running the actual LLM though. Python can be a huge performance/debugging pain when your frontend and such get huge.
My understanding is that the proliferation of "XYZ-cpp" AI frameworks is due to the C++ support in Apple's GPU library 'Metal', and the popularity of Apple silicon for inference (and there are a few technical reasons for this): https://developer.apple.com/metal/cpp/
What does Rust offer here? More people know C++, the safety guarantees Rust enforces get in the way of rapid exploratory development, and arguably C++ templates (I don't think they're used here, so a moot point, I know) are far superior to Rust's generics.
I think drop-in replacements for OpenAI's API are an interesting idea but should be avoided in practice. I believe writing your own wrapper around llama.cpp using its special flags, with good types for your functions (something the IDE can use to give suggestions), is the better approach, because it reminds the user that they can't simply change the model name and expect everything to work the same. There is all manner of model-specific "stuff" the programmer needs to know about: the prompt template, the max context window, whether the model is loaded in llama.cpp with grammars enabled or not, what sort of logprobs output one can expect from the endpoint...
To this end, I'm working on a simple library (that sadly doesn't have a shiny website like this post) and will make it public soon.
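To give a rough idea of what I mean, here's a minimal sketch of that kind of typed, model-aware wrapper. The names (ModelSpec, complete) are hypothetical, and the /completion payload fields are assumptions that can differ between llama.cpp example-server versions:

```python
# Hypothetical sketch of a typed, model-aware wrapper instead of a generic
# "drop-in" OpenAI client. Field names for the llama.cpp server payload are
# assumptions and may differ across server versions.
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass(frozen=True)
class ModelSpec:
    name: str                       # e.g. "mistral-7b-instruct-q4"
    prompt_template: str            # e.g. "[INST] {prompt} [/INST]"
    max_context: int                # model-specific context window, in tokens
    grammar: Optional[str] = None   # GBNF grammar text, if output should be constrained


def complete(spec: ModelSpec, prompt: str, n_predict: int = 256,
             base_url: str = "http://localhost:8080") -> str:
    # A real wrapper would also count tokens and refuse prompts that exceed
    # spec.max_context, instead of letting the server silently truncate.
    payload = {
        "prompt": spec.prompt_template.format(prompt=prompt),
        "n_predict": n_predict,
    }
    if spec.grammar is not None:
        payload["grammar"] = spec.grammar
    resp = requests.post(f"{base_url}/completion", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["content"]
```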
Refreshingly good take. As much as I like role/content, actually knowing how things get tokenized and processed is pretty invaluable for mastering LLMs as an applied ML practitioner.
Thanks! I'm using Python, but with heavy use of types (I miss my C++ era).
I wanted to be able to call llama.cpp from Python, but I didn't want to use the llama-cpp-python wrapper because it automatically downloads and builds llama.cpp, which I didn't want to do. I like the simplicity of llama.cpp and its unix philosophy of doing one thing well, so I prefer to build llama.cpp myself and then call it from Python.
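For anyone curious, a minimal sketch of that pattern, assuming a self-built llama.cpp binary on disk (the binary name and flags are assumptions and have changed across llama.cpp versions):

```python
# Sketch of calling a self-built llama.cpp binary from Python via subprocess,
# instead of going through llama-cpp-python. The binary name ("./main" in
# older builds, "llama-cli" in newer ones) and paths are assumptions.
import subprocess


def run_llama(prompt: str, model_path: str, n_predict: int = 128,
              binary: str = "./main") -> str:
    result = subprocess.run(
        [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)],
        capture_output=True, text=True, check=True,
    )
    # llama.cpp prints the prompt followed by the generated text to stdout
    return result.stdout
```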
> I didn't want to use the llama-cpp-python wrapper because it automatically downloads and builds llama.cpp
As in, after you've installed the package, it will pause your application at runtime to download and build llama.cpp? That's not how bindings are supposed to work! That sounds seriously cursed.
I meant that when you install llama-cpp-python, it downloads and builds llama.cpp _for_ you, which I don't like because then you're stuck with llama.cpp commits that are kinda old.
...Unfortunately they have issues. The C++ version straight-up ignores parameters like temperature, and the Python implementation does not support batching.
One possible reason, which I'm not sure applies here, is that a server that small can fit inside the CPU cache and thereby give very low-latency responses (which also increases concurrency).
Obviously only relevant for non-inference API calls.
By the time this is production ready it will no longer fit in that cache. There is a reason it is tiny: notice how the majority of the features are still just planned. It wouldn't take much for an experienced engineer to simply deploy llama.cpp or one of the other inference backends directly themselves; llama.cpp's example server already includes an OpenAI-compatible API.
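As a rough sketch of how little glue that takes, assuming the example server was started with something like `./server -m model.gguf --port 8080` and that the build exposes the OpenAI-compatible /v1/chat/completions route (newer builds do; older ones only had /completion):

```python
# Hitting llama.cpp's example server directly. The port and route are
# assumptions that depend on how and which version of the server was built.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # largely ignored: the server serves whatever model was loaded
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```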
...But still, these are not very common API calls? Generally an OpenAI endpoint is mostly inference calls, right? And llama.cpp's slowness is going to blow that advantage away.
I think the size is not very important, but it can be a good measure of the dependencies. If something needs PyTorch and other Python stuff, it's a more complex install than something that's standalone.
That said, llama.cpp (which this is built around) still needs the CUDA toolkit (a 4 GB install?) to run on a GPU. On a Mac I'm not so sure. So it's a bit of misdirection, unless we're only talking about CPU.
(Someone correct me if I'm wrong, I'm only familiar with building llama.cpp; can it run from just the binary without CUDA?)
The original llama.cpp did not have CUDA support. It was a pure CPU binary using vector instructions for acceleration. IIRC it uses Apple's "Accelerate" framework for faster computation on the M-series CPUs.
In case this thread was posted by the author (or they are reading this comment), I'd strongly suggest adding the compensation and the region you expect people to be in.
It's fully possible the post was done in a hurry and you didn't think about it beforehand, but it looks hand-wavy, which is not a good signal to the experienced candidates you're trying to attract.
Just FYI, you have whisper.cpp in the pipeline, but be advised that for some reason it barely works. The tiny model running in transformers.js via WASM outperforms any whisper.cpp model. faster-whisper is much better here and just as fast.
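For reference, a minimal sketch of the faster-whisper route (model size and audio path are placeholders):

```python
# Transcription with faster-whisper instead of whisper.cpp; "tiny" and
# "audio.wav" are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```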
AGPL is fully copyleft, even for SaaS. Which is fine if they want to go that route, but I think they should make the cost of a commercial license obvious on their website.
Because if you want to comply with the AGPL, you can't build a product around this unless you negotiate a different license, or unless you want to release all of your own code, which might also be possible.
But I would go with llama.cpp, ollama, candle (Rust), or whatever. None of those are copyleft.
We just had a horrendous experience trying to deploy the example llama.cpp server for production. In addition to bugs, stability issues, and general fragility, when it did work it was slow as molasses on an A100. And we spent a lot of effort sticking with it because of the grammar feature.
...So I would not recommend a batched llama.cpp server to anyone, TBH, when there is a grab bag of splendid OpenAI endpoints like TabbyAPI, Aphrodite, LiteLLM, vLLM...
Not the author, but the answer seems to involve their espoused focus on ease of use and “it-just-works” levels of modularity. See their list of plans, which includes a few LLMs and a variety of adjacent tools/features.
It’s a basic value prop and I’m way behind the times on local inference (so barely a few months lol), but it seems different enough that I’m excited to try it. Other tools have tried this path but I’ve seen none with a demo page quite this convincing…
It's someone taking llama.cpp, strapping it to an HTTP server lib and implementing the OpenAI REST API, then putting some effort into a shiny website and docs, because AI-anything is a reputation and/or investment magnet right now.
Not that it isn't a nice idea -- making it easy to test existing OpenAI-based apps against competing open models is a pretty good thing.
So I still need to download and host models myself.
I found that to be incredibly impractical. I tried to do it for my project AIMD, but the cost and quality just made absolutely no sense even with the top models.
Well, the market for local inference is already quite large, to say the least. “It didn't pencil out in my business's favor” doesn't seem like a fair criticism, especially for an app clearly focused on the hobbyist-to-SMB market, where compute costs are dwarfed by the costs of wages and increased mental load.
I definitely see your specific point though, and have found the same for high-level use cases. Local models become really useful when you need smaller models for ensemble systems, to give one class of use case you might want to try out -- e.g. proofreading, simple summarization, tone detection, etc.
Llama.cpp makes no pretense of being a robust, safe, network-ready library; it's a high-performance library.
You've made no changes to llama.cpp here; you're just calling the llama.cpp API directly from your drogon app.
Hm.
...
Look... that's interesting, but honestly: I know there's this wave of "C++ is back!" stuff going on, but building network applications in C++ is very tricky to do right, and while this is cool, I'm not sure 'llama.cpp is in C++ because it needs to be fast' is a good reason to go 'so let's build a network server in C++ too!'.
I wrote a Rust binding to llama.cpp, and my conclusion was that llama.cpp is pretty bleeding-edge software; bluntly, you should process-isolate it from anything you really care about if you want to avoid undefined behavior after long-running inference sequences, because it updates very often and often breaks. Those breaks are usually UB. It does not have a 'stable' version.
Furthermore, when you run large models and run out of memory, C++ applications are notoriously unreliable in their 'handle OOM' behaviour.
Soo... I know there's something fun here, but really... unless you had a really compelling reason to need to write your server software in C++ (and I see no compelling reason here), I'm curious why you would?
It seems enormously risky.
The quality of this code is 'fun', not 'production ready'.
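To make the process-isolation point concrete, here's a rough sketch of one way to do it: run the llama.cpp example server as a separate process and restart it when it dies (crash, UB fallout, OOM kill), so the parent application survives. The binary path and flags are assumptions.

```python
# Supervise the llama.cpp example server in its own process so that a crash
# or OOM kill doesn't take down the application that depends on it.
# CMD is an assumption about the binary name and flags.
import subprocess
import time

CMD = ["./server", "-m", "models/model.gguf", "--port", "8080"]


def supervise() -> None:
    while True:
        proc = subprocess.Popen(CMD)
        code = proc.wait()  # blocks until the server exits for any reason
        print(f"llama.cpp server exited with code {code}, restarting in 2s")
        time.sleep(2)       # back off briefly; a real supervisor would cap restarts


if __name__ == "__main__":
    supervise()
```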
This looks cool, but I paused when I saw that according to the curl examples, Nitro will accept a request specifying "gpt-3.5-turbo" as the model and then presumably just roll with it, using whatever local LLM has been loaded instead.
I hope this is a typo/mistake in the docs, because if not, that's a terrible idea. Nitro cannot serve GPT-3.5, so it should absolutely not pretend to. "Drop-in replacement" doesn't mean lying about capabilities. If that model is specifically requested, Nitro should return an error, not silently do something else instead.
The reason for that is that most software that integrates with OpenAI will automatically choose that model - this is meant to snatch those requests up and serve an alternative. Most of the time, that software doesn't let you choose which model you want (but maybe lets you set the inference server).
But... I do agree, this should be feature-gated behavior.
That's one problem with these DiRs. Since the model needs to be loaded by llama.cpp (and llama.cpp uses a model path), the DiR either needs to accept a model path as the model name (a type mismatch), or just assume that your models live in a certain directory and accept the model file name as the model name (better). But that means the DiR needs to control the model loader (and in most cases they actually download and build it automatically after install). I'd rather build my own llama.cpp and just have a nice interface to it that stays out of my way.
That's true, but not my point. My point is that if the request specifies GPT-3.5, Nitro knows that it cannot possibly serve that model, so anything other than returning an error is simply lying to the client, which is a really bad idea.
Because if the client specifically requests GPT-3.5, but is silently being served something else instead, the client will rely on having GPT-3.5 capabilities without them actually being available, which is a recipe for breakage.
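For illustration only (this is not Nitro's actual behavior), a hypothetical sketch of the feature-gated validation being argued for: only accept model names that map to locally loaded models, and return an OpenAI-style error otherwise.

```python
# Hypothetical server-side check: serve only models that are actually loaded,
# unless the "alias anything" gate is explicitly enabled. Names and paths are
# placeholders, not Nitro's implementation.
ALIAS_ANY_MODEL = False  # the feature gate suggested above
LOADED_MODELS = {
    "mistral-7b-instruct": "/models/mistral-7b-instruct.Q4_K_M.gguf",
}


def resolve_model(requested: str) -> str:
    """Map a requested model name to a local model path, or fail loudly."""
    if requested in LOADED_MODELS:
        return LOADED_MODELS[requested]
    if ALIAS_ANY_MODEL:
        # explicitly opted in: serve the (single) loaded model regardless of the name
        return next(iter(LOADED_MODELS.values()))
    # what a real server would serialize as an HTTP 404, OpenAI-style
    raise LookupError({
        "error": {
            "message": f"The model '{requested}' does not exist or is not loaded.",
            "type": "invalid_request_error",
            "code": "model_not_found",
        }
    })
```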
The size of the framework is not the most important factor - the model weights are usually 10x+ the size of the framework.
The most important factor is inference speed. For something called Nitro, I really expected speed benchmarks. I'd be interested in CPU, CUDA, and MPS at different batch sizes.
Oobabooga and other frontends and similar projects have, in my testing, had upwards of a 50% difference in inference speed on the same model and settings, so benchmarks are still useful.
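A rough sketch of the kind of benchmark that would help, assuming an OpenAI-compatible endpoint that reports a usage field (the URL and model name are placeholders):

```python
# Time one completion against an OpenAI-compatible endpoint and compute
# generated tokens per second from the reported usage field.
import time

import requests


def tokens_per_second(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    start = time.perf_counter()
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed


# e.g. run the same model/settings against different backends and compare:
# print(tokens_per_second("http://localhost:8080", "local", "Write a haiku about GPUs."))
```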
Going to go against the naysaying here. I understand that it might not follow best practices, but I can see a lot of benefit to this.
Imagine I want to test something fully offline, but my app was built using OpenAI's APIs; ideally I can do a quick swap now and have an offline-compatible model.
Additionally, let's say I want to offer a free version of AI to my non-paying users, while for my paying users I can afford to let them use OpenAI; this allows for an easier swap without having to rewrite everything.
There are many of us running solo operations, and the only alternative to quick A/B testing across your entire application flow is to either write your own wrappers or use some humongous, horrendous abstraction layer like LangChain, which is going to cost you 2x the time in the future.
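A minimal sketch of the quick swap being described, using the openai Python SDK (v1+), where the same code path can hit either OpenAI or a local OpenAI-compatible server; the local URL, port, and model name are assumptions:

```python
# Swap between OpenAI and a local OpenAI-compatible endpoint via configuration,
# without rewriting the calling code. The local base_url and model name are
# placeholders.
import os

from openai import OpenAI

if os.environ.get("USE_LOCAL_LLM") == "1":
    client = OpenAI(base_url="http://localhost:3928/v1", api_key="not-needed")
    model = "local-model"
else:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    model = "gpt-3.5-turbo"

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize this in one line: local inference rocks."}],
)
print(resp.choices[0].message.content)
```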
I've added support for multiple APIs in my app, but it's definitely a non-trivial task to test new ones, especially with so many on the market.
Did the designer from Linear who created this theme get a humongous bonus? I've seen so many copy-cats, it's crazy! Like the Twitter Bootstrap 3 theme back in the day.
@Linear, you should give that designer a cool 50 g's bonus.
The CEO and co-founder, Karri Saarinen, is a designer, and was designing it in the early days. I don't know about now, but the attention to detail is still there.