This looks cool, but I paused when I saw that according to the curl examples, Nitro will accept a request specifying "gpt-3.5-turbo" as the model and then presumably just roll with it, using whatever local LLM has been loaded instead.
I hope this is a typo/mistake in the docs, because if not, that's a terrible idea. Nitro cannot serve GPT-3.5, so it should absolutely not pretend to. "Drop-in replacement" doesn't mean lying about capabilities. If that model is specifically requested, Nitro should return an error, not silently do something else instead.
The reason is that most software that integrates with OpenAI defaults to that model - the whole point is to snatch those requests up and serve an alternative. Most of the time, that software doesn't let you choose which model you want (but may let you set the inference server).
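A minimal sketch of the situation described here: a hypothetical client that hardcodes the model name but exposes the API base URL as configuration, so pointing it at a local server is the only way to swap in another model. The URL, port, and function names are illustrative, not any real client's API.

```python
import json

def build_request(prompt, base_url="https://api.openai.com/v1"):
    # The model name is fixed by the client software, not by the user;
    # only the server endpoint is configurable.
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }
    return base_url + "/chat/completions", json.dumps(payload)

# Redirecting to a local OpenAI-compatible server: the payload still
# claims gpt-3.5-turbo, which the local server cannot actually serve.
url, body = build_request("hello", base_url="http://localhost:3928/v1")
```

This is why the interception works at all: the request that reaches the local server is byte-for-byte what would have gone to OpenAI.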
But... I do agree, this should be feature-gated behavior.
That's one problem with these drop-in replacements (DiRs). Since the model needs to be loaded by llama.cpp (which takes a model path), the DiR either has to accept a model path as the model name (a type mismatch), or assume your models live under a fixed directory and treat the model file name as the model name (better). But that means the DiR has to control the model loader - and in most cases they actually download and build it automatically after install. I'd rather build my own llama.cpp and just have a nice interface to it that stays out of my way.
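The "fixed models directory, file name as model name" option could look something like this sketch. The directory, extension, and function name are assumptions for illustration, not how any particular DiR does it:

```python
from pathlib import Path

MODELS_DIR = Path("/var/models")  # assumed fixed models directory

def resolve_model(name: str) -> Path:
    # Reject separators and dotfiles so a model "name" can't be used
    # for path traversal outside the models directory.
    if "/" in name or "\\" in name or name.startswith("."):
        raise ValueError(f"invalid model name: {name!r}")
    return MODELS_DIR / f"{name}.gguf"
```

This keeps the API type honest (the client sends a name, not a path) while still letting the server hand llama.cpp the file path it needs.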
That's true, but not my point. My point is that if the request specifies GPT-3.5, Nitro knows that it cannot possibly serve that model, so anything other than returning an error is simply lying to the client, which is a really bad idea.
Because if the client specifically requests GPT-3.5, but is silently being served something else instead, the client will rely on having GPT-3.5 capabilities without them actually being available, which is a recipe for breakage.
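The strict behavior being argued for here is simple to sketch: serve only models that are actually loaded, and return an OpenAI-style error object for anything else. The loaded-model set and handler are hypothetical; the error shape loosely mirrors OpenAI's "model not found" response.

```python
# Whatever llama.cpp actually has loaded (illustrative name).
LOADED_MODELS = {"mistral-7b-instruct"}

def handle_chat_request(model: str):
    if model not in LOADED_MODELS:
        # Refuse instead of silently substituting: the client asked for
        # capabilities this server cannot provide.
        return 404, {"error": {
            "message": f"The model {model!r} does not exist",
            "type": "invalid_request_error",
            "code": "model_not_found",
        }}
    return 200, {"model": model}
```

A client that gets this 404 can fail loudly or fall back deliberately, instead of discovering mid-conversation that it never had GPT-3.5.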