This looks cool, but I paused when I saw that according to the curl examples, Nitro will accept a request specifying "gpt-3.5-turbo" as the model and then presumably just roll with it, using whatever local LLM has been loaded instead.
I hope this is a typo/mistake in the docs, because if not, that's a terrible idea. Nitro cannot serve GPT-3.5, so it should absolutely not pretend to. "Drop-in replacement" doesn't mean lying about capabilities. If that model is specifically requested, Nitro should return an error, not silently do something else instead.
The reason is that most software that integrates with OpenAI defaults to that model - the whole point is to snatch those requests up and serve an alternative. Most of the time, that software doesn't let you choose which model you want (but may let you set the inference server).
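A minimal sketch of the situation described here: a hypothetical client that hardcodes the model name but exposes the API base URL as configuration, so pointing it at a local server is the only way to swap in another model. The URL, port, and function names are illustrative, not any real client's API.

```python
import json

def build_request(prompt, base_url="https://api.openai.com/v1"):
    # The model name is fixed by the client software, not by the user;
    # only the server endpoint is configurable.
    payload = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }
    return base_url + "/chat/completions", json.dumps(payload)

# Redirecting to a local OpenAI-compatible server: the payload still
# claims gpt-3.5-turbo, which the local server cannot actually serve.
url, body = build_request("hello", base_url="http://localhost:3928/v1")
```

This is why the interception works at all: the request that reaches the local server is byte-for-byte what would have gone to OpenAI.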
But... I do agree, this should be feature-gated behavior.
That's one problem with these drop-in replacements (DiRs). Since the model needs to be loaded by llama.cpp (which takes a model path), the DiR either has to accept a model path as the model name (a type mismatch), or assume your models live under a fixed directory and treat the model file name as the model name (better). But that means the DiR has to control the model loader - and in most cases they actually download and build it automatically after install. I'd rather build my own llama.cpp and just have a nice interface to it that stays out of my way.
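The "fixed models directory, file name as model name" option could look something like this sketch. The directory, extension, and function name are assumptions for illustration, not how any particular DiR does it:

```python
from pathlib import Path

MODELS_DIR = Path("/var/models")  # assumed fixed models directory

def resolve_model(name: str) -> Path:
    # Reject separators and dotfiles so a model "name" can't be used
    # for path traversal outside the models directory.
    if "/" in name or "\\" in name or name.startswith("."):
        raise ValueError(f"invalid model name: {name!r}")
    return MODELS_DIR / f"{name}.gguf"
```

This keeps the API type honest (the client sends a name, not a path) while still letting the server hand llama.cpp the file path it needs.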
That's true, but not my point. My point is that if the request specifies GPT-3.5, Nitro knows that it cannot possibly serve that model, so anything other than returning an error is simply lying to the client, which is a really bad idea.
Because if the client specifically requests GPT-3.5, but is silently being served something else instead, the client will rely on having GPT-3.5 capabilities without them actually being available, which is a recipe for breakage.
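The strict behavior being argued for here is simple to sketch: serve only models that are actually loaded, and return an OpenAI-style error object for anything else. The loaded-model set and handler are hypothetical; the error shape loosely mirrors OpenAI's "model not found" response.

```python
# Whatever llama.cpp actually has loaded (illustrative name).
LOADED_MODELS = {"mistral-7b-instruct"}

def handle_chat_request(model: str):
    if model not in LOADED_MODELS:
        # Refuse instead of silently substituting: the client asked for
        # capabilities this server cannot provide.
        return 404, {"error": {
            "message": f"The model {model!r} does not exist",
            "type": "invalid_request_error",
            "code": "model_not_found",
        }}
    return 200, {"model": model}
```

A client that gets this 404 can fail loudly or fall back deliberately, instead of discovering mid-conversation that it never had GPT-3.5.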