One possible reason, though I'm not sure it applies here, is that a server that small can fit inside the CPU cache and thereby serve very low-latency responses (which also increases concurrency).
Obviously only relevant for non-inference API calls.
By the time this is production-ready it will no longer fit in that cache. There is a reason it is tiny: notice how the majority of the features are still only planned. It wouldn't take much for an experienced engineer to simply deploy llama.cpp or one of the other inference backends directly themselves; llama.cpp already includes an OpenAI-compatible API.
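For example, here's a minimal sketch of hitting llama.cpp's built-in server through the standard OpenAI Python client, assuming a llama-server instance is running locally on its default port 8080 (the model name is just a placeholder):

    # Point the OpenAI client at a locally running llama-server
    # (started with e.g. `llama-server -m model.gguf`); the API key is unused.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    response = client.chat.completions.create(
        model="local-model",  # placeholder; llama-server accepts an arbitrary name
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)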
...But still, aren't these fairly uncommon API calls? Generally an OpenAI-compatible endpoint is serving mostly inference calls, right? And llama.cpp's slower inference is going to blow that latency advantage away.