A few months ago, I benchmarked FastAPI on an i9 MacBook Pro and couldn't believe my eyes. A trivial REST endpoint to `sum` two integers took 6 milliseconds to evaluate. That may be acceptable when the server sits in another city, but it is far too slow when the client and server run on the same machine.
FastAPI would have bottlenecked the inference of our lightweight UForm neural networks, which recently trended on HN under the title "Beating OpenAI CLIP with 100x less data and compute". (Thank you all for the kind words!) So I wrote another library.
It has been a while since I last wrote networking libraries, so I was eager to try the newer io_uring networking functionality added by Jens Axboe in kernel 5.19. TL;DR: it's excellent! We used pre-registered buffers and re-allocated file descriptors from a managed pool. Some other parts, like multi-shot requests, also look intriguing, but we couldn't see a flawless way to integrate them into UJRPC. Maybe next time.
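To make those two tricks concrete, here is a minimal liburing sketch — not UJRPC's actual code, just an illustration of registering a buffer and addressing a descriptor through the fixed-file table; the file opened here stands in for an accepted socket:

```c++
// Minimal liburing sketch (not UJRPC's code): register a buffer and a file
// descriptor up front, then submit a read that refers to both by index,
// so the kernel skips per-request fd lookups and buffer pinning.
#include <liburing.h>
#include <fcntl.h>
#include <cstdio>
#include <vector>

int main() {
    io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) != 0) return 1;

    // Pre-register one reusable buffer with the kernel.
    std::vector<char> buffer(4096);
    iovec iov {buffer.data(), buffer.size()};
    io_uring_register_buffers(&ring, &iov, 1);

    // Put a descriptor into the fixed-file table;
    // in a real server this slot would hold an accepted socket.
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;
    io_uring_register_files(&ring, &fd, 1);

    // The submission names the registered buffer and the fixed-file slot.
    io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, 0 /* fixed-file index */, buffer.data(),
                             buffer.size(), 0 /* offset */, 0 /* buffer index */);
    sqe->flags |= IOSQE_FIXED_FILE;
    io_uring_submit(&ring);

    io_uring_cqe *cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);
    std::printf("Read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```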
Like a parent with two kids, we tell everyone we love Kernel Bypass and SIMD equally. So I decided to combine the two and potentially build one of the fastest implementations of the most straightforward RPC protocol: JSON-RPC. ~~Healthy and Fun~~ Efficient and Simple, what can be better?
By now, you may have guessed at least one of the dependencies: `simdjson` by Daniel Lemire, which has become the industry standard for fast JSON parsing. io_uring is generally very fast, even on a single core, and adding more polling threads may only increase congestion, so we wanted to keep using no more than one thread. But parsing a message involves more work than just invoking a JSON parser.
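The JSON-RPC core of a request is the easy part. As a minimal sketch (not UJRPC's actual request path), pulling a `sum` call apart with simdjson's DOM API looks roughly like this:

```c++
// A minimal sketch (not UJRPC's request path) of extracting a JSON-RPC
// `sum` call with simdjson's DOM API.
#include <simdjson.h>
#include <cstdio>
#include <string_view>

int main() {
    using namespace simdjson;
    dom::parser parser;

    // simdjson expects a few bytes of padding after the input;
    // the `_padded` literal provides it.
    padded_string request =
        R"({"jsonrpc": "2.0", "method": "sum", "params": {"a": 2, "b": 3}, "id": 1})"_padded;

    dom::element doc = parser.parse(request);
    std::string_view method = doc["method"];
    int64_t a = doc["params"]["a"];
    int64_t b = doc["params"]["b"];
    std::printf("%.*s(%lld, %lld) = %lld\n", int(method.size()), method.data(),
                (long long)a, (long long)b, (long long)(a + b));
    return 0;
}
```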
JSON-RPC is transport-agnostic. Incoming requests can arrive over HTTP, prefixed with rows of headers; those would have to be POSTs and generally carry `Content-Length` and `Content-Type`. There is a SIMD-accelerated library for that as well: `picohttpparser`, which uses SSE and is maintained by H2O.
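Stripping that HTTP envelope before the body reaches the JSON parser looks roughly like this — a hedged sketch rather than UJRPC's code, with a made-up request string for illustration:

```c++
// A hedged sketch of peeling the HTTP envelope off a JSON-RPC POST with
// picohttpparser before handing the body to the JSON parser.
#include "picohttpparser.h"
#include <cstdio>
#include <cstring>

int main() {
    const char request[] =
        "POST / HTTP/1.1\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: 45\r\n"
        "\r\n";

    const char *method = nullptr, *path = nullptr;
    size_t method_len = 0, path_len = 0, num_headers = 16;
    int minor_version = 0;
    phr_header headers[16];

    // Returns the number of bytes consumed by the headers,
    // -1 for a malformed request, -2 for an incomplete one.
    int consumed = phr_parse_request(request, std::strlen(request), &method, &method_len,
                                     &path, &path_len, &minor_version, headers,
                                     &num_headers, /*last_len=*/0);
    if (consumed < 0) return 1;

    std::printf("%.*s %.*s, %zu headers, body starts at byte %d\n",
                (int)method_len, method, (int)path_len, path, num_headers, consumed);
    return 0;
}
```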
The story doesn't end there. JSON is limited: passing binary strings is a nightmare, and the most common workaround is to encode them in Base64. So we took Turbo-Base64 from the powturbo project to decode those binary strings quickly.
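A decoding sketch, with a big caveat: the `tb64declen`/`tb64dec` names and signatures below are how I remember Turbo-Base64's README and `turbob64.h`, not anything lifted from UJRPC's sources, so verify them against the header before relying on this:

```c++
// Caveat: tb64declen/tb64dec are assumed from Turbo-Base64's README;
// double-check turbob64.h before use.
#include "turbob64.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // A binary payload as it would arrive inside a JSON-RPC string value.
    std::string encoded = "aGVsbG8sIGJpbmFyeSB3b3JsZA==";
    auto *in = reinterpret_cast<const unsigned char *>(encoded.data());

    size_t decoded_len = tb64declen(in, encoded.size()); // 0 on invalid input
    if (!decoded_len) return 1;

    std::vector<unsigned char> decoded(decoded_len);
    tb64dec(in, encoded.size(), decoded.data());
    std::printf("Decoded %zu bytes\n", decoded_len);
    return 0;
}
```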
The core implementation of UJRPC is under 2000 lines of C++. Knowing that those lines connect 3 great libraries with the newest and coolest parts of Linux is enough to put a smile on my face. Most people are more rational, so here is another reason to be cheerful.
- FastAPI throughput: 3'184 rps.
- Python gRPC throughput: 9'849 rps.
- UJRPC throughput:
  - Python server with io_uring: 43'000 rps.
  - C server with POSIX: 79'000 rps.
  - C server with io_uring: 231'000 rps.
Granted, this is not yet a batteries-included server. It can't balance the load, manage threads, spell the S in HTTPS, or call your parents when you misbehave in school. But some of that you shouldn't expect from a web server anyway.
After following the standardization process of executors in C++ for the last N+1 years, we adopted "bring your own runtime" and "bring your own thread pool" policies. HTTPS support, however, is our next primary objective.
---
Of course, this is a pre-production project and is bound to have bugs. Don't hesitate to report them. We have huge plans for this tiny package and may eventually make it the default transport of UKV: https://github.com/unum-cloud/ukv
I've got an idea for how to do it with `parse_many(json, window)` and `truncated_bytes()`, but it would be easier if there were an example out there I could look at.
Pages like [0] and [1] suggest it should be possible, but I am just not seeing (or yet able to produce myself) working code — my closest attempt so far is sketched after the links.
[0] https://github.com/simdjson/simdjson/issues/188
[1] https://github.com/simdjson/simdjson/blob/master/doc/iterate...
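Here is the shape of what I have in mind — a sketch built on the documented `parse_many`/`truncated_bytes` interface, not code I can vouch for yet. The idea: parse whatever complete documents a buffer contains, then carry the truncated tail over to the next read from the socket.

```c++
// Sketch: parse complete documents with parse_many(), then use
// truncated_bytes() to learn how much of the unfinished tail to keep.
#include <simdjson.h>
#include <cstdio>

int main() {
    using namespace simdjson;

    // Two complete documents followed by a truncated third one.
    padded_string buffer =
        R"({"method":"sum","params":[1,2]} {"method":"sum","params":[3,4]} {"met)"_padded;

    dom::parser parser;
    dom::document_stream docs;
    auto error = parser.parse_many(buffer, /*window=*/65536).get(docs);
    if (error) return 1;

    size_t complete = 0;
    try {
        for (dom::element doc : docs) {
            (void)doc; // handle each whole document here
            ++complete;
        }
    } catch (simdjson_error const &e) {
        // If the truncated tail surfaces as a parse error instead,
        // stop and rely on truncated_bytes() below to find the cut-off.
        std::printf("Stopped early: %s\n", e.what());
    }

    // Bytes at the tail that belong to an unfinished document; prepend them
    // to the next chunk read from the socket and call parse_many() again.
    std::printf("Parsed %zu documents, %zu bytes truncated\n",
                complete, docs.truncated_bytes());
    return 0;
}
```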