Show HN: Faster FastAPI with simdjson and io_uring on Linux 5.19 (github.com/unum-cloud)
290 points by ashvardanian on March 6, 2023 | 90 comments
A few months ago, I benchmarked FastAPI on an i9 MacBook Pro. I couldn't believe my eyes. A trivial REST endpoint to `sum` two integers took 6 milliseconds to evaluate. That is tolerable if you are targeting a server in another city, but far too much when the client and server apps are running on the same machine.

FastAPI would have bottlenecked the inference of our lightweight UForm neural networks, recently trending on HN under the title "Beating OpenAI CLIP with 100x less data and compute". (Thank you all for the kind words!) So I wrote another library.

It has been a while since I have written networking libraries, so I was eager to try the newer io_uring networking functionality added by Jens Axboe in kernel 5.19. TLDR: It's excellent! We used pre-registered buffers and re-allocated file descriptors from a managed pool. Some other parts, like multi-shot requests, also look intriguing, but we couldn't see a flawless way to integrate them into UJRPC. Maybe next time.
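
If you haven't played with it, the pre-registered buffer flow in liburing looks roughly like this. A simplified sketch, not our actual event loop; descriptor-pool management and error handling are elided:

  #include <liburing.h>
  #include <sys/uio.h>
  #include <cstdlib>

  int main() {
      struct io_uring ring;
      io_uring_queue_init(256, &ring, 0);

      // Register a fixed buffer pool once, so the kernel can skip
      // per-request page pinning and address translation.
      struct iovec bufs[16];
      for (auto &b : bufs) { b.iov_base = std::malloc(4096); b.iov_len = 4096; }
      io_uring_register_buffers(&ring, bufs, 16);

      int conn_fd = 0; // stands in for an accepted connection

      // Queue a read into pre-registered buffer #0; works on sockets too.
      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      io_uring_prep_read_fixed(sqe, conn_fd, bufs[0].iov_base, 4096, 0, 0);
      io_uring_submit(&ring);

      struct io_uring_cqe *cqe;
      io_uring_wait_cqe(&ring, &cqe); // cqe->res = bytes read, or -errno
      io_uring_cqe_seen(&ring, cqe);
      io_uring_queue_exit(&ring);
  }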

Like a parent with two kids, we tell everyone we love Kernel Bypass and SIMD equally. So I decided to combine the two, potentially implementing one of the fastest implementations of the most straightforward RPC protocol - JSON-RPC. ~~Healthy and Fun~~ Efficient and Simple, what can be better?

By now, you may already guess at least one of the dependencies - `simdjson` by Daniel Lemire, which has become the industry standard. io_uring is generally very fast, even with a single core. Adding more polling threads may only increase congestion. We needed to continue using no more than one thread, but parsing messages may involve more work than just invoking a JSON parser.

JSON-RPC is transport-agnostic. The incoming requests can be sent over HTTP, prepended with rows of headers. Those would have to be POSTs and generally contain Content-Length and Content-Type. There is a SIMD-accelerated library for that as well: `picohttpparser`, which uses SSE and is maintained by the authors of H2O.
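
A sketch of that parsing step with `phr_parse_request` (error handling is elided, and the helper name is ours, not picohttpparser's):

  #include <cstdlib>
  #include <strings.h>
  #include "picohttpparser.h"

  // Returns the offset where the body starts, or 0 on a parse error.
  size_t parse_post_head(const char *buf, size_t len, size_t *content_length) {
      const char *method, *path;
      size_t method_len, path_len, num_headers = 16;
      int minor_version;
      struct phr_header headers[16];
      int used = phr_parse_request(buf, len, &method, &method_len,
                                   &path, &path_len, &minor_version,
                                   headers, &num_headers, 0);
      if (used < 0) return 0; // -1: malformed, -2: need more bytes
      *content_length = 0;
      for (size_t i = 0; i != num_headers; ++i)
          if (headers[i].name_len == 14 &&
              strncasecmp(headers[i].name, "Content-Length", 14) == 0)
              *content_length = strtoul(headers[i].value, nullptr, 10);
      return (size_t)used; // the JSON body begins here
  }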

The story doesn't end there. JSON is limited. Passing binary strings is a nightmare. The most common approach is to encode them in base64. So we took Turbo-Base64 from the powturbo project to decode those binary strings.
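
Decoding then takes a couple of calls. A sketch assuming the `tb64dec` entry point from the Turbo-Base64 README; treat the exact signature and header name as assumptions:

  #include <string>
  #include "turbob64.h" // assumed header name from the Turbo-Base64 repo

  std::string decode_binary_param(const unsigned char *b64, size_t n) {
      std::string out(n / 4 * 3, '\0'); // upper bound on the decoded size
      // assumed: size_t tb64dec(const unsigned char*, size_t, unsigned char*)
      size_t written = tb64dec(b64, n, (unsigned char *)out.data());
      out.resize(written); // 0 signals malformed input
      return out;
  }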

The core implementation of UJRPC is under 2000 lines of C++. Knowing that those lines connect 3 great libraries with the newest and coolest parts of Linux is enough to put a smile on my face. Most people are more rational, so here is another reason to be cheerful.

- FastAPI throughput: 3'184 rps.
- Python gRPC throughput: 9'849 rps.
- UJRPC throughput:
  - Python server with io_uring: 43'000 rps.
  - C server with POSIX: 79'000 rps.
  - C server with io_uring: 231'000 rps.

Granted, this is not yet your batteries-included server. It can't balance the load, manage threads, spell the S in HTTPS, or call your parents when you misbehave in school. Then again, some of that you shouldn't expect from a web server in the first place.

After following the standardization process of executors in C++ for the last N+1 years, we adopted the "bring your runtime" and "bring your thread-pool" policies. HTTPS support, however, is our next primary objective.

---

Of course, it is a pre-production project and must have a lot of bugs. Don't hesitate to report them. We have huge plans for this tiny package and will potentially make it the default transport of UKV: https://github.com/unum-cloud/ukv




On a tangent, does anyone have code samples of parsing large JSON newline files with simdjson that don't fit into memory? I.e. where the file overall is many GB (more than 4) but individual documents in the top-level are not that big. Most of the code samples for simdjson assume you can load the whole file in memory.

I've got an idea for how you can do it with parse_many(json,window) and truncated_bytes() but it would be easier if there were just an example out there I could look at.

See pages like [0] and [1], which describe that it should be possible, but I am just not seeing (or yet able to produce myself) working code.

[0] https://github.com/simdjson/simdjson/issues/188

[1] https://github.com/simdjson/simdjson/blob/master/doc/iterate...


There is a simple way and there is a hard way.

Simple - memory-map the whole file with `mmap`. We also suggest using `madvise` to inform the kernel that you will be accessing the data just once, strictly in sequential order. Then simply give it to simdjson. With the `ondemand` parser, your memory usage will be proportional to the depth of the deepest document, not the number of documents.
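
A sketch of that simple way, assuming the On Demand `iterate_many` API. One real caveat: simdjson wants SIMDJSON_PADDING readable bytes past the end, which a bare mmap only gives you for free when the file size isn't an exact multiple of the page size:

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include "simdjson.h"

  int main() {
      int fd = open("huge.ndjson", O_RDONLY);
      struct stat st;
      fstat(fd, &st);
      char *data = (char *)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      madvise(data, st.st_size, MADV_SEQUENTIAL); // one pass, in order

      simdjson::ondemand::parser parser;
      simdjson::ondemand::document_stream docs;
      // batch size must exceed the largest single document:
      parser.iterate_many(data, st.st_size, 1 << 20).get(docs);
      for (auto doc : docs) { /* use doc; memory stays O(tree depth) */ }

      munmap(data, st.st_size);
      close(fd);
  }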

The harder way would be to parse the file in chunks, locating the contiguous sequences that form a single document.
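
That is essentially your `parse_many` + `truncated_bytes()` idea. A hedged sketch with the DOM API: parse the complete documents in each chunk, then carry the truncated tail into the next read:

  #include <fstream>
  #include <string>
  #include "simdjson.h"

  int main() {
      simdjson::dom::parser parser;
      std::ifstream file("huge.ndjson", std::ios::binary);
      std::string carry;                     // incomplete tail of the last chunk
      constexpr size_t chunk_size = 1 << 26; // 64 MB reads
      while (file) {
          std::string buf = carry;
          size_t old_size = buf.size();
          buf.resize(old_size + chunk_size);
          file.read(&buf[old_size], chunk_size);
          buf.resize(old_size + file.gcount());
          if (buf.empty()) break;

          simdjson::padded_string padded(buf); // copies; fine for a sketch
          simdjson::dom::document_stream docs;
          // documents larger than the default 1 MB batch need a batch_size argument
          if (parser.parse_many(padded).get(docs)) break; // non-zero = error
          for (auto doc : docs) { /* use doc */ }
          carry.assign(buf.end() - docs.truncated_bytes(), buf.end());
      }
  }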

There may be more options, but I'd have to check the docs :)


Regarding the hard way, this little utility does a great job of splitting larger than memory JSON documents into collections of NDJSON files:

https://github.com/dolthub/jsplit


Thanks! Yeah, I've seen both pieces of general advice, but since I'm not incredibly familiar with the API, I've had a hard time producing code that works.

If you have any code examples of either of those options I'd love to see them!

Edit: Also, I'm curious: what are the downsides to doing the mmap route?


Downsides - unpredictable latency for IO when the parser crosses a page boundary and encounters an unbacked page. Unpredictable IO issue size (might be smaller IOs than ideal). Unpredictable caching behavior (OS may drop the pages before the application is done with them, only to have to reread again).

You can mitigate this somewhat with explicit MAP_POPULATE and mlock.
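
Something like this, for instance (note mlock needs RLIMIT_MEMLOCK headroom for multi-GB files):

  #include <sys/mman.h>

  // Pre-fault the pages at map time and pin them, so the OS neither
  // stalls mid-parse nor drops pages it will have to reread.
  void *map_pinned(int fd, size_t len) {
      void *p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
      if (p != MAP_FAILED) mlock(p, len);
      return p;
  }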


Also, generally higher overhead, more memory use for very large files (the page tables are not free), and no way of handling errors.


> more memory use for very large files (the page tables are not free),

They're... kinda free (in terms of capacity). A PTE or PDE is 64 bits per page. Even with 4kB pages, this is 8/4096 = 0.2% overhead. Churning the page tables and TLB as you walk the file is expensive, or at least historically was expensive (used to be that removing an entry required flushing the entire TLB; I don't think that's true anymore).

If you can use MAP_HUGE_2MB, this drops off to 0.00038% and the TLB impact goes way down.

> no way of handling errors.

Yeah, this aspect is really significant. Thanks for mentioning it.


0.2% really isn't free. If you have a data set of 10 TB (one small-ish HDD right now, or one large SSD), that's ~20 GiB in page tables alone. If you can use 2MB pages, then sure, but I would assume that also limits your page-in granularity.


Oh, you're right. I wasn't remembering we needed PTEs to cover the data not paged in as well as the data in RAM. Sorry.


Well, here's an idea. Do you HAVE to have the file as a big json file? How did it end up like that?

When you source the data, can you instead output the overall doc as "part" files that are aligned along subdocument boundaries?

Then you don't have to solve the "find the subdocument terminator" problem on stream-back.

You are PROBABLY building the json doc via some DB dump, so just change the code that dumps the data.

If you need to reconstruct it as a single json doc for some reason, it is about as trivial as it gets to reconstruct the data as a single document.

Or, programmatically, a "file" object that just does the reconstruction on the fly from the source part files as it is read; that should be relatively trivial in lots of languages.


At my last job, we had to deal with this. We had huge json files (compressed size of 20+ GB) and had to perform certain operations on the data. I found that rapidjson had a better (not perfect, just better) API for this. I quickly wrote a little wrapper around it to process json data coming in as streams. This was a writeup about it: https://dinesh.cloud/2022/streaming-json-for-fun-and-profit/ .
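
The shape of it, for anyone curious: rapidjson's SAX Reader pulls fixed-size chunks through a FileReadStream and fires events into a handler, so memory stays O(buffer) rather than O(file). A minimal sketch with a stub handler that just counts objects:

  #include <cstdio>
  #include "rapidjson/reader.h"
  #include "rapidjson/filereadstream.h"

  struct CountObjects
      : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountObjects> {
      size_t count = 0;
      bool StartObject() { ++count; return true; } // other events use defaults
  };

  int main() {
      FILE *fp = std::fopen("huge.json", "rb");
      char buffer[65536];
      rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
      CountObjects handler;
      rapidjson::Reader reader;
      reader.Parse(is, handler);
      std::printf("%zu objects\n", handler.count);
      std::fclose(fp);
  }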

If there is any interest, I can ask if they can open source it.


Absolutely interested, on my end at least. I wrote this to manage the transparency in coverage files: https://github.com/dolthub/data-analysis/tree/main/transpare... but I'm always looking for better techniques.

Edit: Oh wow, I see you used it on those exact files. How about that.


Ha! Thanks to you, today I found out how big those uncompressed JSON files really are (the data wasn't accessible to me, so I shared the tool with my colleague and he was the one who ran the queries on his laptop): https://www.dolthub.com/blog/2022-09-02-a-trillion-prices/ .

And yep, it was more or less the way you did it with ijson. I found ijson just a day after I finished the prototype. Rapidjson would probably be faster, especially after enabling SIMD. But the indexing was a one-time thing.

We have open sourced the codebase. Here's the link: https://github.com/multiversal-ventures/json-buffet . Since this was a quick and dirty prototype, comments were sparse. I have updated the Readme, and added a sample json-fetcher. Hope this is more useful for you.

Another unwritten TODO was to nudge the data providers towards more streaming-friendly compression formats, and then just create an index to fetch the data directly from their compressed archives. That would have saved everyone a LOT of $$$.


I used the rapidjson streams with my little embedded REST HTTP(s) server library: https://github.com/Edgio/is2/

We needed it for streaming large json from async server sockets.

Code link: https://github.com/Edgio/is2/blob/master/include/is2/support...

You just had to implement the interfaces like Peek/Take/Tell/etc. It worked really well for us.
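
For anyone wondering what that interface amounts to, rapidjson's input-stream concept is just a handful of members; a socket-backed stream implements them over its receive buffer. A sketch over a plain memory buffer:

  #include <cstddef>

  struct BufferReadStream {
      typedef char Ch;
      const char *src; size_t len; size_t pos = 0;
      Ch Peek() const { return pos < len ? src[pos] : '\0'; }
      Ch Take() { return pos < len ? src[pos++] : '\0'; }
      size_t Tell() const { return pos; }
      // Output-side members are required by the concept but unused on reads:
      Ch *PutBegin() { return nullptr; }
      void Put(Ch) {}
      void Flush() {}
      size_t PutEnd(Ch *) { return 0; }
  };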

Probably not as fast as simdjson, but I think they use some SIMD tricks for skipping whitespace:

https://rapidjson.org/md_doc_internals.html#SkipwhitespaceWi...


Just to be clear, are you talking about a file where each line is its own json object rather than the entire file being one large array?

If so, it’s easy in Python to do the following:

  import json

  with open("file.json") as src:
      for line in src:
          json.loads(line)


I'm talking about with simdjson. Lemire suggested reading line-by-line is not a good idea [0]. So I'm asking about the ideal approach using simdjson, not JSON parsers in general.

[0] https://github.com/simdjson/simdjson/issues/188#issuecomment...


We had to parse thousands of multi-GB zipped jsons for financial data. I don't have any code, but it involved (in C++) boost gzip to unpack chunks into 100MB blocks and simdjson's iterate_many function to parse each block as a stream. The json files were not line-delimited, but every row had a newline (json: [\n{...},\n{...}\n....\n] ), so we had to clean out the commas for simdjson to process it as if it were newline-delimited.

Whatever remained at the end of the buffer (a few hundred bytes of incomplete json) we would copy to just before the boost-unzip block, so that the json data continued seamlessly. We also reused the parser and string buffer for performance.

Also try to parse the json fields in the right order for fastest performance, if possible.


> Also try to parse the json fields in the right order for fastest performance, if possible.

What do you mean by this precisely? I'm not sure I understand. What is the right order?

AFAIK, JSON attributes/keys are not ordered, so there is no "right" order, or order at all.


Agreed on jsons not being ordered.

The simdjson docs mention that if you parse the fields in the order they appear in the file, simdjson can do it in one iteration. If you get fields in random order, simdjson will have to loop over the data a few times.

If your json on disk is

  { "field1": "val1", "field2": "val2" }
then it is faster to say (pseudo C++)

  simdjson::document_reference elem;
  std::string_view tmp;
 
  elem["field1"].get_string().get(tmp);
  elem["field2"]...
than the other way around. This works only if you know what you are looking for and it is consistent on disk. Note that you are reading from an iterator that consumes the value: getting a field twice gives an error; totally different from reading a dict.

See also [1] under "Extracting Values: You can cast a JSON element to a native type..."

[1] https://github.com/simdjson/simdjson/blob/master/doc/basics....


Ah yes, that makes sense. Thanks a lot for explaining!


Though I have read a bit about simdjson I'm not super familiar with how it works, so please forgive me if this is a silly question.

Since simdjson is where a lot of the performance benefits come from here, would you need to compile the underlying C/C++ code on the machine that you're deploying on to make sure that simdjson is using the correct instruction set? Like, what if the processor I'm compiling on has AVX-512 support, but the target machine for deployment doesn't? Does it generate machine code for all instruction sets and automatically choose the optimal instructions at runtime? Or does it just have to compile every time you pip install the package? I could see this being awkward if you're building a docker container, perhaps, but maybe I'm just woefully uninformed.


This isn't a silly question at all, and it's quite complicated.

When compiling C/C++, some compilers let you specify multiple architectures and produce a fat binary. It's been a few years since I've worked on this sort of thing, but the Intel compiler, for example, used to allow you to do something like this at the compilation stage:

-march=avx,avx2,avx512

And at runtime, your processor would use the most recent instruction set. In practice, you don't always want to do this, since you produce fat binaries - i.e. every function call that's got vector instructions will have multiple versions, and the file size goes up, library is slower to load, etc... I can't remember if there was an overhead too to each function call. So instead of doing this, what is also common is to compile N versions of the library (one per instruction set), and then load the appropriate version at runtime. This you can do with any compiler, and is indeed what Intel do themselves with the MKL library - if you look into the package for it, you can see that you end up with multiple shared libraries, each with a suffix saying which instruction set it supports (e.g. libmkl_avx.so, libmkl_avx2.so, libmkl_avx512.so). I'm not familiar with simdjson so not sure which approach it takes.
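
There's also a per-function middle ground on GCC/Clang: function multi-versioning, where the compiler emits one clone per listed target plus a resolver that picks the best one at load time, so the dispatch lives inside a single library. An illustrative sketch:

  #include <cstddef>

  __attribute__((target_clones("avx512f", "avx2", "default")))
  double dot(const double *a, const double *b, size_t n) {
      double s = 0.0;
      for (size_t i = 0; i != n; ++i)
          s += a[i] * b[i]; // auto-vectorized differently in each clone
      return s;
  }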

Interestingly, if you use ARM processors, the SVE instruction set is designed so that you don't need to re-compile if a new processor family comes along with longer vector processing units.


Thanks, this is super helpful. I didn't know about that feature of the SVE instructions. I was curious and it looks like RISC-V takes it a step further even and is "Vector Length Agnostic"[1]. Pretty cool!

[1]: https://gms.tf/riscv-vector.html


The README mentions this:

> Selects a CPU-tailored parser at runtime. No configuration needed.

https://github.com/simdjson/simdjson


Thank you, I missed that when I first looked at the README. Very cool if anyone else wants to take a look: https://github.com/simdjson/simdjson/blob/master/doc/impleme...


It looks like the vast majority of the benefit (1200us -> 85us) comes from using WebSockets instead of REST (HTTP and maybe other protocol overhead). 85us beats gRPC in their tests, so it's likely adequate for many applications.


I'm not sure if FastAPI is the best competition for this library, as it focuses on building REST APIs rather than JSON RPCs, but anyway, great work!


What is the difference between RPC and API besides the URL & payload structure?


It's just semantics. "managing remote resources" or "calling remote procedures" is all just network calls, usually http.


Semantics are 90% of programming though. Learning syntax is simple, learning semantics is hard. Semantics structure how you think about problems and express solutions.


That's rather handwavy.

The question is about a performance benchmark. From a performance standpoint, rpc and rest are the same (http requests that execute code on the server).

The "semantics" you assign to one or the other don't change the technical details about implementation and performance.


But in this case there is almost no difference. Basically same payloads with a slightly different "command" format.


It's the same payloads with a slightly different command format if you structure your program in exactly the same way. The point is that you don't structure your program in exactly the same way in each case.


Semantics is 90% of life!


> it's all just data

> it's all just computers

> we're all just going to die anyway


REST is based around resources and kind of the polar opposite of RPC when it comes to writing APIs.


REST is literally RPC. You are making a remote procedure call to a URL. GET this resource. DELETE this resource. It's the same thing.

What is the difference between remote.perform('some-action', { payload }) and remote.POST('some-action', payload) ?


There is no difference: neither of those are REST.

Calling a URL with an action in the path name (rather than a resource name) is technically not done in REST -- the action is expressed through your HTTP verb.

Using GET/POST/DELETE directly does not mean you're doing REST - there's a whole set of rules and assumptions that come with it.


Is it though? Or is the difference mostly semantics? At the end of the day, it's just barely different ways of issuing a "command".


Thank you! You are right, FastAPI doesn't promise performance. But on the other hand, it is super usable. We wanted to show that libraries can be both usable and fast, hence the name of the post :)


They say ‘high performance’ on their homepage: https://fastapi.tiangolo.com/


you would not be crazy to think the "Fast" refers to performance


A common pitfall. FastAPI is orders of magnitude slower than Java or .NET. A simple expressjs API outperforms it, especially with concurrent users.


100x faster than FastAPI seems easy. I wonder how it compares to other fast Python libraries like Japronto[1] and non-Python ones too.

1 - https://github.com/squeaky-pl/japronto


It's hard to beat simdjson + io_uring. Other implementations would need a port of the former (or something equivalent or faster) and would also have to rely on io_uring, with epoll a viable competitor only in a limited set of scenarios. I would also expect Python to be a bottleneck here, with compiled languages (besides C/C++) that have good interop and the ability to write vectorized code, like Rust, C#, or Go, having the upper hand.


You are right! For the convenience of Python users, we have to introspect the messages and parse JSON into Python objects. Every member of every dictionary is allocated on the heap.

To make it as fast as possible, we don't use PyBind, NanoBind, SWIG, or any high-level tooling. Our Python bindings are a pure CPython integration. There is just no way to beat that combo, not that I know of.

https://github.com/unum-cloud/ujrpc/blob/main/src/python.c
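
For readers who haven't seen raw CPython bindings, they boil down to hand-written method tables like this toy miniature (illustrative only, not our python.c):

  #include <Python.h>

  static PyObject *sum_two(PyObject *, PyObject *args) {
      long a, b;
      if (!PyArg_ParseTuple(args, "ll", &a, &b)) return nullptr;
      return PyLong_FromLong(a + b); // every result is still a heap object
  }

  static PyMethodDef methods[] = {
      {"sum", sum_two, METH_VARARGS, "Add two integers."},
      {nullptr, nullptr, 0, nullptr}};

  static PyModuleDef module_def = {
      PyModuleDef_HEAD_INIT, "fast_sum", nullptr, -1, methods};

  PyMODINIT_FUNC PyInit_fast_sum(void) { return PyModule_Create(&module_def); }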


The difference would still be huge. It's exceptionally large even when we compare to the gRPC C++ server implementations.


How large? Also I'm not sure the gRPC C++ server implementations you've tested are the fastest. If you're comparing to FastAPI (which is more of an HTTP server framework) then you should also compare to what is at the top of https://www.techempower.com/benchmarks/#section=data-r21.


The net performance of your system is only as fast as the slowest part.

So, you optimized your C IO loop and it's really fast when there is not much Python... but as soon as you add any Python, you'll be bottlenecked in Python and all that speed won't matter, e.g. if you do some ORM SQL or any other thing that's more complex than `return data`.


It's also interesting to compare energy usage. Intel has RAPL which allows you to measure the joules of a certain program or processor. We should be focusing not just on performance but also on environmental runtime impact.


Energy efficiency is mostly about finishing faster, so the CPU can go to low power state earlier.


I was just thinking how lousy a data format JSON is for tabular data. Would sending it down in CSV improve things at all?


Yes, we also constantly think about that! In the document collections of UKV, for example, we have interoperability between JSON, BSON, and MessagePack objects [1]. CSV is another potential option, but text-based formats aren't ideal for large scale transmissions.

One thing people do is use two protocols. That is the case with Apache Arrow Flight RPC: gRPC for tasks, Arrow for data. It is a viable path, but compiling gRPC is a nightmare, and we don't want to integrate it into our other libraries, as we generally compile everything from source. Seemingly, UJRPC can replace gRPC, and for the payload we can continue using Arrow. We will see :)

[1]: https://github.com/unum-cloud/ukv/blob/main/src/modality_doc...


Are there any synergies with capnproto [1] or is the focus here purely on huge payloads?

I'm just an interested hobbyist when it comes to performant RPC frameworks but had some fun benchmarking capnproto for a small gamedev-related project and it was pretty awesome.

[1] https://capnproto.org/


If anything, you'd probably want to send it in Arrow[1] format. CSVs don't even preserve data types.

[1]: https://arrow.apache.org/


arrow/feather is really the best format these days for tabular data transmission.

anyone who disagrees I’d be very interested to hear your thoughts on alternatives.


What about compression - is this part of arrow itself?


It's not part of Arrow, but Arrow is columnar so just a basic LZ4/ZSTD will work pretty well.


Arrow looks super complicated.

Are data types useful for data to/from web/mobile clients? Encode the type into the column header?


Data types are absolutely helpful. When you know a column stores Float64 data, you don't have to write the float out in base 10 and parse it back. You just dump the bytes.


Or parquet, for compression?


Arrow is meant as the “in-memory” dual to Parquet, which is meant as the “on disk serialisation format”.

Many parquet supporting libs will load Parquet files into an Arrow structure in memory for example.


No, parsing CSV is also pretty slow. You want some sort of length prefixed and ideally fixed width column format.


Parsing CSV doesn't have to be slow if you use something like xsv or zsv (https://github.com/liquidaty/zsv) (disclaimer: I'm an author). The speed of CSV parsers is fast enough that unless you are doing something ultra-trivial such as "count rows", your bottleneck will be elsewhere.

The benefits of CSV are:

- human readable

- does not need to be typed (sometimes, data in the raw such as date-formatted data is not amenable to typing without introducing a pre-processing layer that gets you further from the original data)

- accessible to anyone: you don't need to be a data person to dbl-click and open in Excel or similar

The main drawback is that if your data is already typed, CSV does not communicate what the type is. You can alleviate this through various approaches such as is described at https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql..., though I wouldn't disagree that if you can be assured that your starting data conforms to non-text data types, there are probably better formats than CSV.

The main benefit of Arrow, IMHO, is less as a format for transmitting / communicating and more as a format for data at rest that would benefit from higher-performance column-based reads and compression.


If a CSV has quoting (e.g. because the data contains comma or quote chars) aren't you effectively forced to parse it in a single thread?

See also: 'Why isn’t there a decent file format for tabular data?' https://news.ycombinator.com/item?id=31220841


Good point. Though, if we are talking about something coming down a network pipe, then that network connection will be serialized anyway, and during the parsing process the data can be sharded or converted to another format or indexed or whatnot. I would still say that a situation where anything non-trivial gets bottlenecked by the CSV parsing remains exceptionally rare. If you are reading the entire file, then the difference between starting, say, 4 threads directly at positions 0/25/50/75 versus a single CSV reader that dispatches chunks of rows to 4 threads (or whatever N instead of 4) is probably nil.

It is true there will be exceptions, such as if you know you want to read only the second half of the file. In that case, CSV with quoting does not give you a direct way to find that halfway point without parsing the first half.

I suppose whether this is worth the other pros/cons will be situation-dependent. For my use cases, which are daily, CSV parsing speed, when using something like xsv or zsv, has just, by itself, never been a material concern/impact on performance.

Where I think the CSV parsing downside is much greater than the fact that it must be serial (which, as described above, does not prevent parallelized processing) is in type conversion, not just of numbers but in particular of dates: it can be expensive to convert the text "March 6, 2023" to a date variable. However, if you have control over the format, you could just as easily have printed that as an integer such as 44991, reducing the problem to one of integer conversion. That is still always going to be slower than a binary format, but isn't so bad performance-wise.


If you start threads at positions 0/25/50/75 inside a CSV, how do you know if the characters at 25, 50 & 75 are inside or outside quoted data values? You could start at a carriage return, but that could also be inside quoting.


Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format. But what I am saying is that, if you could do that, then your performance difference will be negligible, compared to using a single thread that parses the CSV into rows and passes chunks of rows to 4 separate threads.

In fact, the single-thread parser approach (with multi-thread processing) might even be better, because it is not trying to access your hard disk in 4 places at the same time. Then again, if your threads are doing some non-trivial task with each row, then IO will not be your bottleneck either way.

Obviously this starts to break down if you aren't reading the whole file and want to start some meaningful portion of the way in, never processing what comes before it. The point is that the benefit of being able to implicitly shard a file, without saving it as separate files, might not be as impactful in practice as in theory.


>Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format.

My mistake, I misread your answer!


Slower than JSON?


Author of typedload here!

FastAPI relies on (not so fast) pydantic, which is one of the slowest libraries in that category.

Don't expect to find such benchmarks on the pydantic documentation itself, but the competing libraries will have them.

[0] https://ltworf.github.io/typedload/


pydantic 2.0 should arrive this Spring with lots of Rust enhancements.

https://docs.pydantic.dev/blog/pydantic-v2/


But it's designed by the same people whose .so binary module somehow manages to be slower than a .py script :)

Anyway we will see when it comes.

This week at work I was just appreciating that typedload just works with static type checkers without having to install plugins (and all the type errors of pydantic won't be reported unless you do install the plugin).


How does yyjson[0] compare to simdjson? Their benchmarks suggest it could be a positive.

[0] https://github.com/ibireme/yyjson


In a nutshell, we use both in UKV. simdjson is faster for read-only operations, but it won't help you create a new JSON. yyjson is the best library I have seen for creating/updating JSONs.


If you're primarily targeting Python as an application layer, you may also want to check out my msgspec library[1]. All the perf benefits of e.g. yyjson, but with schema validation like pydantic. It regularly benchmarks[2] as the fastest JSON library for Python. Much of the overhead of decoding JSON -> Python comes from the python layer, and msgspec employs every trick I know to minimize that overhead. </sales pitch>

[1]: https://github.com/jcrist/msgspec

[2]: https://github.com/TkTech/json_benchmark


https://www.unum.cloud/ujrpc

Is this supposed to be just a blank page, which is what I see?


Oh, yes, sorry. Corrected it on the GitHub page. A new Doxygen documentation portal for all of our open-source libraries is in the works. Can't wait to share it :)


There are also some other blank pages like https://www.unum.cloud/ukv/ and https://www.unum.cloud/ukv/details


Yes, same story there :)


If you just want to experience the topic of the original post, go to https://github.com/unum-cloud/ujrpc.


Haha, 6 ms to add two integers: that's about 1000x slower than on my trusty old PDP-11. You can trust software people to undo half a century of hardware engineers' efforts :)


That was an end-to-end measurement of the RPC service, so you need to measure how long it took the card-reader to read the instructions and print out the result on the line printer ;-)


I'm curious if json + simdjson will outperform protobuf.

I'm picking a protocol for my project. I was looking at protobuf, but this post made me think otherwise.


Yes, even though protobuf is a binary format, it is slower to parse than JSON with simdjson. Counterintuitive, but that is the power of hardware acceleration :)


Is it also the case for floating point data?

I know about fast_float (which is by the same author as simdjson); however, my belief is that a binary format is impossible to beat here.


Thank you! Yes, I saw a benchmark somewhere else that also says simdjson is faster than protobuf.


Most protobuf implementations are unfortunately not a good example of high-performance binary serialization code (but still better than the average JSON serialization library! And the Go, C# (which provides protobuf almost out of the box), Java, and Rust implementations appear to be decent).

However, your use case likely doesn't need to process multi-gigabit traffic in JSON requests. Therefore protobuf, or more specifically gRPC, will be just fine.


Speaking of Go, there's a simdjson implementation for golang too:

> Performance wise, simdjson-go runs on average at about 40% to 60% of the speed of simdjson. Compared to Golang's standard package encoding/json, simdjson-go is about 10x faster.

I haven't tried it yet but I don't really need that speed.

https://github.com/minio/simdjson-go


simdjson is awesome (Lemire does a lot of great stuff) and io_uring is a good use of circular buffers to get efficient IO, but I always squirm a little when I see JSON in a high-performance context.



