I discovered something similar in the early to mid 2010s: processes on different Linux systems communicated over TCP faster than processes on the same Linux system. That is, going over the network was faster than on the same machine. The reason was simple: there is a global lock per system for the localhost pseudo-device.
Processes communicating on different systems had actual parallelism, because they each had their own network device. Processes communicating on the same system were essentially serialized, because they were competing with each other for the same pseudo-device. At the time, Linux kernel developers basically said "Yeah, don't do that" when people brought up the performance problem.
That makes sense; Linux has many highly performant IPC options that don't involve the network device. Just the time it takes to correctly construct an ethernet frame is not negligible.
OpenOnload supports user-space acceleration of pipes. For our custom applications, we initiate connections over unaccelerated UNIX sockets and then "upgrade" them to accelerated pipes. It's like setting up pairs of shared-memory queues, but all operated via POSIX calls including the ability to watch them with epoll.
This is a fairly old example of this technique using boost::asio [1]. We adapted it from the one in our own reactor library in order to Onload-pipe-accelerate some OSS that used ASIO.
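The "upgrade" step can be sketched generically with fd passing: start on a plain UNIX-domain socket, then hand pipe ends across it with SCM_RIGHTS so both sides switch to the faster channel while still using ordinary POSIX calls. This is only an illustration of the pattern, not the Onload API; the names and message here are made up.

```python
import os, socket, selectors

# Hypothetical sketch of the fd-passing "upgrade": a pipe per direction is
# created, and the peer's ends are passed over the existing UNIX socket.
ctrl_a, ctrl_b = socket.socketpair(socket.AF_UNIX)  # unaccelerated control channel

r_ab, w_ab = os.pipe()   # A -> B direction
r_ba, w_ba = os.pipe()   # B -> A direction

# "Upgrade": side A passes B's ends (read of A->B, write of B->A) as fds.
socket.send_fds(ctrl_a, [b"upgrade"], [r_ab, w_ba])
msg, fds, flags, addr = socket.recv_fds(ctrl_b, 16, 2)
b_read, b_write = fds

# Side B can still watch its pipe read end -- epoll-backed on Linux.
sel = selectors.DefaultSelector()
sel.register(b_read, selectors.EVENT_READ)

os.write(w_ab, b"hello over the pipe")   # A writes on the fast path
ready = sel.select(timeout=1.0)          # B's readiness notification
data = os.read(b_read, 64)
```

With an accelerated pipe implementation underneath, the same epoll-driven event loop keeps working unchanged, which is the appeal of operating everything via POSIX calls.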
> Just the time it takes to correctly construct an ethernet frame is not negligible.
For that reason, Onload/TCPDirect (and other vendors too) have APIs that allow you to pre-allocate and prepare packets (headers and payload), then you can do some final tweaks -- like choosing a ticker and price in an order packet -- and then send it out with minimal latency.
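The pattern is easy to show with plain sockets (this is not the Onload/TCPDirect API; the wire layout and field names are invented for illustration): fill the buffer once at setup time, and on the hot path patch only the volatile fields before sending.

```python
import socket, struct

# A receiver stands in for the exchange side so the sketch is self-checking.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))

ORDER_FMT = "<8sqI"  # hypothetical layout: ticker[8], price (int64), qty (uint32)
buf = bytearray(struct.calcsize(ORDER_FMT))
struct.pack_into(ORDER_FMT, buf, 0, b"\0" * 8, 0, 100)  # pre-fill static parts

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())

def send_order(ticker: bytes, price: int) -> None:
    # Hot path: patch only the two volatile fields, then send the prepared buffer.
    struct.pack_into("<8sq", buf, 0, ticker.ljust(8, b"\0"), price)
    tx.send(buf)

send_order(b"ACME", 10150)
ticker, price, qty = struct.unpack(ORDER_FMT, rx.recv(64))
```

The vendor APIs go further by preparing the packet on the NIC itself, so the final tweak-and-send touches as little as possible, but the setup-time/hot-path split is the same idea.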
In fact, if I recall correctly, it doesn't. I remember writing a tool that analyzed pcap dumps and initially assuming there would be Ethernet frames at the bottom, only to find that this wasn't true on lo, which had its own framing, which I assume can vary a bit by OS.
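One way to check is that a pcap file's 24-byte global header ends with the link-layer type, so you can tell what framing sits at the bottom before parsing any packets: 1 is LINKTYPE_ETHERNET, while 0 is LINKTYPE_NULL, the BSD-style loopback encapsulation you may see for captures on lo0. A minimal (hypothetical) helper:

```python
import struct

def pcap_linktype(header: bytes) -> int:
    """Return the link-layer type from a classic pcap global header."""
    magic = header[:4]
    # The magic number reveals the writer's byte order.
    endian = "<" if magic == b"\xd4\xc3\xb2\xa1" else ">"
    # Fields: magic, version major/minor, thiszone, sigfigs, snaplen, network
    return struct.unpack(endian + "IHHiIII", header[:24])[6]

# A minimal little-endian header claiming Ethernet framing:
hdr = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
linktype = pcap_linktype(hdr)
```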
There are many applications which are network transparent and would need extra code to know how to optimize when the other process happens to be on the same machine (e.g., using Sys V IPC or Unix domain sockets). There are good reasons to optimize the case of localhost.
I am always bemused when people imagine that communication over localhost involves packets hitting an ethernet adapter.
It has been many years since I tested this, but Unix sockets seemed slower to me than even loopback TCP.
If you need real speed you should use UDP multicast, or one of the commercial middlewares with "shared memory" transports that I have used a lot of: messaging-paradigm APIs wrapping various configurable transports to present a fairly unified interface to the application's messaging concepts.
>It has been many years since I tested this, but Unix sockets seemed slower to me than even loopback TCP.
They're essentially the same... except they're not and there's a boatload of work in the middle that doesn't need to be done for unix domain sockets.
I can imagine some edge-cases however, where differences in stacks and configuration (in/out buffers vs. shared buffers, buffer sizes, packet loss vs. reliable, locking, blocking vs. non-blocking writes) may lead to more optimal scheduling under concurrent workloads. As soon as you have more than one thread doing something, there's a boatload of moving parts to consider.
But if you have everything properly tuned, unix domain sockets should win every time. If your results said something else, you probably weren't benchmarking what you thought you were benchmarking.
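A quick way to sanity-check such a result is to run the same echo round-trip over both families with only the socket family changing. The sketch below assumes this rough methodology; a real benchmark needs warmup, CPU pinning, tuned buffer sizes, and far more iterations.

```python
import os, socket, tempfile, threading, time

def _echo(srv):
    # Accept one client and echo everything back until it disconnects.
    conn, _ = srv.accept()
    with conn:
        while (chunk := conn.recv(4096)):
            conn.sendall(chunk)

def round_trip_time(family, addr, n=1000):
    srv = socket.socket(family, socket.SOCK_STREAM)
    srv.bind(addr)
    srv.listen(1)
    threading.Thread(target=_echo, args=(srv,), daemon=True).start()
    cli = socket.socket(family, socket.SOCK_STREAM)
    cli.connect(srv.getsockname())
    payload = b"x" * 64
    start = time.perf_counter()
    for _ in range(n):
        cli.sendall(payload)
        cli.recv(4096)
    elapsed = time.perf_counter() - start
    cli.close()
    srv.close()
    return elapsed

t_unix = round_trip_time(socket.AF_UNIX,
                         os.path.join(tempfile.mkdtemp(), "bench.sock"))
t_tcp = round_trip_time(socket.AF_INET, ("127.0.0.1", 0))
```

Whichever side wins on a given box, comparing identical code paths at least rules out measuring two different things.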
Shared memory of some kind will be fastest, though it sometimes requires re-architecting the application. Unfortunately, I can't find the Linux kernel mailing list response any more, but the kernel devs were not sympathetic to the use case of processes communicating over TCP on the same host.
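A toy sketch of why shared memory wins: once the region is mapped, the data path involves no copies and no syscalls at all. This uses Python's wrapper over POSIX shared memory and the fork start method (so Linux/Unix only); a real design layers a ring buffer plus some signaling mechanism (eventfd, semaphore, futex) on top.

```python
import multiprocessing
from multiprocessing import shared_memory

def consumer(name):
    # Attach to the existing region, read the producer's byte, leave a reply.
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[1] = shm.buf[0] + 1
    shm.close()

shm = shared_memory.SharedMemory(create=True, size=64)
shm.buf[0] = 41                    # producer writes directly into the region
ctx = multiprocessing.get_context("fork")
p = ctx.Process(target=consumer, args=(shm.name,))
p.start()
p.join()
reply = shm.buf[1]                 # no copies, no syscalls on the data path
shm.close()
shm.unlink()
```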