> Drawing on existing research [3], our preliminary analysis of these programs and configurations suggests that the network stack architecture is somewhat similar to DPDK [4], mainly relying on a user-space C++ program to bypass the kernel for handling network packets.
The way it usually works is that the initial packets of a connection are handled in software, but once the endpoints are established the traffic flows through hardware. Certain traffic patterns are always handled in software. The software could be a patched kernel or an XDP-style kernel bypass.
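A minimal sketch of that slow-path/fast-path split, purely for illustration. The `Gateway` class, the `software_lookup` rule and the dict standing in for a hardware flow table are all hypothetical, not how any particular device does it, and installing the entry on the very first packet is a simplification of "once the endpoints are established":

```python
from typing import Dict, Tuple

# 5-tuple identifying a flow: src IP, dst IP, src port, dst port, protocol
FlowKey = Tuple[str, str, int, int, str]


class Gateway:
    """Toy model: a dict stands in for the hardware flow table (assumption)."""

    def __init__(self) -> None:
        self.hw_flow_table: Dict[FlowKey, str] = {}
        self.slow_path_hits = 0
        self.fast_path_hits = 0

    def software_lookup(self, key: FlowKey) -> str:
        # Stand-in for the real routing/NAT/firewall decision (hypothetical rule).
        return "lan0" if key[1].startswith("192.168.") else "wan0"

    def handle(self, key: FlowKey, always_software: bool = False) -> Tuple[str, str]:
        # Fast path: the flow is already offloaded, software never sees the packet.
        if not always_software and key in self.hw_flow_table:
            self.fast_path_hits += 1
            return ("fast", self.hw_flow_table[key])
        # Slow path: software decides, then offloads the flow unless this
        # traffic pattern must always stay in software.
        self.slow_path_hits += 1
        port = self.software_lookup(key)
        if not always_software:
            self.hw_flow_table[key] = port
        return ("slow", port)


gw = Gateway()
flow = ("192.168.1.10", "93.184.216.34", 40000, 443, "tcp")
first = gw.handle(flow)    # goes through software and installs the flow
second = gw.handle(flow)   # served from the (simulated) hardware table
ping = gw.handle(("192.168.1.10", "8.8.8.8", 0, 0, "icmp"), always_software=True)
```

Here only the first packet of `flow` costs CPU time; everything after it is a table hit, which is why per-packet software cost matters mostly for new flows and for the patterns that never get offloaded.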
Source: worked peripherally on an Intel Puma cable modem router/gateway that used DPDK or something like it. So I'm not 100% sure, but it is an educated guess.
Why would it be any less efficient than processing the packets in the kernel? There's a way to map the hardware queues into userspace (the article talks about the system being DPDK-like). At that point why does it matter that the polling code isn't in the kernel?
Most hardware above 100Mbps has hardware offload, i.e. the hardware is told which packets to send where, and software doesn't touch individual packets (except for rare ones like ping).
In networking, it is the norm to measure performance in packets per second, and therefore with small packets. Unless you're performing DPI or encryption, routers only use the headers to make routing decisions, so whether the payload is 10 bytes or 1000 bytes does not matter: the processing cost will be identical. Only the hardware bandwidth matters for large packets, though this is rarely the issue (I've hit DDR4 limits once using XDP, and fixed it by adding another stick of memory).
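To put numbers on why small packets are the worst case, here is a rough sketch that computes packets per second at line rate for a few frame sizes. It assumes Ethernet, with the standard 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function name is just for illustration:

```python
# Per-frame wire overhead on Ethernet (assumption: the link is Ethernet):
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second when the link is saturated with frames of one size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


for frame in (64, 512, 1518):
    pps = line_rate_pps(1e9, frame)
    print(f"{frame:>5} B frames at 1 Gbps: {pps:>12,.0f} pps")
```

Since the per-packet processing cost is the same regardless of payload, a link full of minimum-size frames asks for roughly 18 times as many routing decisions per second as the same link full of 1518-byte frames.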
If one is doing 1Gbps of traffic made up of 100 byte UDP packets, that's about 1.25 million packets per second you're gonna need to process.
A 1GHz CPU then only gets around 800 cycles to process each one...
Very doable, but certainly not easy unless your engineers like hand coding assembly and having to think about every lookup table trick in the book...
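Working that budget through exactly, as a back-of-the-envelope sketch. It takes the 100 bytes as the full on-wire size and ignores framing overhead, as the comment does; the function names are made up for the example:

```python
def packets_per_second(link_bps: float, packet_bytes: int) -> float:
    """Packet rate needed to fill the link with packets of the given size."""
    return link_bps / (packet_bytes * 8)


def cycles_per_packet(cpu_hz: float, pps: float) -> float:
    """CPU cycles available per packet on a single core running flat out."""
    return cpu_hz / pps


pps = packets_per_second(1e9, 100)    # 1 Gbps of 100-byte packets
budget = cycles_per_packet(1e9, pps)  # one 1 GHz core
print(f"{pps:,.0f} pps, {budget:.0f} cycles per packet")
```

That budget has to cover everything done per packet (parsing, lookup, rewriting headers, queueing), and a single cache miss to DRAM can eat a noticeable slice of it, which is where the lookup table tricks come in.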