The timeout argument does not work as intended. The timeout is
checked only after the receipt of each datagram, so that if up to
vlen-1 datagrams are received before the timeout expires, but then no
further datagrams are received, the call will block forever.
That makes it useless for any application that wants to service data within a short time frame. The only way around it is to use a "self clocking" method: if you want to receive packets at least every 10ms, set a 10ms timeout... and then be sure to send yourself a packet every 10ms.
I've done similar tests with UDP applications. It's possible to get 500K pps on a multi-core system with a test application that isn't too complex, or uses too many tricks. The problem is that the system spends 80% to 90% of its time in the kernel doing IO. So you have no time left to run your application.
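For concreteness, a rough sketch (not the article's code) of the recvmmsg() batching and timeout being discussed; the batch size and buffers are arbitrary, and in practice you'd pair it with the self-clocking trick above:

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define VLEN  64
#define BUFSZ 2048

int receive_batch(int fd)
{
    static char bufs[VLEN][BUFSZ];
    struct iovec iovecs[VLEN];
    struct mmsghdr msgs[VLEN];
    /* Only consulted after a datagram arrives -- hence the bug above. */
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iovecs[i].iov_base = bufs[i];
        iovecs[i].iov_len  = BUFSZ;
        msgs[i].msg_hdr.msg_iov    = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* Up to VLEN datagrams delivered in one syscall. */
    int n = recvmmsg(fd, msgs, VLEN, 0, &timeout);
    if (n < 0) {
        perror("recvmmsg");
        return -1;
    }
    for (int i = 0; i < n; i++)
        printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
    return n;
}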
Tunnel encapsulation is real work, but not all real work can be mapped to tunnel encapsulation.
The point of all that work to context switch into processes to handle small amounts of network I/O is that very often THE CORRECT SOFTWARE ARCHITECTURE is for multiple address-space-separated processes to be doing small amounts of network I/O. That I/O "means something" to a larger data model being implemented by the software.
It's true that for some tasks that "look like routing" there's no point to having that kind of external data model. The packets are the data being operated on. So there's little value in process separation and you might as well DMA them all streamwise into a single process to do it. And that's great stuff, but AFAICT it's really not what the linked article is about.
Ultimately, all those packets are going to end up in conventional processes, because that's where conventional processing needs to happen. There are very good reasons why we like our page-protected address space separation in this world!
Netmap is commonly used in HFT as well as packet filtering applications. I believe Verisign is running some of the root DNS servers with netmap as well, getting millions of connections per second.
Barely. On a reasonably configured kernel (you need both syscall auditing and context tracking turned off, which is doable at compile time or runtime), a modern CPU should be able to round-trip a syscall in under 40 ns. That only eats 4% CPU at 1M syscalls per second.
(It's slightly worse than that due to extra cache and TLB pressure, but I doubt that matters in this workload.)
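A rough way to sanity-check that ~40 ns figure on your own hardware (results vary a lot with kernel config, e.g. syscall auditing and context tracking, so treat this as a sketch):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 10 * 1000 * 1000;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);   /* raw syscall; avoids glibc's cached getpid() */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns per syscall round trip\n", ns / iters);
    return 0;
}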
There is a project "Kerlnel" http://kerlnel.org/ that is supposed to be Erlang instance running on bare metal. Site appears to be down and the Github https://github.com/kerlnel hasn't been touched since 2013. Then there is http://erlangonxen.org/ which puts Erlang on top of Xen instead of another operating system.
Yes, given the Solarflare card I was expecting the article to end up with a receiver coded against ef_vi, which exposes the NIC memory directly (but you have to do the IP/UDP handling yourself).
> Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"
> They both have two six core 2GHz Xeon processors. With hyperthreading (HT) enabled that counts to 24 processors on each box.
24 * 50,000 = 1,200,000
> we had shown that it is technically possible to receive 1Mpps on a Linux machine
gnn has done a lot of research on *nix networking and the conclusion is "single sockets are always faster than multiple sockets". There's a huge performance hit when trying to process packets as fast as possible and you throw NUMA into the mix. Remote memory is accessed more slowly, and pinning work to specific CPUs is non-trivial.
Right. Though the point of the post was to dispel the fallacy that the Linux kernel can only handle 50k pps per core. Using RDMA effectively bypasses the kernel. He's also testing with a Solarflare card, which doesn't support DPDK, though it does support RDMA with OpenOnload. What I've found is that RDMA is the "easy" part to get right, as it's fairly simple. Not every network card is created equal with respect to how many packets it can actually pass through to the kernel, however, partly due to the kernel driver (whether it's using NAPI or not, driver efficiency, MSI-X support, interrupt coalescing) and partly due to the card itself (onboard buffer, latency characteristics, etc.). 10G cards max out at around 14 Mpps @ 60 bytes when the kernel is involved and everything is perfectly tuned, which should be where the card he's using falls. A generic onboard Intel card generally maxes out around 8-10 Mpps. But both would most likely be able to hit 16 Mpps @ 60 bytes if using RDMA in any form.
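(As a sanity check on the ~14 Mpps figure: the theoretical ceiling for minimum-size frames on 10GbE works out as below, assuming the usual 20 bytes of preamble and inter-frame gap per 64-byte frame.)

#include <stdio.h>

int main(void)
{
    double link_bps  = 10e9;            /* 10GbE */
    double wire_bits = (64 + 20) * 8;   /* 64-byte frame + preamble/IFG = 672 bits */
    printf("theoretical max: %.2f Mpps\n", link_bps / wire_bits / 1e6);
    /* prints: theoretical max: 14.88 Mpps */
    return 0;
}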
InfiniBand was the way to go 5-10 years ago, when 10G Ethernet wasn't there yet.
Nowadays, most companies that invested in IB years ago are stuck with a dead infrastructure. It costs a lot, there is very little knowledge of the technology around, and support for it is being dropped (e.g. GlusterFS).
The sad truth is that most of these old IB infrastructures are now used for IPoIB...
Source: been working in HFT firms implementing IB RDMA, then GBEth RDMA, now proprietary NIC RDMA
Q: Why is there such a strong focus on trying to get Linux network performance when (I think) everyone agrees BSD is better at networking? What does Linux offer beyond the network that BSD doesn't when it comes to applications that demand the fastest networks?
ps. I think the markdown filter is broken, I can't make a literal asterisk with a backslash. Anyone know how HN lets you make an inline asterisk?
Tools. Compatibility. Resemblance to systems already in place (i.e., back when BSD failed to get SMP support "soon", there wasn't really an option if your workload was CPU bound). Numbers (it's popular in HPC/HFT because it's popular).
We're not consuming the built-in network stack anyway; we're using the OS as a content delivery system. Get us something we can get to the cores on and we're going to pin our applications directly to those cores, keeping the kernel relegated to scheduling tasks on whichever NUMA node is farther away from the PCIe bus running into the CPU. We'll have the CPUs we're using pegged in a constant spin loop anyway, which would make the scheduler think really, really hard about running tasks there. We don't use realtime kernels, as it's better for us to pay the price on the occasional outlier spike than to raise the baseline latency.
Due to my own unfamiliarity, I don't know what BSD's equivalent to isolcpus is. I don't know how to taskset on a BSD. I don't know whether the InfiniBand/Ethernet controller's firmware/bypass software works there. I don't know how BSD's scheduler works (not that we usually care, but there are times when one can't avoid work needing to be scheduled: things like RPC calls to shut down an app, or ssh if you can spare the clock cycles for key verification, etc.).
Would dtrace come in handy? Most definitely. Is that enough for us to abandon what we know works? Not yet.
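For reference, the core pinning described above boils down to a single affinity call on Linux; a minimal sketch (the core number is an arbitrary example, and you'd typically pair it with isolcpus on the kernel command line):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core; this is roughly what
 * `taskset -c <core>` does for a whole process. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0 /* 0 = calling thread */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}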
However, we're talking about very specific needs: super high performance networking. If you have that specific of a need, wouldn't you want something unfamiliar if it solves the problem best?
If it's truly better, and the only difference is removing Linux and installing BSD, then what is BSD doing that is better/different/messed up such that packets can flow faster in BSD than in Linux?
Talking about unfamiliarity and specific needs: FPGAs are much better suited than CPUs for processing minimum-sized frames at wirespeed. They can still forward all unhandled frames to a CPU. Yes, it's a lot of development effort compared to a CPU-only solution, but considering all the kernel-optimizing-multicore-cleverness from OP I would say we are approaching the break-even point.
Who, in 2015, agrees that BSD is better at networking?
I remember these claims being made in the late 90s, and perhaps they were true back then, but it's been 15 years, and I would be surprised if Linux hasn't caught up by virtue of its faster development pace, greater mindshare, and increased corporate/datacenter usage.
So, in all seriousness: what recent, well argued essays/papers can you refer me to so I can understand the claim that BSD networking is still better than Linux in 2015?
One thing - Netmap, which can give you speeds like ~10 Mpps. It was first developed for FreeBSD, and there was a proposal for Linux to adopt the code, but for some reason they haven't. Since then, Linux has been trying to catch up, but it's not like FreeBSD/Netmap is standing still either.
No. Netmap is available for Linux too, and there are other options like DPDK available on both. But that's not the FreeBSD network stack; that's an Ethernet stack. The kernel IP stack apparently scales better in FreeBSD, but I have not seen recent hard data. Netflix do stream all their content from FreeBSD, though.
Some of the very same reasons you give for the proposition that network performance has caught up can be used to argue that it may have slowed down (i.e. feature creep and bloat). So the other question is: who, in 2015, disagrees, and what recent, well argued essays/papers can you refer us to that might demonstrate that anything has changed at all?
Time to swap epoll for kqueue, and make this performance debate go away for both Linux and FreeBSD once and for all. No reason for this pissing contest.
Registered I/O on Windows is about three to four decades ahead, conceptually.
(As in, the stuff that facilitates registered I/O is based on concepts that can be traced back to VMS, released in 1977. Namely, the Irp, plus, a kernel designed around waitable events, not runnable processes.)
Reading things like http://blog.erratasec.com/2013/02/custom-stack-it-goes-to-11... suggests that it's probably simply the case that the OS itself is sort of a side-issue once you need performance since you're going to be bypassing the normal network stack anyway.
At that point ease of management or package installation probably matters more to developers and it might simply come down to things like driver support and other stability/performance issues where Linux has gotten a LOT of highly-specialized attention from hardware vendors and the HPC world. Back when 10G hardware was just starting to enter the market, we bought cards which came with a Linux driver but it took awhile before that was ported to FreeBSD and longer still before the latency had been optimized as much.
Why is the BSD network stack superior, anyway? I hear it repeated a lot, and I'm just surprised Linux lags with all the attention it gets. Is it something related to kqueue vs epoll?
The easy answer probably has to do with comfort and familiarity: no one wants to be in the situation where their software package isn't ported to BSD, or the BSD port is lagging several versions behind, or they hit a BSD-specific bug and have to track it down with the developers. Again, not saying these are necessarily realistic concerns, but they're the things that come to mind first.
But thanks to the ports tree and the Porter's Handbook, it's easier to solve this type of problem on FreeBSD than on Linux.
And it's unlikely that a Linux distro will have a newer version of something important unless a new distro release just came out and picked up the latest version of a particular piece of software to standardize its package on for the lifetime of that release.
Either way -- FreeBSD will continue (or be capable of) getting updates to track upstream and your Linux distro will only be backporting security fixes.
> But thanks to the ports tree and the Porter's Handbook, it's easier to solve this type of problem on FreeBSD than on Linux.
>
> Either way -- FreeBSD will continue (or be capable of) getting updates to track upstream and your Linux distro will only be backporting security fixes.
In most cases what actually happens is that people use the main distribution repo for the 99% of packages which are stable and when you need something newer you add external apt/yum/etc. sources for those specific projects.
In the Ubuntu world, there's a huge ecosystem supporting this style where you upload source packages and they'll build and host the binary packages for you:
This approach gives you the speed, reliability and security benefits of binary packages with the currency of ports and, more importantly, allows you to opt-in only where you specifically know you need new features.
And how can you trust those third party repos? That's the hard part. At least FreeBSD's way guarantees your packages are built from source code that matches the checksum of what upstream has released. If it doesn't match because of a compromise or upstream re-rolled their tarball it is discovered very, very quickly.
All packages are signed using GPG, and the source package definitions include hashes of all of the dependencies, so the only question is whether you trust a particular developer, and you are required to add a GPG key before adding a repo. (The only thing which makes the distribution's repository special is that the distribution signing key ships as trusted in the base install.)
In other cases, you have to decide whether you trust a particular developer. If not, you can choose to create your own version – which could be as simple as taking a source package, auditing it to whatever level you want, and signing it with your own trusted key.
Look, I ran servers using OpenBSD for years in the 90s and FreeBSD in the early 2000s. I respect the work which has gone into the ports system but the reason to use it is not security and advocacy based on limited understanding will not accomplish anything useful. If you want to praise ports, talk about how much easier it makes it to have the latest version of everything installed — and do your homework to be ready to explain how that's meaningfully better than e.g. a Debian user tracking the testing or unstable repositories.
It is meaningfully better. Try tracking the debian testing or unstable policies when the software you're trying to update is built against a newer version of a system library. Good luck fixing that. Now you have to update everything that also relies on that library. It's a rabbit hole that's not fun to go down. Even if they're GPG signed -- what does that prove? They could have modified the source when they built the package. At least with the ports tree you can easily verify the checksum of the source you're building with matches what upstream is releasing.
Furthermore, third party repos not by upstream or trusted OS developers is a nightmare. I regularly spend time trying to find trustworthy 3rd party repos to get newer versions of developer tools into RHEL 5/6. Sometimes it's just random RPMs on an FTP or rpmfind.net style sites. Not trustworthy at all. And sometimes I can't even build the package myself because the tool refuses to build because the rest of the OS is too old.
The way you said "Debian testing or unstable policies" suggests that you aren't very familiar with Debian (they are distributions or repositories, not policies). If the package you're updating is in testing or unstable, then its dependencies will also be updated as necessary. If other packages with the same dependencies won't work with the updated dependencies, then they will be updated as well, automatically. That is not always necessary; it depends on the library in question, whether the ABI changed, etc.
It sounds like you don't know how the Debian packaging system works regarding security. If you are installing from Debian repos (as opposed to third-party repos), then all binary packages go through the ftpmasters. The packages are checksummed, and the checksums are GPG-signed. Each package's maintainer or team handles building the binary from source. Of course, you can also download the source package yourself with a simple command, and then build it yourself. But if you don't trust the Debian maintainers to verify the integrity of the source packages they build, then you shouldn't be using Debian at all. This is no different than using a BSD. The ports tree could also be compromised.
Third-party repos are always a risk. That's one of the nice things about PPAs: their maintainers can use the same security mechanisms that the regular distro repos use, but with their own GPG keys. Again, if you don't trust the maintainer, I guess you should be building everything yourself. LFS gets old though, right?
And that's another good thing about Debian: almost anything you could want is already in Debian proper, so resorting to third-party repos or building manually is rarely necessary.
For long-term use, you can use testing or unstable or both, which are effectively rolling releases. There is also the backports repo for stable. And if you need to build a package yourself, between the Debian tools and checkinstall, it's not hard.
Debian and Ubuntu are the only distros I use, and for good reason. They solved most of these problems a long, long time ago. Compared to Windows or other Linux distros, it seems more like heaven than hell.
By the way, I'm no expert on BSDs--but do they even have any cryptographic signatures in the system at all, or is it just package checksums? Checksums by themselves don't prove anything; you need a way to sign the checksums to verify they haven't been altered. Relying on unsigned checksums is akin to security theater.
I have no idea. How many UDP packets can FreeBSD pass to an application?
I don't think there is any particular problem with Linux. The core problem of "slow" performance is mostly due to limitations of the BSD sockets API (recvmmsg is a good example of a workaround) and to the feature richness of Linux.
Also, doing the math: 2GHz / 350k pps = 5714 cycles per packet. 6k cycles to deliver the packet from NIC to one core, then pass it between cores and copy over to application. That ain't bad really.
"Why is there such a strong focus on trying to get Linux network performance when (I think) everyone agrees BSD is better at networking?"
Setting aside the fact that a typical fanboi comment is at the top of HN post, are you seriously contending that since one OS does a thing there's no need for other OS's to do the same thing?
Replace "Linux" with "Windows", and "BSD" with "Linux".
Obviously it provides nothing extra, and certainly nothing better, when it comes to a server role. Other than support contracts. And support for 3rd-party apps. And hardware drivers/compatibility. And users that know how to operate it. And brand name recognition. And size of software development community, libraries, tools. And industry reliance on non-portable software (Docker).
Answer to your question: technical superiority is dwarfed by a more popular product.
Are you saying the Linux network stack is superior to contemporary Windows?
If so, thanks for the aneurysm. Windows is decades ahead, thanks to a fundamentally more performant kernel and driver architecture, the key parts of which have been present since NT. (And VMS, if you want to get picky.)
> And industry reliance on non-portable software (Docker).
Docker is the new shiny kid on the block, but the industry is far from reliant on it. It's also a bit misleading to call the software non-portable in a Linux v FreeBSD debate - it's not that the software is non-portable, it's that it uses a feature that is not available in FreeBSD. In a contest about bragging rights, that's significant.
Define "better". The very few times in my life that networking performance mattered, I found Linux (2.2+) to be hard to beat as a general purpose stack.
If you need something specific with better performance than that, you should probably look at moving more of the network stack to your application layer.
Linux supports more scheduling/QoS algorithms. Most of the recent (~10 years?) academic papers on packet queuing algorithms implement their algo in Linux.
Perhaps because I am not really a low-level programmer, it strikes me as odd that "receive packet" is a call. I would expect to pass a function pointer to the driver and be called with the packet address every time one arrives.
For your own sanity, you want some control over which thread is processing a packet, and at what point it does so. The easiest way to do this is to explicitly notify the kernel when you are in fact interested in another packet, and whether you wish your thread of control to be suspended until such a packet arrives if there is no next packet.
The JavaScript model of "call some callback when the currently-executing function returns" doesn't work, because C is not event-based. The embedded model of "interrupt whatever's happening and invoke some handler on the current thread of control" is just an absolute nightmare to deal with.
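As a minimal sketch of that "ask the kernel for the next packet" model (fd is assumed to be a UDP socket set up elsewhere):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <stdio.h>

void poll_loop(int fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    char buf[2048];
    for (;;) {
        struct epoll_event ready;
        /* The thread chooses to block here until a packet is available;
         * nothing interrupts it with a callback. */
        if (epoll_wait(ep, &ready, 1, -1) < 1)
            continue;
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("got %zd bytes\n", n);
    }
}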
"The JavaScript model of "call some callback when the currently-executing function returns" doesn't work,"
I can't prove it, but at this scale, I am guessing that the cost of constructing call frames to call into the callback will start to matter, too. Switching into an already-existing context is probably significantly cheaper. (Well, it's definitely significantly cheaper to switch into an existing C context than construct a Javascript call, but, that's sort of cheating. Or a low blow. Or something like that.)
If you're talking about kernel context switches, the JS has them too, and then has more instructions to set up the JS call stack to boot. Setting up a million JS call stacks is not necessarily trivial. A million here, a million there, soon you're talking real wall-clock time.
> The JavaScript model of "call some callback when the currently-executing function returns" doesn't work, because C is not event-based.
This is misleading. Whether or not a C program is event-based is not defined by the language; the language itself has the necessary features. On the flip side, Javascript is not necessarily confined to event-based operation - this is dependent on the engine it's running in.
That's how I/O completion routines have worked on NT since its debut. (And, more recently, threadpool I/O callbacks since Vista, and RIO dequeue since Windows 8.)
Windows has a distinct advantage thanks to design decisions made in VMS by Cutler et al; namely, I/O via I/O request packets, and much better memory management.
POSIX systems are at a disadvantage because of the readiness-oriented nature of their I/O, leaving the caller responsible for "checking back" when something can be done, versus "here, do this, and tell me when it's done". (https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...)
; Preparing to listen the incoming connection (passive socket)
; int listen(int sockfd, int backlog);
; listen(sockfd, 0);
mov eax, 102 ; syscall 102 - socketcall
mov ebx, 4 ; socketcall type (sys_listen 4)
; listen arguments
push 0 ; backlog (connections queue size)
push edx ; socket fd
mov ecx, esp ; ptr to argument array
int 0x80 ; trap into the kernel (this, together with the MOVs above, is the equivalent of the "receive packet" call)
We in userland are the ones responsible for invoking that interrupt in another thread and deciding what that thread should call when it's done. The low-level stuff is only concerned with system calls, not managing them for us.
Why even have netfilter ("iptables") loaded in the kernel at all? Won't those two rules still have to be evaluated for each packet even if the rules are saying not to do anything?
There are additional things at play here, too, including what the NIC driver's strategy for interrupt generation is and how interrupts are balanced across the available cores, whether there are cores dedicated to interrupt handling and otherwise isolated from the I/O scheduler, various sysctl settings, etc.
There's further gains here if you want to get really into it.
Actually the Linux kernel is rather fast if used right.
The MikroTik CCR1036 series routers have a 36-core Tile CPU with each core running at 1.2 GHz; they can cram out 15 million pps. https://www.youtube.com/watch?v=UNwxAjJ4V4A
Where does the funny 50 kpps per core idea in the lead-in come from? This would mean falling far short of 1 GigE line rate with 1500-byte packets! That is trivially disproven by the everyday experience of anyone who's run scp over their home LAN or a crossover cable.
The lead-in wasn't talking about UDP. And the article is specifically showing you how to use the multiple-packets-per-call APIs (sendmmsg et al.), hitting 350 kpps with simple single-threaded code and UDP, then 1.4 Mpps with tuning and parallelism.
And even if you were using the slow API, even with ping:
[7 year old desktop]$ sudo ping -q -f -c 100000 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
--- localhost.localdomain ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 954ms
rtt min/avg/max/mdev = 0.004/0.004/0.195/0.002 ms, ipg/ewma 0.009/0.005 ms
Google around for some iperf or netperf results that people are generally getting with Linux.
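And to make the batched-send API mentioned above concrete, a minimal sendmmsg() sketch (fd is assumed to be a connected UDP socket; the payload and batch size are placeholders):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>

#define BATCH 32

int send_batch(int fd)
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    char payload[] = "x";

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload;
        iov[i].iov_len  = sizeof(payload);
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One syscall for the whole burst instead of BATCH sendto() calls. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0)
        perror("sendmmsg");
    return sent;
}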
http://man7.org/linux/man-pages/man2/recvmmsg.2.html
Another alternative is pcap and PF_RING, as seen here: https://github.com/robertdavidgraham/robdns
That might be useful. Previous discussion on robdns: https://news.ycombinator.com/item?id=8802425