The timeout argument does not work as intended. The timeout is
checked only after the receipt of each datagram, so that if up to
vlen-1 datagrams are received before the timeout expires, but then no
further datagrams are received, the call will block forever.
That makes it useless for any application that wants to service data within a short time frame. The only way around it is to use a "self clocking" method: if you want to receive packets at least every 10ms, set a 10ms timeout... and then be sure to send yourself a packet every 10ms.
I've done similar tests with UDP applications. It's possible to get 500K pps on a multi-core system with a test application that isn't too complex, or uses too many tricks. The problem is that the system spends 80% to 90% of its time in the kernel doing IO. So you have no time left to run your application.
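For concreteness, a rough sketch (not the article's code) of the recvmmsg() batching and timeout being discussed; the batch size and buffers are arbitrary, and in practice you'd pair it with the self-clocking trick above:

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define VLEN  64
#define BUFSZ 2048

int receive_batch(int fd)
{
    static char bufs[VLEN][BUFSZ];
    struct iovec iovecs[VLEN];
    struct mmsghdr msgs[VLEN];
    /* Only consulted after a datagram arrives -- hence the bug above. */
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iovecs[i].iov_base = bufs[i];
        iovecs[i].iov_len  = BUFSZ;
        msgs[i].msg_hdr.msg_iov    = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* Up to VLEN datagrams delivered in one syscall. */
    int n = recvmmsg(fd, msgs, VLEN, 0, &timeout);
    if (n < 0) {
        perror("recvmmsg");
        return -1;
    }
    for (int i = 0; i < n; i++)
        printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
    return n;
}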
Tunnel encapsulation is real work, but not all real work can be mapped to tunnel encapsulation.
The point of all that work to context switch into processes to handle small amounts of network I/O is that very often THE CORRECT SOFTWARE ARCHITECTURE is for multiple address-space-separated processes to be doing small amounts of network I/O. That I/O "means something" to a larger data model being implemented by the software.
It's true that for some tasks that "look like routing" there's no point to having that kind of external data model. The packets are the data being operated on. So there's little value in process separation and you might as well DMA them all streamwise into a single process to do it. And that's great stuff, but AFAICT it's really not what the linked article is about.
Ultimately, all those packets are going to end up in conventional processes, because that's where conventional processing needs to happen. There are very good reasons why we like our page-protected address space separation in this world!
Netmap is commonly used in HFT as well as packet filtering applications. I believe Verisign is running some of the root DNS servers with netmap as well, getting millions of connections per second.
Barely. On a reasonably configured kernel (you need both syscall auditing and context tracking turned off, which is doable at compile time or runtime), a modern CPU should be able to round-trip a syscall in under 40 ns. That only eats 4% CPU at 1M syscalls per second.
(It's slightly worse than that due to extra cache and TLB pressure, but I doubt that matters in this workload.)
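A rough way to sanity-check that ~40 ns figure on your own hardware (results vary a lot with kernel config, e.g. syscall auditing and context tracking, so treat this as a sketch):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 10 * 1000 * 1000;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);   /* raw syscall; avoids glibc's cached getpid() */
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%.1f ns per syscall round trip\n", ns / iters);
    return 0;
}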
There is a project "Kerlnel" http://kerlnel.org/ that is supposed to be Erlang instance running on bare metal. Site appears to be down and the Github https://github.com/kerlnel hasn't been touched since 2013. Then there is http://erlangonxen.org/ which puts Erlang on top of Xen instead of another operating system.
Yes, given the Solarflare card I was expecting the article to end up with a receiver coded against ef_vi, which exposes the NIC memory directly (but you have to do the IP/UDP handling yourself).
> Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"
> They both have two six core 2GHz Xeon processors. With hyperthreading (HT) enabled that counts to 24 processors on each box.
24 * 50,000 = 1,200,000
> we had shown that it is technically possible to receive 1Mpps on a Linux machine
gnn has done a lot of research on *nix networking and the conclusion is "single sockets are always faster than multiple sockets". There's a huge performance hit when trying to process packets as fast as possible and you throw NUMA into the mix. Remote memory is accessed more slowly, and pinning work to specific CPUs is non-trivial.
Right. Though the point of the post was to dispel the fallacy that the Linux kernel can only handle 50k pps per core. Using RDMA effectively bypasses the kernel. He's also testing with a Solarflare card, which doesn't support DPDK, though it does support RDMA with OpenOnload. What I've found is that RDMA is the "easy" part to get right, as it's fairly simple. Not every network card is created equal with respect to how many packets it can actually pass through to the kernel, however, partly due to the kernel driver (whether it's using NAPI or not, driver efficiency, MSI-X support, interrupt coalescing) and partly due to the card itself (onboard buffer, latency characteristics, etc.). 10G cards max out at around 14 Mpps @ 60 bytes when the kernel is involved and everything is perfectly tuned, which should be where the card he's using falls. A generic onboard Intel card generally maxes out around 8-10 Mpps. But both would most likely be able to hit 16 Mpps @ 60 bytes if using RDMA in any form.
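(As a sanity check on the ~14 Mpps figure: the theoretical ceiling for minimum-size frames on 10GbE works out as below, assuming the usual 20 bytes of preamble and inter-frame gap per 64-byte frame.)

#include <stdio.h>

int main(void)
{
    double link_bps  = 10e9;            /* 10GbE */
    double wire_bits = (64 + 20) * 8;   /* 64-byte frame + preamble/IFG = 672 bits */
    printf("theoretical max: %.2f Mpps\n", link_bps / wire_bits / 1e6);
    /* prints: theoretical max: 14.88 Mpps */
    return 0;
}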
InfiniBand was the way to go 5-10 years ago, when 10G Ethernet wasn't there yet.
Nowadays, most companies that invested in IB years ago are stuck with a dead infrastructure. It costs a lot, there is very little knowledge of the technology around, and support for it is being dropped (e.g. GlusterFS).
The sad truth is that most of these old IB infrastructures are now used for IPoIB...
Source: been working in HFT firms implementing IB RDMA, then GBEth RDMA, now proprietary NIC RDMA
Q: Why is there such a strong focus on trying to get Linux network performance when (I think) everyone agrees BSD is better at networking? What does Linux offer beyond the network that BSD doesn't when it comes to applications that demand the fastest networks?
ps. I think the markdown filter is broken, I can't make a literal asterisk with a backslash. Anyone know how HN lets you make an inline asterisk?
Tools. Compatibility. Resemblance to systems already in place (i.e., back when BSD failed to get SMP support "soon", there wasn't really an option if your workload was CPU bound). Numbers (it's popular in HPC/HFT because it's popular).
We're not consuming the built-in network stack anyway; we're using the OS as a content delivery system. Get us something we can get to the cores on and we're going to pin our applications directly to those cores, keeping the kernel relegated to scheduling tasks on whichever NUMA node is farther away from the PCIe bus running into the CPU. We'll have the CPUs we're using pegged in a constant spin loop anyway, which would make the scheduler think really, really hard about running tasks there. We don't use realtime kernels, as it's better for us to pay the price on the occasional outlier spike than to raise the baseline latency.
Due to my own unfamiliarity, I don't know what BSD's equivalent to isolcpus is. I don't know how to taskset on a BSD. I don't know whether the InfiniBand/Ethernet controller's firmware/bypass software works there. I don't know how BSD's scheduler works (not that we usually care, but there are times when one can't avoid work needing to be scheduled: things like RPC calls to shut down an app, or ssh if you can spare the clock cycles for key verification, etc.).
Would dtrace come in handy? Most definitely. Is that enough for us to abandon what we know works? Not yet.
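For reference, the core pinning described above boils down to a single affinity call on Linux; a minimal sketch (the core number is an arbitrary example, and you'd typically pair it with isolcpus on the kernel command line):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core; this is roughly what
 * `taskset -c <core>` does for a whole process. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0 /* 0 = calling thread */, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}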
However, we're talking about very specific needs: super high performance networking. If you have that specific of a need, wouldn't you want something unfamiliar if it solves the problem best?
If it's truly better, and the only difference is removing Linux and installing BSD, then what is BSD doing that is better/different/messed up such that packets can flow faster in BSD than in Linux?
Talking about unfamiliarity and specific needs: FPGAs are much better suited than CPUs for processing minimum-sized frames at wirespeed. They can still forward all unhandled frames to a CPU. Yes, it's a lot of development effort compared to a CPU-only solution, but considering all the kernel-optimizing-multicore-cleverness from OP I would say we are approaching the break-even point.
Who, in 2015, agrees that BSD is better at networking?
I remember these claims being made in the late 90s, and perhaps they were true back then, but it's been 15 years, and I would be surprised if Linux hasn't caught up by virtue of its faster development pace, greater mindshare, and increased corporate/datacenter usage.
So, in all seriousness: what recent, well argued essays/papers can you refer me to so I can understand the claim that BSD networking is still better than Linux in 2015?
One thing - Netmap, which can give you speeds like ~10 Mpps. It was first developed for FreeBSD, and there was a proposal for Linux to adopt the code, but for some reason they haven't. Since then, Linux has been trying to catch up, but it's not like FreeBSD/Netmap is standing still either.
No. Netmap is available for Linux too, and there are other options like DPDK available on both. But that's not the FreeBSD network stack; that's an Ethernet stack. The kernel IP stack apparently scales better in FreeBSD, but I have not seen recent hard data. Netflix do stream all their content from FreeBSD, though.
Some of the very same reasons you give for the proposition that network performance has caught up can be used to argue that it may have slowed down (i.e. feature creep and bloat). So the other question is: who, in 2015, disagrees, and what recent, well argued essays/papers can you refer us to that might demonstrate that anything has changed at all?
Time to swap epoll for kqueue, and make this performance debate go away for both Linux and FreeBSD once and for all. No reason for this pissing contest.
Registered I/O on Windows is about three to four decades ahead, conceptually.
(As in, the stuff that facilitates registered I/O is based on concepts that can be traced back to VMS, released in 1977. Namely, the Irp, plus, a kernel designed around waitable events, not runnable processes.)
Reading things like http://blog.erratasec.com/2013/02/custom-stack-it-goes-to-11... suggests that it's probably simply the case that the OS itself is sort of a side-issue once you need performance since you're going to be bypassing the normal network stack anyway.
At that point ease of management or package installation probably matters more to developers and it might simply come down to things like driver support and other stability/performance issues where Linux has gotten a LOT of highly-specialized attention from hardware vendors and the HPC world. Back when 10G hardware was just starting to enter the market, we bought cards which came with a Linux driver but it took awhile before that was ported to FreeBSD and longer still before the latency had been optimized as much.
Why is the BSD network stack superior, anyway? I hear it repeated a lot, and I'm just surprised Linux lags with all the attention it gets. Is it something related to kqueue vs epoll?
The easy answer probably has to do with comfort and familiarity: no one wants to be in the situation where their software package isn't ported to BSD, or the BSD port is lagging several versions behind, or they hit a BSD-specific bug and have to track it down with the developers. Again, not saying these are necessarily realistic concerns, but they're the things that come to mind first.
But thanks to the ports tree and the Porter's Handbook, it's easier to solve this type of problem on FreeBSD than on Linux.
And it's unlikely that a Linux distro will have a newer version of something important unless a new distro release just came out and picked up the latest version of a particular piece of software to standardize its package on for the lifetime of that release.
Either way -- FreeBSD will continue (or be capable of) getting updates to track upstream and your Linux distro will only be backporting security fixes.
> But thanks to the ports tree and the Porter's Handbook, it's easier to solve this type of problem on FreeBSD than on Linux.
>
> Either way -- FreeBSD will continue (or be capable of) getting updates to track upstream and your Linux distro will only be backporting security fixes.
In most cases what actually happens is that people use the main distribution repo for the 99% of packages which are stable and when you need something newer you add external apt/yum/etc. sources for those specific projects.
In the Ubuntu world, there's a huge ecosystem supporting this style where you upload source packages and they'll build and host the binary packages for you:
This approach gives you the speed, reliability and security benefits of binary packages with the currency of ports and, more importantly, allows you to opt-in only where you specifically know you need new features.
And how can you trust those third party repos? That's the hard part. At least FreeBSD's way guarantees your packages are built from source code that matches the checksum of what upstream has released. If it doesn't match because of a compromise or upstream re-rolled their tarball it is discovered very, very quickly.
All packages are signed using GPG, and the source package definitions include hashes of all of the dependencies, so the only question is whether you trust a particular developer, and you are required to add a GPG key before adding a repo. (The only thing which makes the distribution's repository special is that the distribution signing key ships as trusted in the base install.)
In other cases, you have to decide whether you trust a particular developer. If not, you can choose to create your own version – which could be as simple as taking a source package, auditing it to whatever level you want, and signing it with your own trusted key.
Look, I ran servers using OpenBSD for years in the 90s and FreeBSD in the early 2000s. I respect the work which has gone into the ports system but the reason to use it is not security and advocacy based on limited understanding will not accomplish anything useful. If you want to praise ports, talk about how much easier it makes it to have the latest version of everything installed — and do your homework to be ready to explain how that's meaningfully better than e.g. a Debian user tracking the testing or unstable repositories.
It is meaningfully better. Try tracking the debian testing or unstable policies when the software you're trying to update is built against a newer version of a system library. Good luck fixing that. Now you have to update everything that also relies on that library. It's a rabbit hole that's not fun to go down. Even if they're GPG signed -- what does that prove? They could have modified the source when they built the package. At least with the ports tree you can easily verify the checksum of the source you're building with matches what upstream is releasing.
Furthermore, third party repos not by upstream or trusted OS developers is a nightmare. I regularly spend time trying to find trustworthy 3rd party repos to get newer versions of developer tools into RHEL 5/6. Sometimes it's just random RPMs on an FTP or rpmfind.net style sites. Not trustworthy at all. And sometimes I can't even build the package myself because the tool refuses to build because the rest of the OS is too old.
The way you said "Debian testing or unstable policies" suggests that you aren't very familiar with Debian (they are distributions or repositories, not policies). If the package you're updating is in testing or unstable, then its dependencies will also be updated as necessary. If other packages with the same dependencies won't work with the updated dependencies, then they will be updated as well, automatically. That is not always necessary; it depends on the library in question, whether the ABI changed, etc.
It sounds like you don't know how the Debian packaging system works regarding security. If you are installing from Debian repos (as opposed to third-party repos), then all binary packages go through the ftpmasters. The packages are checksummed, and the checksums are GPG-signed. Each package's maintainer or team handles building the binary from source. Of course, you can also download the source package yourself with a simple command, and then build it yourself. But if you don't trust the Debian maintainers to verify the integrity of the source packages they build, then you shouldn't be using Debian at all. This is no different than using a BSD. The ports tree could also be compromised.
Third-party repos are always a risk. That's one of the nice things about PPAs: their maintainers can use the same security mechanisms that the regular distro repos use, but with their own GPG keys. Again, if you don't trust the maintainer, I guess you should be building everything yourself. LFS gets old though, right?
And that's another good thing about Debian: almost anything you could want is already in Debian proper, so resorting to third-party repos or building manually is rarely necessary.
For long-term use, you can use testing or unstable or both, which are effectively rolling releases. There is also the backports repo for stable. And if you need to build a package yourself, between the Debian tools and checkinstall, it's not hard.
Debian and Ubuntu are the only distros I use, and for good reason. They solved most of these problems a long, long time ago. Compared to Windows or other Linux distros, it seems more like heaven than hell.
By the way, I'm no expert on BSDs--but do they even have any cryptographic signatures in the system at all, or is it just package checksums? Checksums by themselves don't prove anything; you need a way to sign the checksums to verify they haven't been altered. Relying on unsigned checksums is akin to security theater.
I have no idea. How many UDP packets can FreeBSD pass to an application?
I don't think there is any particular problem with Linux. The core problem of "slow" performance is mostly due to limitations of the BSD sockets API (recvmmsg is a good example of a workaround) and to the feature richness of Linux.
Also, doing the math: 2GHz / 350k pps = 5714 cycles per packet. 6k cycles to deliver the packet from NIC to one core, then pass it between cores and copy over to application. That ain't bad really.
"Why is there such a strong focus on trying to get Linux network performance when (I think) everyone agrees BSD is better at networking?"
Setting aside the fact that a typical fanboi comment is at the top of HN post, are you seriously contending that since one OS does a thing there's no need for other OS's to do the same thing?
Replace "Linux" with "Windows", and "BSD" with "Linux".
Obviously it provides nothing extra, and certainly nothing better, when it comes to a server role. Other than support contracts. And support for 3rd-party apps. And hardware drivers/compatibility. And users that know how to operate it. And brand name recognition. And size of software development community, libraries, tools. And industry reliance on non-portable software (Docker).
Answer to your question: technical superiority is dwarfed by a more popular product.
Are you saying the Linux network stack is superior to contemporary Windows?
If so, thanks for the aneurysm. Windows is decades ahead, thanks to a fundamentally more performant kernel and driver architecture, the key parts of which have been present since NT. (And VMS, if you want to get picky.)
> And industry reliance on non-portable software (Docker).
Docker is the new shiny kid on the block, but the industry is far from reliant on it. It's also a bit misleading to call the software non-portable in a Linux v FreeBSD debate - it's not that the software is non-portable, it's that it uses a feature that is not available in FreeBSD. In a contest about bragging rights, that's significant.
Define "better". The very few times in my life that networking performance mattered, I found Linux (2.2+) to be hard to beat as a general purpose stack.
If you need something specific with better performance than that, you should probably look at moving more of the network stack to your application layer.
Linux supports more scheduling/QoS algorithms. Most of the recent (~10 years?) academic papers on packet queuing algorithms implement their algo in Linux.
Perhaps because I am not really a low-level programmer, it strikes me as odd that "receive packet" is a call. I would expect to pass a function pointer to the driver and be called with the packet address every time one arrives.
For your own sanity, you want some control over which thread is processing a packet, and at what point it does so. The easiest way to do this is to explicitly notify the kernel when you are in fact interested in another packet, and whether you wish your thread of control to be suspended until such a packet arrives if there is no next packet.
The JavaScript model of "call some callback when the currently-executing function returns" doesn't work, because C is not event-based. The embedded model of "interrupt whatever's happening and invoke some handler on the current thread of control" is just an absolute nightmare to deal with.
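As a minimal sketch of that "ask the kernel for the next packet" model (fd is assumed to be a UDP socket set up elsewhere):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <stdio.h>

void poll_loop(int fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    char buf[2048];
    for (;;) {
        struct epoll_event ready;
        /* The thread chooses to block here until a packet is available;
         * nothing interrupts it with a callback. */
        if (epoll_wait(ep, &ready, 1, -1) < 1)
            continue;
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("got %zd bytes\n", n);
    }
}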
"The JavaScript model of "call some callback when the currently-executing function returns" doesn't work,"
I can't prove it, but at this scale, I am guessing that the cost of constructing call frames to call into the callback will start to matter, too. Switching into an already-existing context is probably significantly cheaper. (Well, it's definitely significantly cheaper to switch into an existing C context than construct a Javascript call, but, that's sort of cheating. Or a low blow. Or something like that.)
If you're talking about kernel context switches, the JS has them too, and then has more instructions to set up the JS call stack to boot. Setting up a million JS call stacks is not necessarily trivial. A million here, a million there, soon you're talking real wall-clock time.
> The JavaScript model of "call some callback when the currently-executing function returns" doesn't work, because C is not event-based.
This is misleading. Whether or not a C program is event-based is not defined by the language; the language itself has the necessary features. On the flip side, Javascript is not necessarily confined to event-based operation - this is dependent on the engine it's running in.
That's how I/O completion routines have worked on NT since its debut. (And, more recently, threadpool I/O callbacks since Vista, and RIO dequeue since Windows 8.)
Windows has a distinct advantage thanks to design decisions made in VMS by Cutler et al; namely, I/O via I/O request packets, and much better memory management.
POSIX systems are at a disadvantage because of the readiness-oriented nature of their I/O, leaving the caller responsible for "checking back" when something can be done, versus "here, do this, and tell me when it's done". (https://speakerdeck.com/trent/pyparallel-how-we-removed-the-...)
; Preparing to listen the incoming connection (passive socket)
; int listen(int sockfd, int backlog);
; listen(sockfd, 0);
mov eax, 102 ; syscall 102 - socketcall
mov ebx, 4 ; socketcall type (sys_listen 4)
; listen arguments
push 0 ; backlog (connections queue size)
push edx ; socket fd
mov ecx, esp ; ptr to argument array
int 0x80 ; trap into the kernel (this, together with the MOVs above, is the equivalent of the "receive packet" call)
We in userland are the ones responsible for invoking that interrupt in another thread and deciding what that thread should call when it's done. The low-level stuff is only concerned with system calls, not managing them for us.
Why even have netfilter ("iptables") loaded in the kernel at all? Won't those two rules still have to be evaluated for each packet even if the rules are saying not to do anything?
There are additional things at play here, too, including what the NIC driver's strategy for interrupt generation is and how interrupts are balanced across the available cores, whether there are cores dedicated to interrupt handling and otherwise isolated from the I/O scheduler, various sysctl settings, etc.
There's further gains here if you want to get really into it.
Actually the Linux kernel is rather fast if used right.
The MikroTik CCR1036 series routers have a 36-core Tile CPU with each core running at 1.2 GHz; they can cram out 15 million pps. https://www.youtube.com/watch?v=UNwxAjJ4V4A
Where does the funny 50 kpps per core idea in the lead-in come from? This would mean falling far short of 1 GigE line rate with 1500-byte packets! That is trivially disproven by the everyday experience of anyone who's run scp over their home LAN or a crossover cable.
The lead-in wasn't talking about UDP. And the article is specifically showing you how to use the multiple-packets-per-call APIs (sendmmsg et al.), hitting 350 kpps with simple single-threaded code and UDP, then 1.4 Mpps with tuning and parallelism.
And even if you were using the slow API, even with ping:
[7 year old desktop]$ sudo ping -q -f -c 100000 localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
--- localhost.localdomain ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 954ms
rtt min/avg/max/mdev = 0.004/0.004/0.195/0.002 ms, ipg/ewma 0.009/0.005 ms
Google around for some iperf or netperf results that people are generally getting with Linux.
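And to make the batched-send API mentioned above concrete, a minimal sendmmsg() sketch (fd is assumed to be a connected UDP socket; the payload and batch size are placeholders):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>

#define BATCH 32

int send_batch(int fd)
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    char payload[] = "x";

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload;
        iov[i].iov_len  = sizeof(payload);
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* One syscall for the whole burst instead of BATCH sendto() calls. */
    int sent = sendmmsg(fd, msgs, BATCH, 0);
    if (sent < 0)
        perror("sendmmsg");
    return sent;
}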
http://man7.org/linux/man-pages/man2/recvmmsg.2.html
Another alternative is pcap and PF_RING, as seen here: https://github.com/robertdavidgraham/robdns
That might be useful. Previous discussion on robdns: https://news.ycombinator.com/item?id=8802425