I’m really looking forward to seeing the original commenters reply to this. But I’ll share my experience too.
I’ve found UDP to be great for latency but pretty awful for throughput, especially over longer routes (i.e. inter-region transports). Also, if you fire UDP packets out of a machine in a tight loop there’s every chance you’ll overload various buffers and just lose them (depending on the networking hardware).
TCP is comparatively amazing for throughput, but you do take a latency hit (especially on the initial handshake, which doesn’t exist for UDP).
There are some very experienced people commenting here though, and I’d be happy to be corrected or expanded upon.
Anecdotal, but I have some experience running both TCP- and UDP-based VPNs over long-latency links (I worked from halfway around the globe for some years).
With OpenVPN it's easy enough to test - configure for UDP, or configure for TCP. With long latency and a tiny amount of packet loss, TCP over a TCP OpenVPN tunnel completely stalls, while TCP over a UDP OpenVPN tunnel is excellent - around the same performance as running TCP directly, or sometimes actually better. At work we've also used other types of VPN setups (for engineers on the road), and the TCP-based ones (we've used several) work fine most of the time, but if you try them from far away they become nearly unusable, while UDP OpenVPN continues to work basically just fine.
The TCP over TCP VPN performance problem (over long latency links) presumably has to do with windowing and ack/nak on top of windowing with ack/nak.
The TCP over TCP performance problem can be summarized as follows:
Because the underlay TCP is lossless (being TCP), every time the overlay TCP has to retransmit, it adds to the queue of things that the underlay TCP has to retransmit (and the need to retransmit happens more or less at the same time).
So instead of a linear increase in the number of packets, you get a ~quadratic one.
This balloons the throughput required to “rectify” the issue at both levels of the protocol stack - usually precisely at the point when there’s not enough capacity in the first place (the packet loss is supposed to signal congestion).
If you are very lucky, the link recovers fast enough that this ballooning is small enough to be absorbed by the newly available capacity.
If the outage is long enough, the rate of build-up of retransmits exceeds the capacity of the network to send them out - so it never recovers.
Needless to say, the issue is worse with a large window in the overlay TCP session - e.g. a sudden connectivity blip in the middle of a file transfer.
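To make the shape of that concrete, here’s a toy counting model (not a real TCP implementation, just an illustration of the bookkeeping) showing linear vs. ~quadratic growth during an outage:

```python
# Toy model of a connectivity blip: each "tick" is one retransmission timeout.
# The overlay TCP resends its window every tick; because the underlay TCP is also
# reliable, every one of those copies stays queued and keeps being re-offered.
def packets_during_outage(outage_ticks, window=100):
    plain_tcp = 0        # what a single TCP layer would resend (linear)
    tcp_over_tcp = 0     # what the stacked layers end up carrying (~quadratic)
    underlay_queue = 0
    for _ in range(outage_ticks):
        plain_tcp += window            # one retransmission of the window
        underlay_queue += window       # overlay's retransmits pile into the underlay
        tcp_over_tcp += underlay_queue # underlay re-offers everything it has queued
    return plain_tcp, tcp_over_tcp

for ticks in (1, 5, 10, 20):
    print(ticks, packets_during_outage(ticks))  # e.g. 20 -> (2000, 21000)
```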
> I’ve found UDP to be great for latency but pretty awful for throughput.
UDP/multicast can provide excellent throughput. It's the de facto standard for market data on all major financial exchanges. For example, the OPRA feed (which is a consolidated market data feed of all options trading) can easily burst to ~17Gbps. Typically there is an "A" feed and a "B" feed for redundancy. Now you're talking about ~34Gbps of data entering your network for this particular feed.
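For a rough sense of what subscribing to one leg of such a feed looks like at the socket level, here's a minimal sketch - the group and port are documentation placeholders, not a real OPRA address, and a real feed handler also arbitrates between the A and B feeds by sequence number:

```python
import socket, struct

GROUP, PORT = "233.252.0.1", 12345   # placeholder multicast group/port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Joining the group sends an IGMP membership report so the network forwards the feed here.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, src = sock.recvfrom(65535)
    # A real feed handler would decode the message and dedup A vs. B by sequence number here.
```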
Also, when network engineers do stress testing with iperf, we typically use UDP to avoid TCP overhead getting in the way.
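A bare-bones UDP blast in that spirit looks roughly like this (host/port are placeholders, and only the receiver can tell you what actually arrived):

```python
import socket, time

def udp_blast(host="192.0.2.10", port=5201, seconds=5, payload=1400):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    data = b"\x00" * payload
    sent = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        sock.sendto(data, (host, port))   # fire-and-forget: no ACKs, no congestion control
        sent += 1
    mbps = sent * payload * 8 / seconds / 1e6
    print(f"offered ~{mbps:.0f} Mbit/s; only the receiver knows what actually arrived")
```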
That’s interesting. And I’m sure they have some very knowledgeable people working for them who may(/will) know things I don’t.
That being said, it wouldn’t surprise me if they were pushing 17G of UDP on 100G transports, probably with some pretty high-end/expensive network hardware with huge buffers. I.e. you can do it if you’ve got the money, but I bet TCP would still have better raw throughput.
Yep, 100G switches are common nowadays since the cost has come down so much, and you can easily carve a port into 4x10G, 4x25G, or 40G. In financial trading you tend to avoid switches with huge buffers, as that comes at a huge cost in latency. For example, 2 megabytes of buffer is 1.68ms of latency on a 10G switch, which is an eon in trading. Most opt for cut-through switches with shallow buffers measured in 100s of nanoseconds. If you want to get really crazy there are L1 switches that can do 5ns.
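To make that number concrete, the buffer-drain arithmetic is just:

```python
# Worst case: a packet arrives behind a completely full 2 MB buffer on a 10G port.
buffer_bits = 2 * 1024 * 1024 * 8
line_rate_bps = 10e9
print(f"{buffer_bits / line_rate_bps * 1e3:.2f} ms")  # ~1.68 ms of queuing delay
```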
That is a really good point that I hadn’t considered. Presumably this comes at the risk of dropped packets if the upstream link becomes saturated? Does one just size the links accordingly to avoid that?
Basically yes, but the links themselves are controlled by the exchanges (and tied in to your general contract for market access).
In general UDP is not a problem in this space because of overprovisioning. Think "algorithms are for people who don't know how to buy more RAM", but with a financial industry budget behind it.
It’s actually pretty easy to monitor the throughput with the right tools. The network capture appliance I use can measure microbursts at 1ms time intervals. With low-latency/cut-through switches there are limited buffers by design. You are certain to drop packets if you are trying to subscribe to a feed that can burst to 17Gbps on a 10Gbps port.
Market data typically comes from the same RP (rendezvous point) per exchange. Some exchanges split them by product type. Typically there’s one or two ingress points (two for redundancy) into your network at a cross connect in a data center.
Have you tried to get inline timestamping going on those fancy modern NICs that support PTP? Orders of magnitude cheaper than new ingress ports on that appliance whose name starts with a "C", and also _really_ cool to have more perspectives on the network than switch monitor sessions.
UDP is little more than IP, so there isn't a technical reason why UDP couldn't be just as fast as TCP _per se_. But from when I was toying with writing a stream abstraction on top of UDP in Linux userspace, I came to the same conclusion: it's hard to achieve high throughput.
My guess is that this is partly because achieving high throughput on IP is hard, and partly because it's never going to be super efficient at this level (in userspace, on top of kernel infrastructure that might not be as optimized for throughput as it is in the case of TCP).
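To give a flavour of why: even the crudest stop-and-wait reliability in userspace already needs sequence numbers, an ACK path, and a retransmit timer (a hypothetical sketch, not my actual code), and that still leaves windowing, reordering, and congestion control to build:

```python
import socket, struct

SEQ = struct.Struct("!I")   # 4-byte sequence number prepended to each datagram

def send_reliable(sock, addr, seq, payload, timeout=0.2, retries=5):
    """Stop-and-wait: send one datagram and block until the peer echoes its seq back."""
    pkt = SEQ.pack(seq) + payload
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(pkt, addr)
        try:
            ack, _ = sock.recvfrom(SEQ.size)
            if len(ack) == SEQ.size and SEQ.unpack(ack)[0] == seq:
                return True          # delivered (or at least acknowledged)
        except socket.timeout:
            continue                 # datagram or ACK lost: retransmit
    return False
```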
UDP is just a protocol. I've served millions - even billions - of people with UDP media delivery. I use it all the time for all my work communication (WireGuard).
I wouldn’t use it to ping my gateway though, or to join a multicast group, nor would I use it to establish my BGP session - I use ICMP, IGMP, and TCP for those.
Amazon used UDP over multicast for request/response, where the responses would sometimes be very large, and implemented reliability on top of that through fallback to UDP unicast. This was all using Tibco RVD (taken from Bezos' experience in finance on the East Coast before Amazon, I think).
The really key point there is probably the size of the responses - it wasn't just tiny atomic bits of stock information.
At one point, as a systems engineer, I actually had to bump up the size of the UDP socket buffers the kernel would allow, across the entire production set of servers. SWEs were really hammering on UDP hard (the platform framework was sort of "sold" as being better than TCP, even though TCP doesn't have those kinds of limits).
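For context, the modern Linux equivalent of that change looks roughly like this (illustrative values, not the ones we actually used): raise the kernel caps, then have the application opt in per socket:

```python
# Kernel-side caps (run as root), illustrative values only:
#   sysctl -w net.core.wmem_max=8388608
#   sysctl -w net.core.rmem_max=8388608
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 8 * 1024 * 1024)  # clamped to wmem_max
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))  # Linux reports roughly double the request
```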
The result was that one Christmas the traffic scaled up to the point that the switch buffers were routinely overflowing. There is no slow start in UDP, so the large payloads the SWEs were sending would go out as fast as the NICs could send them, which resulted in filling up the packet buffers in the 6509s (Sup 720s at the time, I think? Whatever it was, the network engineers had already upgraded to Cisco's latest and greatest and had tuned the switch buffers).
What made it even more fun was that as packets were dropped on the multicast routes, the unicast replies created a bit of a bandwidth-amplification attack. Eventually the switch buffers started dropping IGMP packets, and if you drop enough of those in a row then IGMP snooping fails and the multicast routes themselves start getting torn down. Now one of the destination nodes sees "packet loss" that is total. When it eventually rejoins, it has fallen far behind all its peers (causing a bunch of issues of its own while it's out of sync), and it requests more unicast messages to get caught up, creating even more of a flood of rapidly-sent UDP.
What I wound up doing was writing scripts to log into all the core switches, dump out the multicast tables, convert the IGMP-snooped routes into static routes, and reapply them. That let the multicast network grow as the site scaled for Christmas, but kept all the routes in place and avoided the IGMP route flapping.
But even with that band-aid it still didn't work well, and there was still high congestion and packet loss across the core switches. There were also problems with the CPUs on the switches, and Amazon drafted an extension to how multicast routing was done and got Cisco to implement it ("S,* routing"? IDK if that's right, it's been 20 years). And it was a good job that the network engineers had ripped out spanning tree and gone entirely L3, since the packet loss and CPU congestion would have caused spanning tree to flap, which would have amplified all the congestion issues. Eventually Tibco RVD was ripped out and a TCP-based, gossip-based clustering protocol was put in its place.
So if you use UDP-based stuff the data packets need to be small, or else you need to throttle the senders somehow, and you need to not care about reliability. For stock ticker information it might work well, and for multimedia streaming where the protocol layer above it does slow start and congestion control. I suspect that if you dug up the network engineers responsible for those networks, they could tell you stories about packet loss. If UDP works well at your company, my suspicion is that you've either got a protocol sitting on top of UDP which implements at least half of what TCP offers, and/or you've got an overworked network engineer trying to keep it all together, and/or you just haven't scaled enough yet. I also wouldn't be too surprised if some Wall Street firms have switched to RDMA over InfiniBand or something like that with link-layer and end-to-end credit-based flow control[*] (as this paper points out, though, RDMA has issues of its own and doesn't meet all the criteria for a TCP replacement, but it would at least stop the packet loss issues due to buffers overflowing).
QUIC is a good example of what you need to do in order to use UDP (Section 4 of RFC 9000 is all about flow control, to prevent fast senders from DoS'ing your network switches). But the average HN/reddit reader who reads something about how TCP is awful and has the "showerthought" of wondering why everyone doesn't just switch to UDP in the datacenter is missing a massive problem: Ethernet has no flow control and just promiscuously drops packets everywhere, so if you thoughtlessly slap UDP on top of that, your datacenter will absolutely have a meltdown. You need something like QUIC at a minimum.
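The bare-minimum version of "throttle the senders somehow" is sender-side pacing - not QUIC's flow control, just a toy token bucket to show the shape of it:

```python
import time

class TokenBucket:
    """Allow at most `rate` bytes/sec of UDP sends, with a bounded burst."""
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True      # OK to sendto() now
        return False         # caller should sleep or queue instead of blasting the switches

# bucket = TokenBucket(rate=100e6 / 8, burst=64 * 1024)   # ~100 Mbit/s, 64 KB burst
```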
And buried in what I wrote above is the observation that UDP multicast doesn't really solve the reliable delivery across multiple servers and the failover of streams that you'd like to see - it's another solution that is simple and wrong (and one which it looks like Homa is trying to address).
[*] On second thought they probably massively overprovision their network since mostly they just care in the extreme about latency at the expense of everything else (which is a very unusual use case).
UDP is used successfully in many places.