We need a replacement for TCP in the datacenter [pdf] (stanford.edu)
478 points by kristianp on Oct 31, 2022 | 313 comments



Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

What's kind of hilarious about this paper is, these are just the network-layer problems! It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

And that's just TCP. We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.


> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is ridiculous.

Hyperscalers see an immediate ROI from efficiency/reliability improvements and actively invest in TCP alternatives all of the time. It's just really hard.

Networking companies see an ability to differentiate their products from their peers and work on this kind of thing as well. I did a 3 second google for "QUIC acceleration Mellanox" and got a hit on Nvidia's blog right away.

You just can't trivially replace something with an investment totaling 50 years of clock time and thousands of years of engineer time. It will either take a long time or a massive shift in needs/technology. FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.


> 50 years of clock time and thousands of years of engineer time

It's not just the size of the investment, it's that it's the protocol everyone uses to talk to other people's machines, and you can't upgrade or replace other people's machines.


In this case we're talking about within the Datacenter, and you could conceivably update every network device and system to talk the new thing if you wanted. This is more achievable at a hyperscaler, where there tend to be < 3 distinct protocols, proxies, etc.

TCP gives you three things:

1. Reasonable performance - This is hard but not impossible to replicate.

2. Reliability - This is very hard to replicate because networking edge cases are very hard to isolate.

3. Fairness - This one is roughly impossible, because the "fairness" is an artifact of the experimentation and tweaking of Congestion Control Algorithms.

To elaborate on fairness, dynamic traffic control of all flows within a DC while maintaining high utilization is roughly impossible. You can get really close to this by picking your battles wisely (i.e. solid demand control for data warehouse workloads), but you'll always end up counting on individual flows to react appropriately to loss. They need to back off enough to make room for others without tanking their own throughput.
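
To make "back off enough without tanking their own throughput" concrete, here's the textbook AIMD shape as a toy loop - illustrative only, the constants are made up, and real CCAs (CUBIC, BBR, etc.) are far more sophisticated:

    # Toy AIMD (additive increase, multiplicative decrease), one step per RTT.
    def aimd(loss_events, cwnd=10.0, alpha=1.0, beta=0.5):
        trace = []
        for lost in loss_events:
            if lost:
                cwnd = max(1.0, cwnd * beta)   # back off hard on loss
            else:
                cwnd += alpha                  # probe for more bandwidth
            trace.append(cwnd)
        return trace

    print(aimd([False] * 5 + [True] + [False] * 5))

The entire "fairness" property falls out of every flow following a rule like this; tuning alpha/beta (and everything else) against real traffic is the empirical part described below.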

The people who design and implement these algorithms are definitely geniuses, but even they rely on TONS of empirical evidence to narrow parameters to what's appropriate. Of the Kernel Networking people I've worked with, Lawrence Brakmo had the most sophisticated network testing harness I've seen. Even then, you don't really know if it works (and can't finish tuning it) until you run it in production.

Running novel congestion control algorithms in production at a sufficient scale to figure out whether or not they're working appropriately is a great way to kill your network, so we end up conducting the equivalent of CCA drug testing to roll it out slowly and safely.

The end result of all of this is that it's really hard to solve the "arbitrary connections sharing arbitrary network topologies with high utilization" problem quickly enough for it ever to look like a breakthrough rather than just steady progress.

It's also worth noting that it's usually easiest to prove performance, so you'll see a lot of excitement about performance benchmarks from people who don't yet know what they're about to learn about networking. We were very much in this camp at Facebook when we were all-in on memcache-over-udp, and we later abandoned it completely.


After having lived through Amazon's early (pre-2003ish) UDP-based networking I got a laugh around 2006-ish or so reading about how facebook was into UDP. I assume there are people who worked there who still have the scars.


Do you have any specific problems with UDP that you can elaborate on?

UDP is used successfully in many places.


I’m really looking forward to seeing the original commenter’s reply on this. But I’ll share my experience too.

I’ve found UDP to be great for latency but pretty awful for throughput. Especially over longer routes (i.e. inter-region transports). Also, if you fire UDP packets out of a machine in a tight loop then there is every chance you could overload various buffers and just lose them (depending on the networking hardware).

TCP is comparatively amazing for throughput, but you do take a latency hit (especially on the initial handshake, which doesn’t exist for UDP).

There are some very experienced people commenting here though, and I’d be happy to be corrected or expanded upon.


Anecdotal, but I've some experience in running both TCP- and UDP-based VPN over long-latency links (I worked from halfway around the globe for some years).

With OpenVPN it's easy enough to test - configure for UDP, or configure for TCP. With long latency and a tiny amount of packet loss, running TCP over TCP OpenVPN completely stalls, while TCP over UDP OpenVPN is excellent - it's around the same performance as running direct TCP, or sometimes actually better. At work we've also used other types of VPN setups (for engineers on the road), and the TCP-based ones (we've used several) work fine most of the time, but if you try that from far away it becomes nearly unusable while UDP OpenVPN continues to work basically just fine.

The TCP over TCP VPN performance problem (over long latency links) presumably has to do with windowing and ack/nak on top of windowing with ack/nak.


The TCP over TCP performance problem can be summarized as follows:

Because the underlay TCP is lossless (being TCP), every time the overlay TCP has to retransmit, it adds to the queue of things that the underlay TCP has to retransmit (and the need to retransmit happens more or less at the same time).

So instead of linear increase in the number of packets, you get ~quadratic.

This balloons the required throughput needed to “rectify” the issue from the protocol standpoint at both levels - usually precisely at the point when there’s not enough capacity in the first place (the packet loss is supposed to signal congestion).

If you are very lucky, the link recovers fast enough that this ballooning is small enough to be absorbed by the newly available capacity.

If the outage is long enough, the rate of build-up of retransmits exceeds the capacity of the network to send them out - so it never recovers.

Needless to say, the issue is worse with large window in overlay TCP session - e.g. a sudden connectivity blip in the middle of the file transfer.
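
A deliberately crude back-of-the-envelope version of that ballooning (treating each layer as independently retrying until success, which oversimplifies the timer interactions but shows the compounding):

    # Toy model, not a faithful TCP simulation. With per-packet loss
    # probability p on the real link, plain TCP needs ~1/(1-p) wire sends
    # per segment; stacking a second reliable layer on top roughly squares
    # that during a loss burst.
    for p in (0.01, 0.05, 0.20):
        single = 1 / (1 - p)       # plain TCP over the lossy link
        stacked = single ** 2      # TCP tunnelled over TCP, worst case
        print(f"p={p:.2f}  plain ~{single:.2f}x packets  stacked ~{stacked:.2f}x")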


> I’ve found UDP to be great for latency but pretty awful for throughput.

UDP/multicast can provide excellent throughput. It's the de facto standard for market data on all major financial exchanges. For example, the OPRA feed (which is a consolidated market data feed of all options trading) can easily burst to ~17Gbps. Typically there is a "A" feed and a "B" feed for redundancy. Now you're talking about ~34Gbps of data entering your network for this particular feed.

Also, when network engineers do stress testing with iperf we typically use UDP to avoid issues with TCP/overhead.
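
For anyone curious what subscribing to such a feed looks like at the socket level, a minimal multicast receiver is roughly the sketch below (group, port and buffer size are made up; real consumers also do sequence-number gap detection and A/B feed arbitration):

    import socket, struct

    GROUP, PORT = "239.1.2.3", 5000   # hypothetical feed group/port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Big receive buffer: bursts arrive faster than the app can drain them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024 * 1024)
    sock.bind(("", PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        # A real consumer would check the feed's sequence numbers here.
        print(len(data), "bytes from", addr)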


That’s interesting. And I’m sure they have some very knowledgeable people working for them who may(/will) know things I don’t.

That being said, it wouldn’t surprise me if they were pushing 17G of UDP on 100G transports. Probably with some pretty high-end/expensive network hardware with huge buffers. I.e. you can do it if you’ve got the money, but I bet TCP would still have better raw throughput.


Yep, 100G switches are common nowadays since the cost has come down so much, and you can easily carve a port to 4x10G, 4x25G, and 40G. In financial trading you tend to avoid switches with huge buffers as that comes at a huge cost in latency. For example, 2 megabytes of buffer is 1.68ms of latency on a 10G switch, which is an eon in trading. Most opt for cut-through switches with shallow buffers measured in 100s of nanoseconds. If you want to get really crazy there are L1 switches that can do 5ns.
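
The 1.68ms figure is just buffer size divided by line rate:

    buffer_bits = 2 * 1024 * 1024 * 8     # 2 MiB of buffer, in bits
    line_rate_bps = 10e9                  # 10 Gb/s
    print(buffer_bits / line_rate_bps * 1e3)   # ~1.68 ms of queuing if the buffer fills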


That is a really good point that I hadn’t considered. Presumably this comes at the risk of dropped packets if the upstream link becomes saturated? Does one just size the links accordingly to avoid that?


Basically yes, but the links themselves are controlled by the exchanges (and tied in to your general contact for market access).

In general UDP is not a problem in the space because of overprovisioning. Think "algorithms are for people who don't know how to buy more RAM", but with a financial industry budget behind it.


Do the vendors actually convince anyone to buy those hubs rebranded as "ultra high performance 5ns L1 switches"?


Multicast throughput is hard to measure because it is... well, multicast.

Depending on where your RP's are, and how you are transmitting multicast packets across a core, multicast performance can vary a lot.

The main advantage of multicast, however, is that throughput between RPs doesn't need to be very large.


It’s actually pretty easy to monitor the throughput with the right tools. The network capture appliance I use can measure microbursts at 1ms time intervals. With low latency/cut-through switches there are limited buffers by design. You are certain to drop packets if you are trying to subscribe to a feed that can burst to 17Gbps on a 10Gbps port.

Market data typically comes from the same RP per exchange in most cases. Some exchanges split them by product type. Typically there’s one or two ingress points (two for redundancy) into your network at a cross connect in a data center.


Have you tried to get inline-timestamping going on those fancy modern NICs that support PPT? Orders of magnitude cheaper than new ingress ports on that appliance whose name starts with a "C", also _really_ cool to have more perspectives on the network than switch monitor sessions.


UDP is little more than IP, so there isn't a technical reason why UDP couldn't be just as fast as TCP _per se_. But from when I was toying with writing a stream abstraction on top of UDP in Linux userspace, I came to the same conclusion, it's hard to achieve high throughput.

My guess is that this is in part because achieving high throughput on IP is hard, and in part because it's never going to be super efficient at this level (in userspace, on top of kernel infrastructure that might not be as optimized for throughput as it is in the case of TCP).
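
The depressing arithmetic: if your userspace "stream" does the obvious stop-and-wait thing (one datagram in flight, wait for its ACK), throughput is capped at payload/RTT no matter how fast the link is (numbers below are made up but representative):

    payload_bits = 1400 * 8            # one MTU-ish datagram
    for rtt_ms in (0.05, 0.5, 5.0):
        bps = payload_bits / (rtt_ms / 1e3)
        print(f"RTT {rtt_ms} ms -> {bps / 1e6:.1f} Mbit/s ceiling with one packet in flight")

Which is why you end up reinventing windows, cumulative ACKs, retransmit timers and pacing - i.e. most of TCP - before the throughput becomes respectable.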


You can use eBPF/DPDK these days for hardware offload.


What about QUIC? Do you think that HTTP/3 will suffer from throughput as well?


UDP is just a protocol. I’ve served millions - even billions - of people with UDP media delivery. I use it all the time for all my work communication (WireGuard)

I wouldn’t use it to ping my gateway though, or to join a multicast group, nor would I use it to establish my bgp session, I use icmp, igmp and tcp for that.


Amazon used UDP over multicast for request/response when sometimes the responses would be very large, and implemented reliability on top of that through fallback to UDP unicast. This was all using Tibco RVD (taken from Bezos' experience in finance on the East Coast before Amazon, I think).

The really key point there is probably the size of the responses, it wasn't just tiny atomic bits of stock information.

At one point as a system engineer I actually had to bump up the size of UDP socket data that the kernel would allow to be sent, across the entire production set of servers. SWEs were really hammering on UDP hard (the platform framework was sort of "sold" as being better than TCP, though TCP doesn't have those kinds of limits).

The result was that one Christmas the traffic scaled up to the point that the switch buffers were routinely overflowing all the time. There was no slow start in UDP so the large payloads the SWEs were sending would go out as fast as the NICs could send them, which resulted in filling up packet buffers in the 6509s (Sup 720s I think at the time? Whatever it was the network engineers had already upgraded to whatever was Cisco's latest and greatest at the time and had tuned the switch buffers).

What made it even more fun was that as packets were dropped on the multicast routes, the unicast replies created a bit of a bandwidth-amplification-attack. Then eventually the switch buffers started dropping IGMP packets, and if you drop enough of those in a row then IGMP snooping fails and the multicast routes themselves start getting torn down. Now one of the destination nodes gets "packet loss" that is complete (it receives nothing at all). Then when it eventually rejoins it has fallen far behind all the peers (causing a bunch of issues while it was out of sync) and then it requests more unicast messages to get caught up, creating even more of a flood of rapidly-sent UDP.

What I wound up doing is writing scripts to log into all the core switches and dump out the multicast tables and convert the IGMP snooped routes into static routes and reapply them. That let the multicast network grow as the site had to scale for Christmas, but kept all the routes in place and avoided the IGMP route flapping.

But even with that band-aid it still didn't work well and there was still high congestion and packet loss across the core switches. There were also problems with the CPU on the switches, and Amazon drafted an extension to how multicast routing was done and got Cisco to implement it ("S,* routing" IDK if that's right, it's been 20 years). And it was a good job that the Network Engineers had ripped out spanning tree and gone L3 entirely, since the packet loss and CPU congestion would have caused spanning tree to flap, which would have amplified all the congestion issues. Eventually Tibco RVD was ripped out and a TCP-based gossip-based-clustering protocol was put into place.

So if you use UDP based stuff the data packets need to be small, or else you need to throttle the senders somehow, and you need to not care about reliability. For stock ticker information it might work well, and for multimedia streaming where the protocol layer above it does slow start and congestion control. I suspect that if you dug up the network engineer responsible for those networks, though, they could tell you stories about packet loss. If UDP works well at your company my suspicion is that you've either got a protocol sitting on top of UDP which implements at least half of what TCP offers, and/or you've got an overworked network engineer trying to keep it all together, and/or you just haven't scaled enough yet. I also wouldn't be too surprised if some Wall Street firms have switched to RDMA-over-Infiniband or something like that with link-layer and end-to-end credit-based flow control[*] (as this paper points out, though, RDMA has issues itself and doesn't meet all the criteria for a TCP-replacement, but that would at least stop the packet loss issues due to buffers overflowing).

QUIC is a good example of what you need to do in order to use UDP (Section 4 of RFC 9000 is all about Flow Control to prevent fast senders from DoS'ing your network switches). But for the average HN/reddit reader who reads something about how TCP is awful and has the "showerthought" of wondering why everyone doesn't just switch to UDP in the datacenter, they're missing a massive problem: Ethernet has no flow control and just promiscuously drops packets everywhere, so if you thoughtlessly slap UDP on top of that your datacenter will absolutely have a meltdown. You need to use something like QUIC at a minimum.
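
To give a flavour of what "at least half of what TCP offers" means in practice, here's a toy version of the receiver-granted-credit idea (loosely in the spirit of RFC 9000's MAX_DATA limits, not actual QUIC; the names are made up):

    class Sender:
        """Connection-level credit: never have more bytes outstanding
        than the receiver has explicitly granted."""
        def __init__(self):
            self.bytes_sent = 0
            self.limit = 0

        def on_max_data(self, new_limit):
            self.limit = max(self.limit, new_limit)   # credits only grow

        def try_send(self, payload):
            if self.bytes_sent + len(payload) > self.limit:
                return False    # blocked until the receiver grants more credit
            self.bytes_sent += len(payload)
            return True

    s = Sender()
    s.on_max_data(4096)
    print(s.try_send(b"x" * 3000), s.try_send(b"x" * 3000))   # True False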

And buried in what I wrote above is an observation that UDP multicast doesn't really solve reliable delivery across multiple servers and failover of streams that you'd like to be able to see, that's another solution which is simple and wrong (and which it looks like Homa is trying to address).

[*] On second thought they probably massively overprovision their network since mostly they just care in the extreme about latency at the expense of everything else (which is a very unusual use case).


> Ethernet has no flow control

Isn't this what pause frames and PFC are for? (Honest question)


Multicast storms happened regularly back in 2004


True, there were tons of crappy hardware still in production at that time. The first job I had out of college consisted of crappy 3Com hubs (not switches), so something like Norton Ghost could take down the whole network since multicast would get flooded everywhere. Nowadays this is less of a problem as hubs are long gone and most switches have IGMP snooping by default and would only forward multicast frames that someone wants.

A bad client can still cause problems though, like sending a high rate of multicast packets with a TTL of 1.


Amazon was definitely not run off of crappy 3com hubs, not even back then.


> In this case we're talking about within the Datacenter

Oh gotcha. It's right there in the title, but I missed it somehow :p


Yes you can. Just offer a better product, and people will buy it instead of the old or bad product. Better yet, make the new product backwards compatible, and fewer people will have qualms about forking out for it. Better yet, do an aggressive takeover, like Microsoft did, and just force the entire industry to adopt your stuff...


Great! When do you think you'll have it done?


Done? What do you mean "done"? Consulting hours are much better on projects that cannot ever be finished!1


You mean like IPv6?


I think QUIC/http2 is a much better example.

Google made that happen almost unilaterally via their Chrome dominance.


I mean, this is how new features come about, for the most part (look at ajax, from Microsoft's IE dominance). The consortium allows anyone to contribute, not just the dominant browser, but the dominant browser will always be able to experiment with new web features without having to discuss it with anyone.


> FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.

Maybe we were just early in giving (HFT) customers RDMA back in ~2007[1][2] but I don't see it entering the mainstream anytime soon. And after a relatively short 20 years of adoption, the "next" thing for hyperscalers is not going to be the next thing for everyone else.

[1] https://downloads.openfabrics.org/Media/IB_LowLatencyForum_2...

[2] https://www.thetradenews.com/wombat-and-voltaire-break-milli...


HFT networks are also a lot smaller than hyperscaler datacenters, and designed with more cross-sectional bandwidth. A good chunk of the traffic (trading-related messages) also tends to not use congestion control.

In large web company datacenters, RDMA and RoCE have had a much "rockier" path forward.


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is severe bullshit on two fronts:

- there is an immediate return on value - Google was driving this a decade+ ago for improvements in the data center (things like doubled+cancelable rpc, tcp cubic, quic, etc)

- academia constantly attempts to make these improvements as well because researchers are super incentivized to dethrone tcp for the glory. There are constant attempts to re-invent various layers (IP, tcp, the non existent upper layers of the OSI, etc) that come out of academic conferences every year.

The reason we’re still here is because our current stacks have been heavily optimized and tooled for production workloads. NICs can transparently re-assemble TCP segments for the OS and they can segment before transmit. You have to have a damn good value prop to throw away everything from software and hardware to careers and curriculum. It has to be a shitload better than the security nightmare of “return back each layer of the stack”.


I don't think you realise why this is so hard.

The basic reason is that software at every level expects TCP/IP. And you can't drop in a translation layer because it will require at least the same amount of overhead as "real" TCP/IP.

It is not a local problem, it is a global problem that affects basically every single piece of non-trivial software in existence.

Even if you construct your datacenter with the new protocol you will run into problems that you can't run anything in it. Want Python? Sorry, have to rewrite it. And every Python library. And every Python application. Then you need to deal with problems that people who can run their scripts on their machines can't run them in datacenter. And so on.

The reason nobody wants to do this is that they would be investing huge amount of money to solve a problem for everybody else. Because the only way to make TCP/IP replacement work is to make it completely free and available to everybody.

There are much better ways to allocate your funds and precious top level engineers that let them distance themselves from competition temporarily.


> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

And yet every time hardware designers get the chance they redesign Ethernet and IPv4--poorly.

See: HDMI 2.0+, USB 3.0+, Thunderbolt 3.0+, etc.

My suspicion is that this paper works fine between pairs of peers and immediately goes straight to hell after that. It is extremely suspicious that there is zero mention of SCTP and that it only compares to TCP and not UDP.

The problem with RPC is that multiple organizations must agree on meaning. And that's just not going to fly. It is damn near a miracle that a huge number of institutions all agree on the Ethernet/IP command "Please take this bag of bytes closer to the machine named: <string of bytes>."


Why do you say that these protocols are worse than Ethernet/IPv4? I'm not intimately familiar with any at L2/L3, but I don't think any have hacks as bad as ARP. (USB does have some weirdness at L1 though I know.)


I’ve generally considered IPv6 neighbor discovery to be a worse hack than ARP. ARP is a straightforward, fairly clean hack to layer the IPv4 addressing scheme over Ethernet, and it doesn’t pollute IPv4 itself. Neighbor discovery layers IPv6 on top of pseudo-IPv6, where the latter operates without knowledge of MAC addresses but nonetheless hardcodes knowledge of Ethernet. But hey, it eliminated the use of Ethernet broadcast in favor of a more complex but functionally identical multicast scheme.


Oh sure, the point is more that Ethernet/IP has to coordinate two separate ID spaces at all, whereas AFAIK no other packet-based protocol like the ones mentioned does this, so in that sense those protocols are better.


> Why do you say that these protocols are worse than Ethernet/IPv4?

Here's an example: I connected my nice expensive audio interface to my Thunderbolt port. It worked great! Then I moved a window on my monitor and all hell broke loose. In spite of the fact that it had way more than enough bandwidth to handle everything.

See, Thunderbolt doesn't have the ability to say "This tiny packet going to there needs priority and you need to break up those giant display packets."

Ethernet has solved problems like these in standards. They're not always implemented on particular chipsets, but they exist, and you generally can buy a product that has them.

Everything Ethernet has done and standardized has generally been for a reason. If you don't implement Ethernet, then you are starting over from scratch and will have to reimplement all of that stuff.

And you're probably not smarter than the guys who did it for Ethernet.

(If I'm being charitable: what's happened is that lot of standards tried to be more cost optimized than Ethernet. The problem is that transistor prices keep coming down. Eventually the price delta between Ethernet and <whatever> becomes inconsequential and you're basically left with real Ethernet and "kinda crappy" Ethernet at almost the same price.)


Never thought about that before. Ethernet supports extremely high levels of data transmission. USB-C for integrated charging and data transfer makes sense, but why are there HDMI cables?


I think it would be overkill to use a networking protocol to connect exactly two devices. Plus if you have something specific to video stream transfer you could maybe do some optimization specific to that use case, although I can't think of any at the moment.


Excepting HDMI, the parent's examples are all networks with more than one peer. Thunderbolt and USB3 can both have arbitrary trees of nodes.


The parent comment was specifically wondering about HDMI; the other examples were given as having some reason to exist, while he didn't see any reason for having HDMI instead of recycling some other protocol.


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols

I think this is more of an artefact of horizontal scaling and port-contention. De-facto standard discovery mechanism DNS does not work with ports, so "well-known port" abstraction kinda fails. Http as tunnel mostly avoids/sidesteps this problem.

> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them.

This is a weird take, or I don't understand it. If you can communicate with an edge node in another network, but the edge node has issues communicating with some inner node (on your behalf), then, as a user, you have no hope of fixing that connectivity issue anyway, regardless of whether a layered approach is used or not. This may be related to the previous point about HTTP as universal tunnel. Yes, this is a problem, but in the sense that communications are effectively terminated at the edge node and a monstrosity of stuff happens behind the scenes.


> De-facto standard discovery mechanism DNS does not work with ports

Yes, it does, see SRV records.
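
For example, with dnspython (the service name here is hypothetical):

    import dns.resolver   # pip install dnspython

    # SRV records carry the target host AND port (plus priority/weight),
    # so discovery isn't limited to A/AAAA plus a well-known port.
    for rr in dns.resolver.resolve("_myservice._tcp.example.com", "SRV"):
        print(rr.priority, rr.weight, rr.port, rr.target)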


I meant DNS A/AAAA queries with preconfigured/well-known ports being the default. While some applications/protocols/services do use some port discovery mechanism, I would argue it is nowhere close to being de-facto standard.


So true, but how many developers know about them? The API situation does not help either.


I would say ports are mainly a problem on layers below transport even though some tech overuse ports.


> Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

If this were true, then how do you explain that the likes of AWS, the same company who ended up investing in developing their own processor line, doesn't seem to think that any of the pet peeves you mentioned are worth fixing?



It's not obvious to me that replacing TCP really is harder than designing your "own" chip. Scarequotes here because those graviton chips (that's what you're referring to, I think?) are of course ARM chips, so they're not designing something fresh; they're adapting a very mature design to their own needs. In terms of interoperability, a custom chip based on a standard design is probably a simpler, more locally addressable problem than new network protocols.

Isn't it plausible that graviton was designed yet TCP retained simply because graviton as a project is easier to complete successfully?


> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

I work in a company that builds network troubleshooting/observability tools and we have some pretty experienced analysts to tell you what's wrong with the network. With that context, your idea of having any tool automatically diagnosing network issues is a pipe dream.

The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners, failures aren't always easily reproducible, and with human configuration (and therefore errors) almost every step of the way. Even if a tool that "returns each layer of the stack" would be useful, it still would be far from enough to diagnose issues.


"The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners" Ah, how people forget the early days of networking. I remember vividly the early days of the Networld/Interop trade show - Interop was in the name because if, as a vendor, your equipment couldn't integrate with the show network they would throw your booth off the show floor.

That's how bad interoperability in the early days was!


Every major corporation has multiple research organizations doing nothing but invest in things that don't have immediate shareholder value.

What you're talking about though isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale. The scale at which these changes needs to happen to be practical are at the cluster level in a single data center. If you can propose something that fits that bill there are a few companies willing to pay you millions in salary as an engineering fellow to do it.


> What you're talking about though isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale.

I'd put it differently: it's paying up hundreds of billions in infrastructure to have some sort of gain.

And which gain is that exactly?

I see a lot of "the world is dumb but I am smart" comments in this thread but I saw no one presenting any clear advantage or performance improvement claim about hypothetical replacements. I see a lot of "we need to rewrite things" comments but not a single case being made with a clear tradeoff being presented. Every single criticism of TCP/IP in this thread sounds like change for the sake of change, and changes that aren't even presented with tangible improvements in mind or a clear performance gain.

Wouldn't that explain why TCP is so prevalent, and no one in their right mind thinks of replacing it?


I mean the goal is more performance, especially if you can get more performance out of the same hardware. Faster setup times, faster connections, more connections, maybe faster teardown. Lower contention on saturated links. Inside of the datacenter is a controlled environment where something like that could work. Replacing TCP over the Internet at large is going to be an uphill battle. Still, if we're replacing the whole thing, then simpler code on the client and server end would be nice.


> I mean the goal is more performance, especially if you can get more performance out of the same hardware.

Are there actual numbers demonstrating this?

I mean, people are advocating wasting billions revamping infrastructure. What kind of performance are you hoping to buy with those billions? And are those gains worth it, or is it just change for the sake of change?

Sometimes things are indeed good enough.


> Every single criticism of TCP/IP in this thread sounds like change for the sake of change, and changes that aren't even presented with tangible improvements in mind or a clear performance gain.

It amuses me that many of those saying "we need to change" are the same ones that bemoan it when car manufacturers remove buttons or make glove boxes operational from a touch screen because they can.


It's figure 1 in the paper. Homa is over 10x faster than TCP (presumably CUBIC).


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

The paper argues, in the '3.1 Stream orientation' section, that stream orientation is a problem for TCP, and says that most apps send messages instead, and that a better protocol should handle messages natively, etc. Which is a good point, I think.

But back to TCP. What do you do, if you need to send Messages between applications in TCP? Preferably those Messages would be encrypted also.

You could make up your own protocol, but you probably would rather not! So you use something that is readily available, and does messages, encryption, etc. Would be nice if there were also a ready to use load balancers, caches, tools to debug it, etc

Now, what would be such a protocol.

Why HTTPS, of course.

So I kind of think that the lack of a low level Message Protocol has led us, as an industry, to coalesce these features bit-by-bit on top of HTTP. It's not perfect by any means, but it does the job.


HTTPS adds a tremendous amount of overhead to give you messaging. It's a lot better from a hyperscaler's perspective to replace TCP and not use the byte stream abstraction. After all, networks send messages. It's silly to throw that away at one layer and try to get it back at the next layer.


Surely most of your ideas are already being deployed in QUIC/HTTP3. It just happens inside a UDP datagram, for compatibility. Really you're not going to see any new IP protocol layers, there's too much quirky hardware on the network that wouldn't be able to handle it. If we can't even get IPv6 to work all the way to the client, we're never seeing new values for the protocol byte.


Don’t the hyperscaled cloud providers run totally segmented networks? What’s stopping them from using something proprietary internally and just exposing TCP at the end for termination of client connections?


Google already does that.


I’m not aware of them using something other than TCP internally (I’m sure by now they’ve migrated to QUIC but I’m not sure that QUIC necessarily solves some of the scaling challenges / optimizes for gRPC and low latency).


Google is using remote memory accesses rather than TCP for at least some classes of traffic (e.g. a caching system). They've been publishing details about how it all works too.

Also, they have a transport (Pony express) developed specifically for RPCs, rather than byte streams or datagrams.

Links: https://research.google/pubs/pub51341/, https://research.google/pubs/pub50590/, https://research.google/pubs/pub48630/, more generally https://research.google/pubs/?area=networking


Can someone ELI5 how remote memory access works?


I could be wrong but I believe they have a unified address space. There’s dedicated hardware that then owns a given memory range. On an access it will fetch it from the remote location matching that address on demand and store it in real memory in space allocated to it. Presumably it evicts stuff if there’s insufficient memory. Once the memory is brought over either a virtual address range is remapped to point to main memory or the ASIC just has a TLB itself.

This is pure speculation based on seeing the word ASIC in one of the summaries but it seems like it could be reasonable.


I don't think Google-internal communications happen over gRPC. Maybe the protocol was designed with an ambition to replace their internal RPC system but it probably failed at that.

They have a new system called Snap, although judging from the paper I don't think it can completely replace TCP: https://research.google/pubs/pub48630/ My understanding is that Snap enables new use cases, including moving functionality previously done via RPCs to RDMA-like one-sided operations. I think it is a complement to RPCs but does not replace them.


They do, it's called DCTCP. Although it's actually an open standard.


Your ideas are interesting, can you link to or explain a concrete example though? The idea of everything magically debugging itself doesn't apply to a single piece of software I've ever seen, so I'm curious what kind of design would lead to that being possible.


Here's an example of an improvement to sending large files over long distances -- the Tsunami protocol. It tries to get a best of both worlds to limit the detrimental effect of synchronous roundtrips in the TCP protocol for file transfers:

https://tsunami-udp.sourceforge.net/


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP

If you have ever used multiple TCP or UDP connections in parallel on a single machine (doesn't matter if server or client) then you should realize that ports are required.

Apart from that, you can run HTTP on other ports than 80. You can also use HTTP to load balance or do service discovery by means of redirects. (Caveat, I don't work in this field and can't say how solid the approach works in practice).


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is just not true. Stuff needs to be funded and worth doing, and the internet, like almost everything, is built on making things worth paying for, but there are also loads of improvements being made everywhere.


What is QUIC in your book?

Given, say, $50 million of dev time, what would you go about fixing? And in what way?


In addition to QUIC, KCP [1] is another reliable low-latency protocol that sits on top of UDP that might be interesting. And unlike RFC 9000/9001 (QUIC), encryption is optional. I haven't really seen it mentioned much outside of primarily China-focused projects, like V2Ray [2], but there is also some English information in their Git repo [3].

[1]: <https://github.com/skywind3000/kcp>

[2]: <https://www.v2fly.org/en_US/>

[3]: <https://github.com/skywind3000/kcp/blob/master/README.en.md>


KCP uses a brute force congestion control algorithm that is unfair and inefficient. It is also poorly specified, which is probably why it is less commonly used outside circumvention circles.


KCP is notably used by the popular mobile game Genshin Impact.


Still, it looks interesting for some use cases, even though it's not fair if it's fully utilized on the internet.


IMHO QUIC is nice, but a disappointment, since it could have been so much more.

Does not handle unreliable messages, still only (multi)streaming, no direct support for multicast, 0-RTT which needs a lot of stuff to be manually done TheRightWay or risk amplification attacks, the (imho) under-researched (and removed) forward error correction, and more.

I just restarted working on what I consider to be the solution to this, federated authentication and a bit more, but $50M is too far to be even a dream since I am not google.


Doesn't QUIC still run over TCP? I thought it was a replacement for HTTP not TCP (Edit: looks like it replaces TCP and HTTP)


QUIC runs over UDP, and provides streams and encryption. HTTP/3 is designed to take advantage of QUIC streams (replacing HTTP/2 streams which were problematic due to TCP head of line blocking).

The RFCs are a bit elaborate so folks interested might want to look at this instead[1], which has one of the RFC authors explaining the basics of QUIC and HTTP/3.

[1] https://www.youtube.com/watch?v=cdb7M37o9sU


It replaces TCP+TLS, and runs multiple streams on the same conn, supports transition from e.g. wifi to ethernet on at least one of the nodes. And since it's over UDP, implementations are mostly in user space. Which is good if you want it now, but not great for performance. IP packets are very small so you gotta have either kernel support for QUIC or batch IO, otherwise it's often CPU limited (yes, really). In addition congestion control is wonky, unfortunately. In my experience (quic-go), it's too shy in the presence of TCP streams, which end up getting more bandwidth. But that depends on the algorithm used, implementation and God knows what else.


I guess you were thinking about another clever name protocol, SPDY :-)

SPDY → HTTP/2

QUIC → HTTP/3


Nope, UDP.


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

I would affirm that. It's imho true for almost everything in IT tech.

How computers "work" today is just pure madness when you look at it any closer.

Everything's a result of some "historic accidents" back in the days, and from that the usual race to the bottom caused by market powers.

Nobody is willing to touch any of the lower layers no matter how crazy they are from today's viewpoint. We just shovel new layers on top to paper over the mistakes of the past. Nothing gets repaired, or, what would actually be more important, rethought from the ground up in light of new technological possibilities and changed requirements.

I understand from the economic standpoint how this comes. But I'm also quite sure we didn't make any fundamental improvements in the last 50 years of computing.

That's a very bad sign when everything in a field that's not even really 100 years old has been frozen in time for 50 years because everything's so fragile and complex that fundamental changes aren't possible. This looks like a textbook example of a house of cards…

Given how vital IT tech is to modern life I fear that this will crash at some point in the worst way possible.

And even if it won't crash, which I really strongly hope, we will never have nice things again as nothing of the old rotten things can be reasonably changed.


> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues

That's virtual networking, but that introduces latency if it's not well configured.

> But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf

Not really, assuming you have the right fabric, it's nowhere near as hard as that. Plus you seem to be forgetting that there is more to the network than TCP. There is a whole physical layer that has lots of semantics that greatly affect how easy it is to debug higher levels.


> We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

This violates the principle of encapsulation that the entire field of networking is based on, not to mention being a massive security hole.


I’ll defer to experts on the network-layer problems, but I’m not sure what you see as the problem with converging on HTTP. It’s awkward and inelegant, but as a backend application developer I never feel like it gets in my way.


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols.

Nobody forces them though. It would be much easier to publish a standard port number mapping than to develop a (or multiple) new protocols. Now you just need to motivate people to use it.


At last hopefully there is light at the end of the tunnel. Big question for me is who is going to build it?


I get where this is coming from, but no. We don't need to replace TCP in the datacentre.

Why?

because for things that are low latency, need rigid flow control, or other 99.99% utilisation cases, one doesn't use TCP. (Storage, which is high throughput, low latency and has rigid flow control, doesn't [well, ignore NFS and iSCSI] use TCP)

Look, if it really was that much of a problem then everyone in datacentres would move to RDMA over Infiniband. For shared memory clusters, that's what's been done for years, but for general purpose computing it's pretty rare. Most of the time it's not worth the effort. Infiniband is cheap now, so it's not that hard to deploy RDMA[1] type interconnects. Having a reliable layer 2 with inbuilt flow control solves a number of issues, even if you are just slamming IP over the top.

shit, even 25/100gig is cheap now. so most of your problems can be solved by putting extra nics in your servers and have fancypants distributed LACP type setups on your top of rack/core network.

The biggest issue is that it's not the network that's constraining throughput, it's either processing or some other non-network IO.

[1]I mean it is hard, but not as hard as implementing a brand new protocol and expecting it to be usable and debuggable.


The reality of today's large datacenters is that almost all of them have almost all of their traffic on TCP unless the owners of the datacenter have made a conscious effort to not use TCP. The highest-traffic applications, usually databases and storage systems, pretty much all use TCP unless you are buying a purpose-built HPC scale-out storage system (like a Lustre cluster). Most people who build a datacenter today use databases or object stores for storage, not Lustre or dedicated fiber channel SANs. On top of that, pub/sub systems all use TCP today, logging tends to be TCP, etc.


Fibre channel is dead, long live fibre channel.

I agree a lot of things are on TCP, but I don't think its a massive problem, unless you are running close to the limit of your core network. And one solution to that is to upgrade your core network....

Failing that, implementing some load balancing/partitioning systems to make sure data-processing affinity is best matched. This is the better solution, because it yields other advantages as well. But it's not the easiest, unless you have a good scheduler.


I will also add that one of the big problems with TCP is that it is impossible to load balance without knowledge of the L4 protocol. You can't load balance a byte stream. That means writing your own load balancer unless you want to also accept http overhead.


What is supposed to be a good scheduler?

Genuine question as I (as a software dev) have no clue how modern DCs are built in detail.


Its very much down to your workload and how you want it to work.

Short answer: k8s/fargate/ECS/batch will do what most people want. Personally I'd steer clear of k8s until you 100% need that overhead. Managed services are ok.

Long answer:

K8s has a whole bunch of scheduling algorithms but it's a jack of all trades, and only really deals with very shallow dependencies (there are plugins but I've not used them). For < 100 machines it works well enough (by machine I mean either a big VM or physical machine). Once you get more than that, K8s is a bit of a liability: it's chatty, slow and noisy. It also is very opinionated. And let us not start on the networking.

Like most things there are tradeoffs. Do you want to prioritise resiliency/uptime over efficiency? Do you want to have batch processing? Do you want it to manage complex dependencies (i.e. service x needs 15 other services to run first, which then need 40 other services)? Are you running on unreliable infra (i.e. spot instances to save money)? Do you need to partition services based on security? Are you running over many datacentres and need the concept of affinity?

More detail:

The scheduler/dispatcher is almost always designed to run the specific types of workload that you as a company run. The caveat being that this only applies if you are running multiple datacentres/regions with thousands of machines. Google have a need to run both realtime and batch processing. But as they have millions of servers, making sure that all machines are running at 100% utilisation (or as close to it as practical) is worth hundreds of millions. It's the same with Facebook.

Netflix I guess has a slightly different setup as they are network IO orientated, so they are all about data affinity, and cache modelling. For them making sure that the edge serves as much as possible is a real cost saving, as bulk bandwidth transfer is expensive. The rest is all machine learning and transcoding I suspect (but that's a guess)


Thanks for the reply!

Not really an answer to the scheduler question, but at least it mirrors some of my experience.

That K8s is something to avoid, and that it does not scale, is a known (at least to me).

But that doesn't answer what people would put on the metal when building DCs…

I was not asking out of the perspective of an end-user. I was asking about (large) DC scale infra. (As dev I know the end-user stuff).

As I see it: You can build your own stuff from scratch (which is not realistic in most cases I guess), or you can use OpenStack or Mesos. There are no more alternatives at the moment I think, and it's unlikely that someone comes up with something new. OTOH that's OK. A lot of people will never need to build their own DC(s). For smaller setups (say one or two racks) there are more options of course. (You could run for example Proxmox and maybe something on top).


Sorry yeah, I didn't really answer your question.

Here is a non-exhaustive list of schedulers for differing use cases:

https://slurm.schedmd.com/documentation.html << mostly batch, not entirely convinced it actually scales that well

https://www.altair.com/grid-engine << originally it was sun's grid engine. There are a number of offshoots tuned for different scenarios.

https://rmanwiki.pixar.com/display/TRA/Tractor+2 << thats for the VFX crowd

https://www.opencue.io/ << open source which is vaguely related to the above

https://abelay.github.io/6828seminar/papers/hazelwood:ml.pdf << Facebook's version of airflow. It sits on top of two schedulers, so it's not really a good fit. I can't find what they publicly call their version of the Borg

I'm assuming you've read about borg

As you've pointed out mesos is there as well.


Cool! Thanks! That's a lot of stuff I didn't hear about until now.

It's really nice that one can meet experts here on HN and get free valuable answers from them. Thank you.


You're missing the fact that Stanford is the farm team for Google and Google is hyperscale. At scale, your "just spend more money" solutions are in fact more expensive than creating a new protocol. And like k8s, the new protocol can be sold to startups so they can "be like Google".


You're missing the point that maybe, just maybe, I'm part of a team that looks after >5 million servers.

You might also divine that while TCP can be a problem, a bigger problem is data affinity. Shuttling data from a next door rack costs less than one that's in the next door hall, and significantly less than the datacentre over. With each internal hop, the risk of congestion increases.

You might also divine that changing everything from TCP to a new, untested protocol across all services, with all that associated engineering effort, plus translation latency, might not be worth it. Especially as now all your observability and protocol routing tools don't work.

quick maths: a faster top of rack switch is possibly the same cost as 5 days engineering wage for a mid level google employee. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol, and have it stable and observable?

As a side note "oh but they are google" is not a selling point. Google has google problems half of which are things related to their performance/promotion system which penalises incremental changes in favour of $NEW_THING. HTTP2.0 was also a largely google effort designed to tackle latency over lossy network connections. which it fundamentally didn't do because a whole bunch of people didn't understand how TCP worked and were shocked to find out that mobile performance was shit.


> a bigger problem is data affinity

For future, please write about how typical cloud customers can design for better data affinity.

Or is it just handled by the provider?

FWIW, at a prev gig, knowing nothing about nothing, I finally persuaded our team to colocate a Redis process on each of our EC2 instances (alongside the http servers). Quick & dirty solution to meet our PHB's silly P99 requirements (for a bog standard ecommerce site).

Apologies for belated, noob question.


> quick maths: a faster top of rack switch is possibly the same cost as 5 days engineering wage for a mid level google employee. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol, and have it stable and observable?

So your 5M machines / 40 in the best case of all 1U boxes is 125K TOR-switch-SWE-week-equivalents / 52 weeks in a year which comes to 2K SWE-years to invest in new protocols, observability, and testing. Google got to the scale they are by explicitly spending on SWE-hours instead of Cisco.


> explicitly spending on SWE-hours instead of Cisco.

I strongly doubt that TOR switches are cisco


but to answer your further case. The point is you don't need to replace all the TOR switches. Only the ones that deal with high network IO.

to change protocol you need gateways/loadbalancers either at the edge of the DC just after the public end points, or in the "high speed" areas that are running high network IO. For that to work, you'll need to show it's worth the engineering effort/maintenance/latency.


Google does not use K8s internally.

They never did, they won't ever do that!

K8s does not scale. Especially not to "Google scale".

First step to "be like Google" would be to ditch all that (docker-like) "container" madness and just compile static binaries. Then use something like Mesos to distribute workloads. Build literally everything as custom made on purpose solutions, and avoid mostly anything off the shelf.

"Being like Google" means not using any third party cloud stuff, but build your own in-house.

But this advice wouldn't sell GCP accounts. So Google does not tell you that. They tell you instead some marketing balderdash about "how to be like Google".


AWS is True HyperScale. Even more so than Google. And yet their spend-more-money-on-hardware solution seems to work fine.


Do we know for a fact that AWS does or doesn't use TCP on their backend? https://news.ycombinator.com/item?id=33402364 leads me to believe Google doesn't.


God, I love it when the talk turns hyper-technical around here, and the Jedi masters turn up.


The paper explicitly addresses Infiniband.


Not really. They conflate InfiniBand with RoCE, which, given that they have different congestion-control semantics, I'd say is a bit of a whoopsie.

If they are using RoCE, are they using DCB to avoid loss (well, make it "lossless")? The paper implies otherwise.


For those who don’t know, RoCE is somewhat of a failure in the marketplace right now.


For those who don't know, RoCE = RDMA over Converged Ethernet.

* https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

> RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by the hardware which enables lower latency, higher throughput, and better performance compared to software-based protocols.

* https://docs.nvidia.com/networking/pages/viewpage.action?pag...


IB does not run TCP/IP by default. You can either run TCP over IB, which has a performance penalty, or you can run it directly in Ethernet mode, which is something completely different.


> The biggest issue is that its not the network that's constraining throughput, it either processing

To be fair the paper talks a bit about how TCP makes multithreading slower compared to a message based system.


> Storage, which is high throughput, low latency and has rigid flow control, doesn't [well ignore NFS and iscisi] use TCP)

So storage doesn't use TCP, except for the protocols that are actually used, which do use TCP?


Depends on what you are using. For connecting block stores, you'll use some sort of fabric: Fibre Channel, SAS, NVMe over something or other.

If you are using GPFS, then you can do stuff over IB, but I don't know how that works. Lustre I imagine does lustre things over RDMA.

For everything else, NFS all the things. pNFS means that you can just throw servers at the problem and let the network figure it out.

But again, if IO speed is critical, you move IO over to a dedicated fabric of some sort. For most things NFS is good enough. (Except databases; it's possible but not great. But then, depending on your Docker setup, you might be kneecapping your performance because overlayfs is causing IO amplification.)


> The data model for TCP is a stream of bytes. However, this is not the right data model for most datacenter applications. Datacenter applications typically exchange discrete messages to implement remote procedure calls

This isn't just a datacenter problem. Every single network protocol I've ever created or implemented is message based, not stream based. Every messaging system. Every video game. Every RPC transport.

But, because we can't have nice things, message framing has to be re-implemented on top of TCP in a different, custom way by every single protocol. I've basically got message framing-over-TCP in muscle memory at this point, in each of the variants you commonly see.
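For anyone who hasn't written it a dozen times: the version in my muscle memory is some variant of a length prefix, roughly this sketch in Python (the function names are mine):

    import struct

    def send_msg(sock, payload: bytes) -> None:
        # 4-byte big-endian length prefix, then the message body.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exact(sock, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def recv_msg(sock) -> bytes:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)

The other common variants are delimiter-based framing (newline/CRLF) and fixed-size headers with a type field; every protocol ends up picking one.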

The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

In my opinion, the real problem is the idea that "everything is a file". Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.


SCTP [1] is there to provide a reliable message based protocol. And it does work inside a datacenter. The issue is outside the datacenter: it doesn't work reliably across the Internet due to middle boxes dumping anything not TCP or UDP...

But inside a controlled environment like a datacenter, it works. It's been used in the telecommunication world to carry control messages in the radio access and core networks for example. So it's been tested at scale for critical applications.

[1] https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...


> The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

No, HTTP/2 and QUIC do not change the semantics of HTTP.

Also, you can have endless streams with HTTP/1.1: just use chunked encoding for POST/PUT request bodies, Range: bytes=0- plus chunked encoding for GET responses, and chunked encoding for POST response bodies. In HTTP/2 there's only the equivalent of chunked encoding -- there's no definite content length in HTTP/2.
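To make the framing point concrete, a chunked HTTP/1.1 body is literally a sequence of length-prefixed messages on the wire. A rough sketch (not a full HTTP implementation):

    def encode_chunk(data: bytes) -> bytes:
        # "<hex length>\r\n<data>\r\n"; a zero-length chunk terminates the body.
        return b"%X\r\n" % len(data) + data + b"\r\n"

    body = encode_chunk(b"hello ") + encode_chunk(b"world") + b"0\r\n\r\n"
    # b'6\r\nhello \r\n5\r\nworld\r\n0\r\n\r\n'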


Correct me if I’m wrong, but doesn’t h2 still break up requests and responses into smaller message frames in order to do multiplexing?

Those message frames are what I’m talking about - as I understand it, they are, yet again, a message oriented protocol layered on top of tcp.


You will always need _some_ framing if you're dealing with bulk data.

If it's not bulk, then you don't need framing if everything can fit into one datagram / whatever transfer unit provided by the transport, but the transport itself will need some framing, especially if it will need to support any kind of fragmentation.

If it's not bulk and it doesn't fit in a datagram / whatever transfer unit provided by the transport and the transport doesn't do fragmentation, then you have to do framing yourself, and then you have a sequencing problem, and so on, and you quickly re-invent parts of TCP but at the application layer.

Basically, it seems inescapable that the Internet is based on packets, and that packets are limited in size, and so application protocols have to be smeared onto packets.

Things are only ever trivial when you're doing request/response protocols with always- or mostly-small requests and responses. The moment you need anything that doesn't fit in the path MTU minus overhead, you need framing.

So I don't think that an octet stream abstraction is quaint and obsolete.


Yes, and even HTTP/1.1 does that. I forget if telnet does something similar, but I suspect it must because it can send control data. The FTP protocol and the BSD r-commands might be the only ones that truly do no additional framing (FTP for data connections, the r-commands post-login).


Once chunked encoding is in the picture, even HTTP/1.1 sends messages, not streams, under the hood.


Correct.


I don't think so, or we disagree on the meaning of words. Chunking is not message-oriented in the same sense that UDP is.

With chunking, you basically just insert markers into the stream; this does not imply by any means that the stream has been split into multiple messages - as a matter of fact this is taken care of by the lower levels of the client/server, and the middle/higher levels certainly don't want to deal with it. It is only a perverted solution to the problem of dynamically generated "messages" (mainly HTML pages), which has been further perverted to implement gruesomely message-oriented "protocols" (Comet and others, IIRC).

UDP, on the other hand is based on datagrams. They can be split into smaller packets on the wire but they are reassembled at the network stack level so no program can even see it happened unless they insist on it.

Websocket is much closer to a message oriented protocol over a streaming pipe than HTTP chunking is.


> I don't think so, or we disagree on the meaning of words.

Well, when there's four people in the conversation, that happens.

u/josephg's complaint at the top of this thread is that people use TCP but still have to add framing in their application protocols. u/josephg said something to the effect of how few protocols do no framing and mentioned HTTP/1, but even HTTP/1.1 w/ chunked transfer encoding adds framing, and even HTTP/1.0 w/ definite content-length also has framing (CRLFs) for the request headers themselves, and effectively frames bodies with CRLF at the start and EOF at the end.

Chunks are definitely not datagrams, just as TLS records aren't either, and just as TCP segments aren't either. But they have framing, which u/josephg complained about.

Framing of some sort is unavoidable. My point, besides the inevitability of application-layer framing, is that a datagram- or message-oriented transport won't make things trivial for apps anymore than TCP did.


> In my opinion, the real problem is the idea that "everything is a file".

Files are just an indexed list of bytes that can represent anything. Think of them as objects.

> Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.

I don't understand this. A stream of messages is a stream of bytes. It's bytes all the way down.


Because most protocols need to handle message loss, with retransmission and proper ordering? And we haven't started to talk about congestion yet… TCP is useful, and while I'd like to see a message-based protocol get rid of one (or more) of those constraints and go with a custom design, I feel like they'd be re-implementing the features in the end, because these are very useful properties to have…

Edit: the proposal in the article is actually quite sensible, but requires redesigning your apps… And I'd like to see how it performs: TCP is a hugely optimized beast (when it works well), with hardware offloads, kernel optimizations, etc.


[flagged]


Very peculiar spam. Does anyone have a theory what the motive is?


Modern equivalent of a Numbers Station https://en.wikipedia.org/wiki/Numbers_station


Even TCP itself uses discrete messages under the hood :-)


Hah yes; although TCP segments can be arbitrarily refragmented and rejoined as they travel through the network before reaching your destination.

If you ever play an indie game which seems unusually janky over wifi, it's probably because the code isn't correctly rejoining fragmented network packets at the application level. Ethernet is remarkably well behaved in this regard. Wifi is much better at shaking out buggy code.


Finding the right abstraction isn't easy in a network stack. TCP's is less useful for application logic, but it reflects the way the protocol works internally: a bunch of bytes goes in on one side, and bytes stream out on the other side in chunks whose size depends on congestion, the physical layer and other factors. A message-based API either hides these facts or leaks them, depending on how you look at it.
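You can see that tension with a loopback stream socket pair (same stream semantics as TCP): two sends on one side will typically come out as a single read on the other, and the boundaries are gone.

    import socket

    a, b = socket.socketpair()     # a connected stream pair, like a loopback TCP connection
    a.sendall(b"first message|")
    a.sendall(b"second message|")
    print(b.recv(4096))            # typically b'first message|second message|' in one read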


What’s the issue with fitting messages in streams?


One of the issues I can think of is head-of-line blocking [0].

If you're sending messages of different priorities over the same channel and a low-priority message is lost, high-priority messages behind it will have to wait until the low-priority message is properly re-transmitted.

[0] https://en.wikipedia.org/wiki/Head-of-line_blocking


Yes, but this need happen only when your data center network is congested, which is hopefully rare and is relatively cheap to fix by adding capacity. And in congestion cases you need TCP's back off ability.

Head-of-line blocking is also what keeps messages in order. Making messages (sometimes) arrive out of order would drastically increase implementation complexity for a lot of apps.


I don't bother implementing that crap anymore. I just use websockets, which are message oriented out of the box.


What do you build that requires implementing or designing network protocols?


Multiplayer video games, realtime updating webpages, database bindings, sync protocols (I play around with CRDTs a lot), p2p distributed systems, whatever really!

Implementing a wire protocol seems to come up about once every couple of years. And for context, I've been programming now for about 30 years.


A few years ago (when QUIC was coming out) I was developing the theory of a new transport/encryption/authentication protocol. The focus was as much on transport as on the built-in federated authentication.

There was not much interest in the field and I had a lot of the theory and formal proofs, but no implementation.

This month I found a cofounder and we are reordering a lot of the information and presentation, we should start asking for funds in more or less a month.

I still believe my solution to be much more complete than anything in use today (again: on paper), but since there seems to be some interest today, I'll ask here: can anyone suggest some seed funds to check out for a startup? We will be based half US, half EU.

For more details, fenrirproject.org (again: old stuff there, ignore the broken code)


Very interesting stuff, but not going to lie, I have trouble imagining how you'd build a viable company around this kind of thing.

Infrastructure/protocol companies are always going to be a very tough sell (Sandstorm) unless there's a compelling freemium model like with GitLab, Cloudbees, Sentry, etc.


The infrastructure/protocol will need to remain open, since this kind of thing works better the bigger the user base is. It will probably spin off into its own foundation as soon as it is viable.

The income will come from another project built directly on this, on managing the domain and its users/devices, plus other stuff, mainly for businesses.

I don't see much need to go into details right now, but we have a clear distinction in mind between what is infrastructure and what will be the product.

Again, still in the housekeeping phase, just looking for potential future funds once we finish this phase


Why don’t you actually build gasp a prototype before asking for money


Yeah, thank you for the kind comments about not needing money (aka: my time has no value) and for asking me, with irony, to build the prototype.

As I said, the project was started a few years back, and since I did not have the time to work on it, maybe it means my life does not give me the time and money to build this on the side.

But I'll always find it funny how half of the people go "you need to have solid theory proofs before" and the other half goes "where is the working code".

As I said, we just started some housekeeping and are not ready to start, and as many point out, it's hard to make money on infrastructure. I know; I did not ask how to make money on this. The idea is to keep the base as open as possible and make money on other services built on this. I was only asking for pointers to funds interested in tech loosely connected to this. And if they don't like our current state or something else, fine, no need for you to do their job, and in a witty way, too.


> yeah, thank you for the kind comments about not needing money (aka: my time as no value) and asking me to build the prototype with irony

The world (read: the investment community) doesn’t care about your time though. If you are going to pitch an infrastructure project that should work in software without a working demo, you’re going to have to take a huge haircut on valuation.

If you want to be successful with a theory and proofs, join academia and publish them. If you want to get investors for infrastructure, make something that works.

Academia churns out “protocols on paper” every year that go absolutely nowhere. You need to differentiate yourself if you’re looking for more than a research grant.


Doubtful anyone is going to throw money at you for some random idea, that you won’t even invest some of your free time on. You clearly don’t value it, so no one else will.


Depending on the scope, it's not always possible to self-fund while starting a project.


I'm just going to mention that NATS can be used as a general purpose transport, with encryption and a surprisingly capable authentication and authorization system. It also supports federating into clusters and superclusters. NATS has also come a long way in the last several years, in case anyone is thinking of some experiences they had years ago when it didn't have all these features.

The question would have to be "what does your idea/project offer that NATS doesn't already offer?"

I have no affiliation with NATS, but I wish that people were paying more attention to it. It solves a lot of problems people have.


the thing you want to make requires ZERO money.


"We hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of datacenter networks".

Flow-consistent routing is the constraint that packets for a given TCP 4-tuple get routed through the same network path, rather than balanced across all viable paths; locking a flow to a particular path makes it unlikely that segments will be received out of order at the destination, which TCP handles poorly.
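Conceptually the path selection is just a hash of the flow identifier, something like this sketch (real ECMP hashing happens in the switch ASIC, usually on the 5-tuple):

    import zlib

    def pick_uplink(src_ip, src_port, dst_ip, dst_port, num_uplinks=8):
        # Every packet of a given flow hashes to the same uplink, so two
        # elephant flows can land on (and congest) the same link while
        # other links sit idle.
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % num_uplinks

    pick_uplink("10.0.0.1", 40123, "10.0.1.9", 443)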


Or, by sending the traffic over all routes, there is no way to keep one server from monopolizing all traffic, because each route is oblivious to the stress currently being experienced by all its peers. It has to set a policy using local data, not global data.

The usual failure mode for clever people thinking about software is taking their third-person omniscient view of the system status and thinking they can write software that replicates what a human would do in that situation. We are still so very far from human-level intuition and reasoning.


Ultimately one server cannot inject more than one link worth of traffic (e.g. 100 Gbps) into the network which is a tiny fraction of total capacity. Researchers have gotten really good results with "spray and pray" for sub-RTT flows combined with latency and queue depth feedback for multi-RTT flows.


Spray and pray sounds like a reasonable fit for UDP, no?

We've had these sorts of bottlenecks before, and they didn't last. It's always possible something fundamental changed, but it's also possible that we are doing something wrong at the motherboard or OS level, and adopting new solutions puts us right back in that space where a couple of servers can easily saturate a network.

If a network card can move data as fast or faster than the main memory bus on a computer then what are we even doing? Should we be treating each subsystem as a special purpose computer and turn the bus into a network switch?


You just described the motivation behind infiniband (and RDMA in general)


The network is the computer™


Well, I mean yeah, that silly slogan is definitely rattling around in my head.


And we could totally construct systems that feed some approximation of global internet state into local routing decisions. But that might devalue some incumbent player's position in the market (or create a new privileged set of players), so even if we made a POC, it wouldn't get adopted.


This is true, and the congestion mentioned here was subtle and not called out - typically flows are handled in a stateless manner by load balancers that hash on some set of MAC/IP/PORT features of the packet. This is where congestion occurs and the paper mentions it here:

    All that is needed for congestion is for two large flows
    to hash to the same intermediate link; this hot spot will persist 
    for the life of the flows and cause delays for any other
    messages that also pass over the affected link.
It makes logical sense, but I'd love to see the evidence for this.


“Elephant” flows are definitely a thing.

It all depends on the application and the overall use of the network.

With sufficient flows and a mix of sizes it'll still tend to even out. But if you've got significant high-throughput, long-lived flows this is definitely something you might hit.


I wrote a summary of one of the approaches for replacing TCP mentioned in the paper (called Homa) here: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...


Jumping to the end:

> TCP is the wrong protocol for datacenter computing.

> Every aspect of TCP’s design is wrong: there is no part worth keeping.

I cannot disagree and Ousterhout argues well.

> Homa offers an alternative that appears to solve all of TCP’s problems.

I'm well behind the curve on protocols and now I have something to learn more about.

> The best way to bring Homa into widespread usage is integrate it with the RPC frameworks that underly most large-scale datacenter applications.

More or less the case for whatever replaces TCP in a tight computing warehouse setup.


> Every aspect of TCP’s design is wrong

The driver of most of a global network of computers, one that has been wildly successful beyond the dreams anyone had before it was real, probably deserves a better deal than “every aspect is wrong”. It has worked fantastically well, and chasing the long tail of performance improvements isn't equivalent to determining that what has gotten us here is wrong.


You're cherry-picking an interpretation of a single sentence, when it should be read in the context of the preceding one: Ousterhout says every aspect of TCP's design is wrong for (modern) datacenter computing. He's not saying bad decisions were made at the time it was designed, nor even that it's badly designed for other use cases today.

The first few paragraphs of the article give even more context:

> The TCP transport protocol has proven to be phenomenally successful and adaptable. [...] It is an extraordinary engineering achievement to have designed a mechanism that could survive such radical changes in underlying technology.

> However, datacenter computing creates unprecedented challenges for TCP. [...] The datacenter environment, with millions of cores in close proximity and individual applications harnessing thousands of machines that interact on microsecond timescales, could not have been envisioned by the designers of TCP, and TCP does not perform well in this environment


My Distributed Computing professor said, “now we are going to discuss why Ethernet is a terrible protocol but we use it anyway.”

Like democracy, everything else we’ve tried is even worse.


"Specifically, Homa aims to replace TCP, which was designed in the era before modern data center environments existed. Consequently, TCP doesn’t take into account the unique properties of data center networks (like high-speed, high-reliability, and low-latency). Furthermore, the nature of RPC traffic is different - RPC communication in a data center often involve enormous amounts of small messages and communication between many different machines."[0]

0: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...


Everything about the protocol being wrong for the specific case of machines directly wired to one another over a high-speed, reliable network is not an admonishment of the protocol in general. And the protocol, being an abstract concept, doesn't have feelings to hurt.


"Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks."

I was about to rant that Prof. Ousterhout should just deploy some of his students and get that transport - RPC integration done and prove out his point. But, then I tried to look for it first and found this:

https://www.usenix.org/system/files/atc21-ousterhout.pdf

Has anybody tried it in an actual data-center?


> … it should be possible to bring it into widespread usage by integrating it with RPC frameworks.

That's a powerful word, “should”. Much of the software in the datacenter is almost as old as TCP itself, in whole or part. Difficult but working, it will continue to linger unless something more than six letters of aspiration is applied to reimagining and rebuilding that considerable bulk.


It's theoretically much easier to introduce a new transport inside of a DC, since you're inside the network perimeter and you'll generally have control over policy-based filtering decisions.


The context here is TFA advocating for use of higher-level message-oriented frameworks instead of raw socket APIs so that they can use a non-TCP transport without changing the application code.


Within a DC/cloud provider network. This isn’t about protocols you’d see traversing the public internet until the cloud providers see value in a protocol and start to push it out through IETF (eg see QUIC which was done by a company that owned both the browser and the data center). If there’s a “small” SW improvement that lets you use your HW 10x more efficiently that’s totally worth it given the end of scaling. You either invest in SW or pay for custom ASIC development. You’re not getting a free lunch anymore by just waiting a few years and getting that 10x gain for “free”.


Well, the key would be to develop and deploy Homa in a DC and test an implementation at scale. If it actually ameliorates the perceived shortcomings of TCP that make nothing in TCP worth keeping, as this author says, then cool. My only complaint with issues like this is the cost of implementation. Someone has to pay to build a DC around it, or increase the cost of maintenance for several years to a decade while supporting two completely different and incompatible networks.


The Homa protocol can be deployed on the basis of existing switches (not all of them, but some of them will play well). See also https://github.com/PlatformLab/HomaModule .

In addition, a closed environment and the popularity of SDN in large data centers are significant cost-reducing factors compared to a typical IPv6 deployment.


Dumb question, there's no way to talk to a PC over the Internet with Homa, right? Since our home ISPs + routers are all only doing UDP/TCP over IPv4/IPv6? Homa is mainly for "LAN"?


Yes.


Will we ever see the day where computers over the Internet talk to each other over something other than UDP/TCP over IPv4/IPv6?


Given Google and Apple's efforts to de-ossify the Internet, maybe. But no time soon.


Hmm I didn’t really read about it.

The proposal here is to replace IP as well as TCP?

Good luck pulling that off.


We have been testing out various protocols to overcome (in our case) TCP head-of-line blocking, using these protocols:

SRT: https://github.com/Haivision/srt (C++ wrapper https://github.com/andersc/cppSRTWrapper)

RIST: https://code.videolan.org/rist/rist-cpp

KCP: https://github.com/Unit-X/kcp-cpp

We wrap all data in a common container format https://github.com/agilecontent/efp to decouple the data from the transport.

Yes, the above solutions are media-centric, but they can be used for almost any arbitrary data.

The protocols are not 'fair', so starvation may happen and must be handled at the application level.

/A


For those unfamiliar with the author.

https://en.wikipedia.org/wiki/John_Ousterhout

He is probably most famous for having created the Tcl language and Tk GUI library. He also worked on the Sprite distributed operating system, the Magic VLSI design tool, and a bunch of other things.


Also, more recently, one of the inventors of the Raft consensus algorithm.


Hmm, haven't read the paper yet, but I immediately did "Ctrl+F sctp" and didn't find anything.

I know that SCTP was the next-generation stream-oriented protocol designed to fix the out-of-order message problem in communications, as well as a whole bunch of connection issues (4-way handshake instead of 3-way for better open/close; a datagram-oriented, in-order stream, so that every packet has a proper size involved; etc. etc.)

As far as I know, SCTP should solve all the requirements in section 2 of this paper (except "load balancing", which might be solved by lower-level protocol sharing of some kind?). So it's weird to not see SCTP discussed.

--------

Yeah, SCTP ain't popular, but these exact sets of problems/requirements and issues with TCP have been known for decades. SCTP is also a decades-old protocol (though not as old as TCP), and is the most obvious solution to the problem (and already supported by Linux).
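If anyone wants to kick the tires: on Linux the kernel SCTP stack sits behind the normal sockets API, so a one-to-one style SCTP server looks almost exactly like a TCP one. A rough sketch (assumes the sctp kernel module is loaded and your Python build exposes IPPROTO_SCTP):

    import socket

    # One-to-one style SCTP socket; SOCK_SEQPACKET + IPPROTO_SCTP would
    # give the one-to-many, message-oriented style instead.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    srv.bind(("0.0.0.0", 9999))
    srv.listen(16)
    conn, addr = srv.accept()
    data = conn.recv(65536)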


23 years ago I sat in a meeting with Sun, Intel, Mellanox, and 1 or 2 others. In that meeting we discussed putting an RDMA interface on individual hard drives, trays of RAM, CPUs, and other more exotic devices (like battery backed RAM, no conventional SSD in those days of course). You’d install RAM 1 42U rack at a time, disks likewise, CPUs in another rack and so on. All partitioned, controlled, managed, and of course billed-for by a “data center OS”.


Disaggregation costs an absolute fortune. The network is 10% of the total datacenter cost and the network does not carry memory and PCIe traffic. If you make the network 10x faster to carry that, now it's 90% of the cost.


I was in the room for similar discussions as part of the OpenCompute project, which was around remote IO and resource disaggregation. There are systems like this, and they may make sense for certain use cases, but generally speaking hyperscalers today are built around virtualization which doesn't align well to this model.


The network is the computer....


Out of order delivery is fine in TCP within the window. It might be inefficient but it's not impossible; reassembly could be moved to userspace if a userspace TCP stack were used.

I have no problem with alternates to TCP in the DC with a crossbar fabric and far less loss, seems sensible.

I wonder how it would play with QUIC and the session like behaviours now emerging.


> • In-order packet delivery

This is a bit disingenuous, since it's not the wire protocol but the kernel API that maintains the in-order abstraction. With jumbo packets you can still push a mountain of data without tripping up on “in order packet delivery”.

As developers we like this in-order delivery to userspace because it vastly simplifies the code. We make up for the inefficiencies by processing dozens to hundreds of streams in parallel. We aren't going to give that up just because the wire protocol changes.


Is it not considered a protocol violation to deliver out of order segments to the upper layer? That seems the same to me as abusing it to not require retransmits either.

Remember, middle boxes can fully adhere to the TCP standards and terminate your TCP connection and enforce ordering. If you notice that, you’re not really following the protocol, you’re just using its header format.


Yes, but.

My read of the room is that he's conflating wire level and kernel level problems with userspace problems, which is a no-no because if Berkeley userspace has latency problems, we can deal with that separately from undoing 40 years of tribal knowledge in the process.

In the video he says that he was seeing 3x of theoretical latency to userspace that he fixed with Homa, but similar efforts to fix Berkeley Sockets saw 'almost a 2x' improvement which he deemed insufficient. A question I'd like to see answered over the next couple years is what IO APIs will be the most efficient in a world where io_uring is everywhere.


There was a time when out-of-order packets triggered congestion handling in TCP stacks, which drastically reduces performance. This is where the concern comes from. I think it's a bit outdated though; I think the newer schedulers ignore out-of-order delivery.

I've also seen problems on some embedded stacks, but that could easily be argued that the implementation is wrong. But I've seen things like credit card terminals break due to packet reordering.


I can't fault Ousterhout for writing in support of a new(ish) idea, but his language here went to "forbidden" when in fact it's just "strongly disliked".

TCP the protocol knows how to re-assemble out of order. What I think he's doing is making reassembly a higher-layer task, outside of the protocol, or else providing some mechanism in user process space, amenable to threading.

I can believe an async model of "tell me when this is complete" would work well with a bitmap/bloom filter type gate on what "has to be complete" to proceed.

I like his writing. I was a fan of tcl/tk and used expect heavily back in the past.


IBM AIX's TCP can either do selective ACKs or handle out-of-order packets (not both), which we were told when a firewall started reordering packets. It simply dropped out-of-order packets and therefore triggered congestion handling.


The assumptions here seem to be no speed of light lag, no packet loss, and no security. The only problem is congestion. That's more like the interconnection fabric of a single-purpose supercomputer than a general-purpose data center. Which is probably why they mention Infiniband, a hardware interconnect for supercomputers, so much.

Would this break down if you had to start talking to a remote machine in another data center? That's how outages and overloads are handled, after all.

This is an interesting idea, but it's for a relatively narrow use case.


It’s a narrow use case sure, but at the same time it’s an important one to a number of companies with significant engineering resources.


There's a little misleading point on page 2, or just a mistake. It says "Driving a 100 Gbps network at 80% utilization in both directions consumes 10–20 cores just in the networking stack" and it cites Google's Snap paper from 2019. But the Snap paper quite clearly says nothing like that. It says that Snap can drive a 100 Gbps NIC to 80 Gbps with just 1.05 cores (Table 1) and that the whole-machine CPU load at 80 Gbps in an RPC benchmark is 4 CPUs per side (Figure 6(b)).

Aside from that I completely agree that TCP is trash.


Yes that stat stood out for me too and I was wondering how to actually test this without breaking anything in the process.


> Yes that stat stood out for me too and I was wondering how to actually test this without breaking anything in the process.

DPDK has been doing just that for quite some time now. Perhaps you can try that and see?


Thanks, I will look into it, but at first glance their testing still seems like it's under lab conditions. Thanks for the tip!


At a glance, this sounds very similar to AWS' SRD protocol [1] - I'm curious how they compare but I see no mention of SRD in this paper.

[1] https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c...


Strange how this is so far down the list but a more recent "nobody in the industry wants to invest in making things better" rant is higher upvoted.

Clearly, AWS knows a thing or two about the datacenter industry, wanted to invest in making things better, and DID invest in making things better - and even published a paper on it.


Sounds very familiar to a lot of ideas that have been proposed as improvements to TCP over the years.


I am not really sure what Ousterhout means when talking "datacenter" but one of the key aspects of TCP networked applications is that it really doesn't matter if they are located in a datacenter, at the edge, in a mobile device, in a Raspberry at home or in your car and, better, they can be more or less moved from one hosting to another. Will this mean that application developers need to work with different network stacks at the same time?


Yes, you would have multiple protocols at some points. If you look at microservice architectures they already have an internal service mesh and external API gateways.


I believe something missed in these discussions is the part pertaining to "the data-center". A data-center is not a technology, it is a grouping of assets. Those assets and their associated services need to communicate not just with each other, but with other assets and services on the internet.

Regardless of what incredible technical solution one creates, it will have to allow for simultaneous existence of current IP protocols in parallel with whatever proposed replacements to exist seamlessly with one another or significant adoption would never occur. A data-center is not an isolated bubble, at least not any more unless one wants to translate said protocols through a single point of success gateway. Should such a replacement ever occur it will have to be done piece by piece until there is nothing left using current IP protocols.

So I believe people should all create their proposed protocols and give businesses a low-friction path to adoption one service at a time. As more applications adopt said protocols, the most popular, most reliable, most performant, least-friction path will likely win, and if successful then at some distant point in the future perhaps most existing IP protocols could be deprecated. As a reminder, each application will have to adopt libraries to speak this protocol and know how to utilize it. There will be a "battle hardening" period to work out the bugs and security controls. All of the network gear in the entire path between data-centers and clients will need OS/firmware/ASIC updates to understand this protocol. Given the transition speed to IPv6 as an example, this could be a very long road.

There is also some discussion of QUIC and SPDY. Those are not new IP protocols; they are new standards within an existing L7 application protocol, HTTP, that still utilize existing L3/L4 protocols. Replacing TCP (and UDP?) in the data-center means a new protocol in /etc/protocols, not one encapsulated in an existing protocol such as protocol 6 (TCP) or protocol 17 (UDP). The network gear and OS on every device in the path will need to understand this new protocol. Tools such as tcpdump and libraries such as libpcap would need to be updated to understand these new protocols before they could even be used in a development environment.

Could it be that I misunderstood the intent and perhaps we just want yet another new L7 application protocol on top of UDP?


> Every significant element of TCP, from its stream orientation to its requirement of in-order packet delivery, is wrong for the datacenter. It is time to recognize that TCP’s problems are too fundamental and interrelated to be fixed;

This seems like a pretty bad way to start a paper. It throws an extremely strong assumption into the room without backing it up with data.

Having worked in the Cloud/Datacenter space for lots of years, I really have a hard time describing any situations where TCP limited the performance of applications and not anything else. It doesn't matter that much if a slightly different networking stack could lower RTT from 50us to 10us if the P99.9 latency of the overall stack is determined by garbage collection, a process being CPU starved, or being blocked on disk IO or another major fault. Those things can all be in the 100ms to 1s region, and are the real common sources of latency in distributed systems.

The main TCP latency problem that I've experienced over the years is SYN or SYN-ACK packets getting dropped due to overloaded links or CPU starvation, and the retry from the client only happening after 1s. Annoying, but one can work around it a bit by racing multiple connections (roughly as in the sketch below). Besides the TCP handshake time there's also another round trip for setting up a TLS connection - sure. But both of those latencies are in practice worked around with connection pooling.
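The racing workaround is roughly this kind of thing (a sketch only; the attempt count and timeout are made up):

    import socket
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def racing_connect(host, port, attempts=2, timeout=2.0):
        # Start a few handshakes in parallel and return the first one that
        # succeeds; a lost SYN/SYN-ACK (and its ~1s retransmit timer) is
        # hidden behind the other attempt. Losers are closed as they finish.
        pool = ThreadPoolExecutor(max_workers=attempts)
        futures = [pool.submit(socket.create_connection, (host, port), timeout)
                   for _ in range(attempts)]
        for fut in as_completed(futures):
            if fut.exception() is not None:
                continue                      # this attempt failed; wait for the others
            winner = fut.result()
            for other in futures:
                if other is not fut:
                    other.add_done_callback(
                        lambda f: f.result().close() if f.exception() is None else None)
            pool.shutdown(wait=False)
            return winner
        pool.shutdown(wait=False)
        raise ConnectionError("all connection attempts failed")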

Speaking of TLS - I can't find a single reference to it in the paper. And talking about datacenter networking without mentioning TLS seems to miss something. Pretty much every security team in a bigger company will push for TLS by default - even if the datacenter is deemed a trusted zone. How does it matter if the TCP connection state is 2kB and the Homa state is less, if my real TCP connection state is anyway at least another 32kB for TLS buffers, plus probably megabytes for send and receive buffers, plus whatever the application needs to buffer?

Last thing I would like to mention is that datacenter workloads are not "just messaging", and the boundary from messaging to streaming is pretty fluid. What happens if the RPC call fetches the content of a 10MB file? Is that still a message? If we treat it as such, it means that the stack needs to buffer the full message in memory at once, whereas with TCP (and e.g. HTTP on top) it can be streamed, with only the send buffer sizes being in memory. What about if it is 1MB? We could certainly argue that some applications just transfer a few bytes here and there, but I'm seriously not sure if I would label those as the majority of datacenter applications. And with the typical practice of placing a lot of metadata into each RPC call (> 5kB auth headers, logging data, etc.) even the smallest RPC calls are not that small anymore.


At first glance I thought he was calling for the replacement of Tcl in the datacenter!

That's not what I'd Expect from John Ousterhout. ;)

https://wiki.tcl-lang.org/page/Expect


Why not, he's had a Raft of good ideas.


Google hasn't used TCP in the datacenter for years. What they use I don't know. But it's even custom switches with custom chips.

My son did work in graduate school on a clean-slate network implementation for the datacenter. Maybe for Google, I don't remember.

One issue I remember they addressed was scheduling bandwidth for VM migration within their datacenter cloud. See, some customer reserves a 'machine' for their services but really they get something like a VM slice of a ginormous machine (multiple TB of memory, 100 cores or whatnot). Each customer gets some of that and thinks it's a machine of their own.

That customer slice shares the larger machine with maybe 10-100 other customers. Then somebody's slice starts to use more resources and has to be moved to a machine with more 'room'. That wants to be fast and seamless. It can be maybe 1TB of stuff. Their slice doesn't want to be interrupted for long. So this machine needs bandwidth that isn't subscribed for the migration. So does the target machine. So does the cloud network. Then all the addresses have to be re-homed.

Another issue: those competing slices need a virtual network adapter. They each think they own one (each is running a copy of linux or whatnot), but it has to be a physically shared and rationed device. All while using the TCP abstraction on a network-adapter abstraction on a driver abstraction, but really on their new network hardware that's actually present on the ginormous machine. This includes all the TCP features plus the bandwidth reservations the cloud needs etc.

So yes it's abundantly obvious that the datacenter needs (has) a new network.


> Google hasn't used TCP in the datacenter for years.

That's absolutely false. I don't have any sources except for having worked at Google from 2013-2022, but it's not like you quoted any sources either, so...

There's a reason why Google is still releasing stuff like TCP BBR (2017).


Well, the first link when googling 'google datacenter hardware' is Google's article on how they don't use standard TCP hardware or software in their datacenters. But I guess that was too much to ask...


There's a big difference between "hasn't used TCP in the datacenter for years" and "don't use standard TCP hardware or software in their datacenters". Google uses TCP with non-standard configuration, but they still use TCP.


I can't seem to find the same search result with this query. In fact, searching for "google" "standard TCP" doesn't seem to find any such article (only Google's 2011 publication on TCP Fast-Open deployment, ironically), so it's going to be hard to find what you're talking about.

If you link to the article in question (and relevant quotes) I'm happy to try and clarify your misunderstanding.


Another good reason for VM migrations is cooling. Apparently certain cloud providers save a lot of money on cooling when they migrate vms across devices.


TCP basically exists to deal with an unsolvable problem, the Two Generals problem.

It's easy to look at it and say it could be better because you have tunnel vision for your use case. However it's also easy to forget that the protocol literally has a billion edge cases.

Why do you think your NIC has a new driver update every couple of weeks even though they have been running the same protocols for the last 50 years?


Often these “revolutionary” changes get deconstructed and co-opted. Not always a bad thing. Looking through a video he did about this paper, I see a few things that could be popped out.

SRPT (shortest remaining processing time) is there any reason this couldn’t be implemented as a LAN protocol?

Receiver-driven congestion control: isn't this what the advertised receive window does? Are we not just talking about setting a more aggressive starting value?

“sends packets in any order” and “can only send the first few packets without a grant (ack)” are fighting each other. Especially if you’re building a message oriented protocol, which tends to have shorter conversations. This reads as confused or schizophrenic.

I think what's being said here is that the Berkeley socket protocol sucks, and that a different one can get from user space to response faster. Great. But do you have to change the wire protocol for that, or just introduce a better system call library? This part in particular reads a lot like, “what if Erlang was right and we implemented it at the kernel level?” Which is not a bad question to ask.


If only we had a stream control transport protocol and were allowed to use it.


You can use SCTP in a datacenter. Ousterhout et al. are surely aware of SCTP, so I assume Homa is better in some way.


Forgive my ignorance, but why isn't SCTP more frequently used in DCs? I know it misbehaves with home routers etc., but that shouldn't be a factor here.


I suspect there are a couple of contributors.

TCP is prevalent on the internet, so you need a fairly strong motivation and clear benefits to adopt a second protocol. A lot of engineers also don't get the underlying networking, so one of the successes of TCP is that it's a file descriptor that you either write to or read from and magic makes it come out the other side. I've seen tech leadership on networking-centric products know nothing more than that you read and write and magic makes the data appear on the other side. Even on implementations that use SCTP, I've seen products that only use a single stream and mark every message as requiring in-order delivery. So it was effectively what TCP offers, using the SCTP protocol.

At the time TCP was also far higher performance than SCTP. This wasn't so much a protocol thing, but because TCP was getting more engineering attention, it got a lot more scheduler optimization, kernel optimization, and hardware offload support. So in many ways I think TCP scaled better due to these optimizations, which work both on the internet and internally. And then for multi-path, most data centers didn't get truly isolated networks. So if I'm running a mixture of TCP and SCTP, I still need L2 failover everywhere, which means my multi-homed SCTP connection isn't actually path diverse. And then, where beneficial over the internet, there are a few success cases of using multipath TCP extensions.

SCTP is still used quite a bit in the telco networks, but due to the above, it was quite a waste of time.


How does your theory that the failure of SCTP is because a) people don’t understand networking and b) tcp eats up all the development oxygen explain QUIC?

I'm also not sure what you mean; DCs within a major cloud provider are, AFAIK, mostly running truly isolated networks interconnected directly with fiber.

If you haven’t yet, I would recommend reading the very original QUIC paper. It was extremely astute and showed quite a deep understanding of what the problems were with TCP done by network engineers who really knew their shit (I got to interact with some of them when I was at Google). They talk about the failures of SCTP on technical levels and non-technical headwinds that weren’t accounted for like ossification. To my knowledge QUIC is SCTP 2.0 - it provides much of the same features and in a way that could actually leave the lab.


> How does your theory that the failure of SCTP is because a) people don’t understand networking and b) tcp eats up all the development oxygen explain QUIC?

I think this is the motivation side of the argument. SCTP doesn't provide any advantage internally for most use cases, as I outlined my thoughts on the basis above. QUIC on the other hand is an attempt to solve a completely different set of problems, and is getting the engineering dollars to deploy because where latency and internet comes into play, there is a strong motivation to be faster. And it also becomes more of an upgrade path.

> I’m also not sure what you mean but DCs within a major cloud provider are majority AFAIK running truly isolated networks interconnected directly with fiber.

Sorry about being unclear, I typed that out pretty quickly. One of the main factors that drove telecom to create and adopt SCTP is the way telecoms like to interconnect with each other. For signaling traffic (messages like "I want to set up a new phone call"), the telcos like to set up multiple independent connections. So with SCTP, they want multi-path support, where each server advertises a list of IP addresses for the connection. So between two telcos, you have a dedicated non-internet network A, and a diverse network B. Equipment that communicates on these networks is then physically plugged into both networks. This creates a need for a protocol that understands this, so that when a failure occurs in transmitting on the A network, retransmission occurs on the B network. The idea is these are diverse networks; nothing can really interact with both at the same time (that's the theory, in practice there be stories).

Where this maps to data center networks: to my knowledge most data center networks are not designed as an A and B network for diversity, where you would have to use multipath TCP or SCTP. And if you want to use both together, you're going to design the network to support all the failovers and redundancy needed to deliver TCP.

So that's what I was trying to get at: the big adoption driver and the protocol complexity are in the multi-path support, which, to fully utilize, requires additional engineering effort in the data center.


Same reason Homa isn't used: software isn't written for it.

With SCTP there's also a significant performance impact because many drivers for the protocol are far from optimised, because very few applications use it, because of its performance implications, because very few programs use it, etc. etc.

There are also firewall issues: big firewall vendors just don't play nice with anything that's not a variant of HTTP(S). You still need some kind of firewall in a datacenter, and it'd be foolish to set up two different ones for internal and external networking. Protocol ossification is real, and if you use any external piece of firewall kit, you're sure to run into problems if you try to use "novel" protocols like SCTP. Hell, you'll be lucky to get good IPv6 support.

You can write your own access control if you want but that's often perceived as more expensive than buying a box, especially if the box companies find their way into a meeting with management.

Lastly, there's education. A shocking number of developers have no idea how networking works. They probably know there are protocols like UDP and TCP, but their role and inner workings are often glossed over, in my experience. Practical networking courses seem to treat the network as some kind of black box where bytes and IP addresses go in and response data comes out. If developers do know their basic networking, that information is often out of date; people don't seem to realise how often TCP gets tweaked to behave slightly differently to improve performance. Ask your average dev something about IPv6 and I doubt they'll know much more than "it's IPv4 with more bits", because networking simply doesn't come up that often.

In the end, it comes down to tradeoffs, experience, and decisions. Feel free to write SCTP code for your server products where you can; the protocol definitely solves many issues people run into with TCP, but you'll probably have to defend your use of something unfamiliar to many developers every step along the way. The same is true for protocols like QUIC (outside the HTTP(S) environment), which tries to solve a whole lot of layer 3 to layer 5 problems in a single protocol that's designed to play nicely with shitty middleboxes through its basis in UDP.


Software -- legacy software, which is always all software currently in use, which is an enormous code base.

It would be easier to have a drop-in replacement for TCP such that, whenever it can work, connect() will use it, and listen()/accept() will accept it as well as TCP. Then all apps that can use TCP could use the new transport transparently.

Basically, we need a TCP++ that works with existing APIs but which can also provide new functionality via new APIs.

Of course, backwards-compatibility is very limiting, which sucks.

We can also have new transports that have new APIs, but we need a better TCP for backwards compatibility because legacy is forever.

Also, the focus on RPC is cool because any protocol where you typically have a library doing the I/O (and not too many such libraries) is amenable to using the new thing, and that includes HTTP (which isn't an RPC). But TFA really needs to mention HTTP in the same breath as RPC, because (sadly) way too many readers will just close the tab as soon as they see "RPC" and not "HTTP".


Could we get the title replaced to match the document, rather than the cliched, bombastic "It's time to ..." formulation?

Ousterhout actually wrote:

> We Need a Replacement for TCP in the Datacenter


One of the reasons we're 'stuck' with TCP and UDP on the Internet is because most middle-boxes (firewalls) don't really understand anything else.

If you're strictly operating inside a DC, presumably with minimal/fewer firewalls, could alternatives like SCTP and DCCP be an option?

* https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...

* https://en.wikipedia.org/wiki/Datagram_Congestion_Control_Pr...

* https://en.wikipedia.org/wiki/Transport_layer


Funny. I worked at AT&T Bell Labs in the 80s. All of these insights seem eerily familiar.


Sure; they're also similar to the SCTP insights, from the late 1990s.


Yup. Also QNX's native networking protocol. QNX's basic networking primitive is a remote procedure call, so there's a message-oriented network protocol underneath. It can be run either on top of UDP or directly at the IP level.


The original DARPA paper is what the TCP/IP stack is still based on, right? It feels like it was never even intended to run at the scale it's deployed at now. Which amazes me, to be honest, that people have somehow gotten it to work at this scale.


The congestion control algorithms have been significantly improved since then. But yes it’s a testament to how good the original design is that it’s still what we use today.

Protocols like SCTP and QUIC work similarly but can avoid head-of-line blocking.


FWIU, barring FTL / superluminal communication breakthroughs and controls, Deep Space Networking needs a new TCP as well:

From https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_t... :

    void tcp_retransmit_timer(struct sock *sk) {
        /* Increase the timeout each time we retransmit.  Note that
         * we do not increase the rtt estimate.  rto is initialized
         * from rtt, but increases here.  Jacobson (SIGCOMM 88) suggests
         * that doubling rto each time is the least we can get away with.
         * In KA9Q, Karn uses this for the first few times, and then
         * goes to quadratic.  netBSD doubles, but only goes up to *64,
         * and clamps at 1 to 64 sec afterwards.  Note that 120 sec is
         * defined in the protocol as the maximum possible RTT.  I guess
         * we'll have to use something other than TCP to talk to the
         * University of Mars.
         *
         * PAWS allows us longer timeouts and large windows, so once
         * implemented ftp to mars will work nicely. We will have to fix
         * the 120 second clamps though!
         */
/? "tp-planet" "tcp-planet" https://www.google.com/search?q=%22tp-planet%22+%22tcp-plane... https://scholar.google.com/scholar?q=%22tp-planet%22+%22tcp-...


The ultimate dream protocol is one in which a sender just encodes bits in a certain way such that the receiver will get them, and puts them on the line without any handshaking or synchronization. I don’t think this is impossible. The space of orthogonal codes across time and frequency could be chosen to be practically infinite, therefore, any random selection of two such codes would look like white noise to each other. The receiver would have to listen on a large subset of such channels all at once, which is not practical in real-time, but could be practical looking backward at the stored waveform from some carrier channel that all such possibilities have in common. It would commonly miss single bits and large chunks of data, so it would have to have FEC across multiple scales of code, frequency, and time. This works for large messages, but smaller messages would have to be sent over a channel with bandwidth narrowed to consume the time window of detection. Thus you should have a fair guarantee that either every message will be received during the detection window, or no message will be received, and this could be the job of network infrastructure to monitor and buffer as necessary. If that fails, then, well, maybe let the application layer deal with it.


> The ultimate dream protocol is one in which a sender just encodes bits in a certain way such that the receiver will get them, and puts them on the line without any handshaking or synchronization.

This is a recipe for DDoS.

Some handshaking is always necessary. You can minimize it, but you can't get rid of it.


> This is a recipe for DDoS.

Inside a datacenter?


Inside a datacenter it's called massive incast and it's still pretty bad.


Oh, right, in a datacenter probably not.

EDIT: But you know, UDP fits the bill.


Datagram is a layer-3 protocol. There’s a lot going on underneath that.


I think that was an interesting read. I have worked on a userspace implementation of TCP via DPDK, so I have sympathy for the limitations mentioned (the load-balancing and thread-scheduling arguments are very accurate).

However, I would have liked a section dedicated to why hardware-accelerated UDP wouldn't be an adequate solution versus a whole new protocol. It seems to me it provides a solid basis for achieving the results the author wants to bring about.
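
For what it's worth, "accelerated UDP" already has concrete kernel-level forms; here's a rough sketch of UDP GSO on Linux (assumes kernel >= 4.18; peer address, port, and sizes are illustrative; error handling omitted), where one send() hands the stack a buffer that it segments into wire-sized datagrams, possibly offloading the segmentation to the NIC:

    /* UDP GSO sketch: one syscall, many datagrams. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103   /* fallback for older userspace headers */
    #endif

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5000);                    /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* hypothetical */
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        int gso_size = 1400;  /* payload bytes per resulting datagram */
        setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

        char buf[45 * 1400];  /* ~63 KB: UDP GSO sends are capped near 64 KB */
        memset(buf, 'x', sizeof(buf));
        send(fd, buf, sizeof(buf), 0);  /* kernel/NIC splits into 45 datagrams */
        return 0;
    }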


I think protocol churn extends well past datacenters and Homa, with a need for 21st-century domestic protocols that remove all of the baggage of TCP/IP completely.

TCP/IP could be offshored to where the undersea fiber meets the shore, with perimeter data-edge servers set up as the only TCP/IP nodes on domestic boundaries.

For a domestic protocol, it does not need to be routable and saves even more overhead (I liked the tcp joke in the comments btw).

Encapsulation over a domestic protocol would still make subscriptions to a tcp/ip service available, but IMO, domestic boundaries would be better served with a modern domestic protocol.

Hypothetically speaking, if every TLD had been established on its own unique protocol in the original design, Internet 1.0 would have matured much differently. Every TLD is a .com nowadays anyway, too ambiguous to be of any use (and the .org debacle still makes me laugh. They forgot what an organization was and just blended into a .com reject. That mission statement was just toilet paper after the tug-of-war).

As it turns out, one size does not fit all with protocols. Multiprotocol networks within domestic borders without tcp/ip wouldn't change any of the benefits of a data network.


My take: this is the making of a great essay question for a graduate course in networking.

The chance of TCP being thrown out wholesale is zero. What might happen is some slow, incremental improvements in certain aspects of networking, like some other comments suggest.

Until eventually, a networking student picks up a dusty old TCP doc and says to his teacher, innocently, "But we're hardly doing any of this stuff anymore!"


Ousterhout's paper perhaps comes from his vision for RAMCloud as well ... in which he bet that network latency would go so low over time that accessing memory on another machine would be fast enough to enable whole new categories of applications.

https://dl.acm.org/doi/10.1145/2806887


Haven't data centers switched to RDMA already? Why are we still wasting time with this networking nonsense when most of the time we're just copying data from one memory or cache to another, over a private interconnect? ;-)

It seems that Ousterhout's complaints about RDMA are about current RDMA implementations. I expect many of them are fixable.

Ousterhout's complaints about TCP are all valid, though he doesn't mention my pet peeve with TCP, which is that connectivity breaks when your IP address changes. And requiring apps to deal with IP addresses in user-level APIs seems like a mistake.

Simple request-response RPC protocols were a good idea in the 1980s, and they're still a good idea today. I should probably read the Homa paper(s) regarding congestion control though, as it isn't covered in this PDF.


OK, I watched The Playlist, the movie about Spotify, and they mentioned that they forked TCP/IP and made it better. That was an incorrect statement, because under the hood everything on the internet still runs TCP/IP; they probably just improved their application layer or moved to UDP.


It's interesting that everyone (including the author) talks about UDP as a lossy protocol, but it doesn't seem that UDP drops actually occur on a routine basis anywhere. The UDP-based DDoS attacks seem to prove that; if UDP really were being dropped, those DDoS attacks wouldn't be so problematic.

That said, it's an interesting read. TCP is inefficient, but that inefficiency has been patched/masked by hardware solutions.

It's nice to see that someone's still thinking about this. I remember the days when there were tons of non-IP protocols floating around (IPX, DECnet, AppleTalk, etc.). TCP/IP won, which was not an obvious thing at the time.


I think the point re UDP being unreliable is that you have to design your applications to take the unreliability into account, because drops do happen even if they may be infrequent: assuming reliability when you can get unreliable behavior will result in correctness issues.


At which point most applications end up reinventing a good chunk of TCP.


I see UDP more like a low-level interface allowing you to build your own on top, where you decide which packets need to be received 100% and which ones can be dropped. Basically the foundation of your very own TCP, with hookers and blackjack.
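
As a toy illustration of how quickly "your very own TCP" accretes, here's a stop-and-wait sketch over a connect()ed UDP socket (the header, timeout, and ACK format are all invented for the example): even this minimal version already needs sequence numbers and retransmission, and a real design would go on to need sliding windows, RTT estimation, and congestion control.

    /* Toy stop-and-wait reliability over UDP (invented wire format). */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    struct hdr { uint32_t seq; uint8_t is_ack; };  /* hypothetical header */

    /* Retransmit the packet until the matching ACK arrives. */
    static void send_reliable(int fd, uint32_t seq, const char *data, size_t len) {
        char pkt[1500];
        struct hdr h = { htonl(seq), 0 };
        memcpy(pkt, &h, sizeof(h));
        memcpy(pkt + sizeof(h), data, len);

        struct timeval rto = { 0, 200000 };        /* fixed 200 ms "RTO" */
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &rto, sizeof(rto));

        for (;;) {
            send(fd, pkt, sizeof(h) + len, 0);     /* (re)transmit */
            struct hdr ack;
            ssize_t n = recv(fd, &ack, sizeof(ack), 0);
            if (n == (ssize_t)sizeof(ack) && ack.is_ack && ntohl(ack.seq) == seq)
                return;                            /* acknowledged */
            /* timeout, duplicate, or stray packet: loop and retransmit */
        }
    }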


The new digital television broadcasting system in the US (ATSC 3.0) is exactly this. It's all UDP, but wrapped in another layer which allows multiple virtual streams, and that's all encoded in an OFDM wireless protocol. It's bundled up at the broadcast center, sent out via the big towers, and then unwrapped and decoded on the receiver. The end result is that once the receiver chipset has stripped off the wrapper, the OS of whatever client device is consuming the broadcast just gets regular-looking UDP packets filled with MPEG-TS or DASH media streams, plus web pages, ads, games, or whatever. A.k.a. blackjack and hookers. Think of it as a giant one-way WiFi network using just UDP for the packets. It's honestly pretty cool.


> Basically the foundation of your very own TCP with hookers and blackjack.

That’s exactly how the early drafts of the QUIC RFC described it /s


Nothing deliberately drops UDP packets, but packets of all sorts get dropped when there's congestion.


Which is why protocols like TCP are built on backpressure - so you don't keep making the exact same mistake in a tight loop. Happy-path behavior doesn't matter when the worst case, or even the median case, is nonfunctional.


> but it doesn't seem that UDP drops actually occur on a routine basis anywhere.

I have seen code that didn't even handle out-of-order delivery work well for over a decade in local networks. Even when that broke down, it turned out that IP packet fragmentation just triggers a slow path in smart switches, so if your packets fit into the network's MTU (with some bytes to spare for VLAN tagging) you still might be able to avoid the problem.
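
If you're going to depend on staying under the MTU, Linux at least lets you enforce and query it per destination; a small sketch (hypothetical peer, no error handling) that sets DF so oversized sends fail with EMSGSIZE instead of fragmenting, then reads the current path MTU to size payloads:

    /* Avoid IP fragmentation: set DF and size datagrams under the path MTU. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(4000);                    /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* hypothetical */
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        /* DF on: oversized sends now fail instead of being fragmented. */
        int pmtud = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));

        /* IP_MTU is only valid on a connected socket. */
        int mtu; socklen_t len = sizeof(mtu);
        getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len);
        printf("path MTU %d -> max UDP payload ~%d bytes\n", mtu, mtu - 28);
        return 0;
    }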


> It's interesting that everyone (including the author) talks about UDP as a lossy protocol, but it doesn't seem that UDP drops actually occur on a routine basis anywhere.

Just to clarify, you are referring to in the datacenter right? They occur in wireless all of the time.


Frame drops are endemic to cloud datacenters. High levels, all the time.


“Lossy” is not the word.

Reliable/unreliable are the words.

Packets can and do get dropped, TCP, UDP or otherwise. It’s just a question of how the protocol behaves when that happens.


Anyone for TIPC?

http://tipc.io/


Homa over EVPN-VXLAN, across multiple data centers - what could possibly go wrong? See [1] (4)

[1] https://www.rfc-editor.org/rfc/rfc1925


> Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks.

Not being sockets API compatible kinda sucks. Ok, we could use a new connect() variation that allows for earlier data send, but the API being mostly similar would help -- there's a ton of socket code out there!

As for RPC, well, RPC is mostly a thing of the past, with most everything today being HTTPS -- but usually through libraries that do the I/O, so it's possible to retrofit a new transport protocol into them. But why not just use QUIC within the datacenter?


I don't think they mean RPC in the sense of rpc(3), but rather things like gRPC or protobuf. The point is that if both sides really just want to talk gRPC to each other (and never want to talk plain TCP to each other), then that use case is "easy" to meet by implementing gRPC on top of Homa.



I think the main thing working against any alternative is that it is easier to keep consistency at all levels, instead of trying to use TCP for external connections and something else internally.



I get the issue with TCP, but I'm not sure about Homa... e.g., why not UDP with some Homa-like semantics on top? That might really ease the "Getting there from here" issue.

In fact, what's actually being suggested is for applications to replace calls to TCP-based APIs with calls to gRPC APIs (or other high-level RPC APIs), where the transport layer becomes an implementation detail. Fair enough, but this is a very roundabout way to go about it.


It would be interesting to see a write-up from them on Homa's benefits over UDP. UDP has been used to work around issues with TCP both in the datacenter and in unreliable WANs.

Skimming over the paper, I think the magic of Homa is in its RPC calls and its short-lived connections. When they're handled at layer 3 and 4, they can provide a significant hint to switches, routers and hosts regarding prioritization and congestion control. If you agree upon the prioritization algorithms as part of the protocol, then both sending and receiving hardware can coordinate much more easily.

If we instead just implemented something like Homa on top of UDP, it would basically mean that the top of the OSI stack would have to somehow inform the lower layers of the stack about these "sessions". You'd also have to hope that 3rd party peers decide to implement the hints in the same way. This would result in much more complexity.


Here's what I'm thinking:

You add a Homa-like header inside a UDP packet. Inside the datacenter you use switches, NICs, etc. that know and understand the Homa-like protocol and can implement Homa-like behavior... as needed. Anywhere else, you can fall back to plain UDP, due to its ubiquity.

Yes, various things would have to somehow know the Homa-like protocol was being used... just like various things would have to somehow know the Homa protocol was being used. Yes, different vendors would have to have compatible Homa-like implementations... just like different vendors would have to have compatible Homa implementations.

I think the complexity you mention is inherent in anything that actually gets more widely deployed and used.
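
To make that concrete, a purely hypothetical shim might look like the following (the fields and layout are invented for illustration and are not Homa's actual wire format): in-DC switches and NICs that understand the shim can act on the message length and priority, while everything else just forwards ordinary UDP.

    /* Hypothetical "Homa-like" shim carried inside each UDP payload. */
    #include <stdint.h>

    struct homa_like_hdr {
        uint64_t rpc_id;    /* identifies the request/response pair       */
        uint32_t msg_len;   /* total message length, known up front       */
        uint32_t offset;    /* byte offset of this packet within the msg  */
        uint8_t  priority;  /* receiver-granted priority class            */
        uint8_t  flags;     /* e.g. REQUEST / RESPONSE / GRANT            */
        uint16_t reserved;
    } __attribute__((packed));

    /* Each datagram on the wire:
     *   [ IP | UDP | struct homa_like_hdr | payload ]
     * Devices unaware of the shim treat it as plain UDP traffic. */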


I wrote a blog post that might be interesting to those wanting an introduction into some of the basics of the problems called "02-FEB-2011: Why is there packet loss ?" [0]

[0] https://rkeene.org/projects/info/wiki/176


Great post. I like the closing:

    Ultimately, we decided the best thing to do was to do nothing and hope for the best.
All of the experienced network engineers I've worked with who have run into issues that feel like they could be improved by tweaking QoS always end up saying to me, "Nah. Just get bigger/more pipes." I've never been in a position like yours to make a cogent argument as to why they were wrong, or to lay out the details as well as you did there.


Why not just use UDP instead?

I feel TCP is just designed to send files and data segments, but it doesn't work well for other things.


I mean, part of the Envoy/gRPC thing is to paper over some of the shortcomings of TCP, so that yes, it's still TCP underneath, but you're not setting up connections the same way. Furthermore, for any improvements in the space, the Envoy sidecar is well positioned to do that upgrade.


1 security vendor flagged this URL as malicious

https://www.virustotal.com/gui/url/43f33fe70cb4ef9fcc2370460...


> It uses several techniques for this, of which the most notable is that it takes advantage of the priority queues provided by modern switches.

I'd understand if they said routers, but switches? Do L2 switches have any notion of priority and if yes how does it work?


Typically you map a VLAN to a priority via PFC (Priority Flow Control). You can do it with vconfig on Linux. Switch OSes have their own CLI for this.

Some switches can do PFC for untagged packets. They classify based on DSCP, and map that to a PFC priority. I've never used that though.
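
On the host side, the marking itself is simple; here's a minimal Linux sketch (the DSCP value and priority are arbitrary choices, and whether a switch honors them depends entirely on its classification config) that sets the DSCP bits via IP_TOS and a local queueing priority via SO_PRIORITY, which the VLAN egress QoS map can translate into 802.1p bits:

    /* Mark a socket's traffic for priority classification. */
    #include <netinet/in.h>
    #include <sys/socket.h>

    int mark_socket(int fd) {
        int tos = 46 << 2;   /* DSCP 46 (EF) lives in the upper 6 TOS bits */
        if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
            return -1;

        int prio = 6;        /* skb->priority; the VLAN egress QoS map can
                                turn this into an 802.1p PCP value */
        return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
    }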


I always assumed google datacenter has their own protocol already


Until this replacement is as stable over time and as simple to implement, no worries. But if it's a stability joke or needs 748937493874943 devs to code an alternative...


"Homa demonstrates that it is possible to create a transport protocol that avoids all of TCP’s problems." - This is a huge statement... It assumes the author knows all of TCP’s problems...


SPDY for local RPC. Interesting idea. Seems the easiest way to get it out there would be support for it in gRPC?


Do we know where it has been submitted?


Time to return to the IPX protocol?


UDP?


I'd tell you a UDP joke, but you probably wouldn't get it.

So here's a TCP joke:

Hello, would you like to hear a TCP joke?

Yes, I'd like to hear a TCP joke.

OK, I'll tell you a TCP joke.

OK, I'll hear a TCP joke.

Are you ready to hear a TCP joke?

Yes, I am ready to hear a TCP joke.

OK, I'm about to send the TCP joke. It will last 10 seconds, it has two characters, it does not have a setting, it ends with a punchline.

OK, I'm ready to hear the TCP joke that will last 10 seconds, has two characters, does not have a setting and will end with a punchline.

I'm sorry, your connection has timed out... ...Hello, would you like to hear a TCP joke?


The handshake sequence is exaggerated. It's usually just 3 messages.

The 3 initial messages establish a connection.

    A: I would like to tell you something. (SYN)

    B: I acknowledge you want to tell me something. (SYN-ACK)

    A: I received your acknowledgement. (ACK)
After the handshake sequence is done, data transfer begins.

Only then, it is possible to know that the "something" was a joke.

If B answers with RST instead of SYN-ACK the connection is refused. If B doesn't answer, A will interpret this as a connection timeout.


Not even "after", you'd generally put the message in the same packet as the second ACK.


Only with specialized client software. The connect() system call doesn't return to the caller until the three-way handshake is complete, so you need a fourth packet to send useful data to the server socket.

In the other direction, where you have datacenter hardware and custom kernels, it's common to see the stack cheat and start blasting packets back to the client as soon as it gets the initial SYN, just expecting that the ACK will arrive normally.
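
Related aside: on Linux the "specialized client software" can be as small as using TCP Fast Open, which puts data in the SYN itself rather than waiting for connect() to return. A minimal client sketch (hypothetical address/port; assumes net.ipv4.tcp_fastopen is enabled on both ends and a TFO cookie has been cached from a prior connection; no error handling):

    /* TCP Fast Open client: payload rides in the SYN. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000   /* fallback for older headers */
    #endif

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(8080);                     /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &srv.sin_addr);  /* hypothetical */

        /* Connects and queues the payload for the SYN in one call;
         * falls back to a normal handshake if no cookie is cached. */
        const char req[] = "PING";
        sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
               (struct sockaddr *)&srv, sizeof(srv));
        return 0;
    }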


Well, even if you wait for the kernel that's only microseconds before you can go from ACK to data. No round trips necessary.


You're missing the frame-size negotiation that then occurs following this.


A classic, but it's really far from how TCP actually works.


It needs something like -- has anyone told a TCP joke in this vicinity in the last 10 seconds? If so I will tell the joke at a slower rate


Did

Did you

Did you hear

Did you hear the

Did you hear the one

Did you hear the one about

Did you hear the one about traceroute?


Hello, would you like to hear a UDP joke?

No

Knock, knock!


Wait, two characters, as in, like, characters in a string, or characters in a story? Is a punchline like a newline?


Ack.


Fin!


Rst


Ack


There's no ACK for RST.

RST is the equivalent of hanging up.


RST can come at any time and be sent at any time.


Also, if an ACK packet takes a really long time for some reason, it may be duplicated and arrive after the connection is shut down.



They're not trying to make a universal standard. People are way too quick to dig out this xkcd.


So, you are suggesting a different standard response.

I think this cognitive trap neglects the lessons of NetBEUI.

Enhance your calm.


Cute, but I am suggesting not having a standard response.


And "not having a standard response" is the antithesis of global communication.

I appreciate your perspective though.

Have a wonderful day.


I usually find that standard responses impede real communication.


The ISO OSI model stratifies communication into layers, with the lower layers tasked with encapsulating meaningful human content for delivery across networks. The protocol itself is unaware of its payload content in most circumstances, which is why most network hardware has minimal complexity yet remains interoperable.

I assure you every packet of content is encapsulated in several standard responses.

Have a nice day =)


I don't really get the hate for TCP here. No one is forcing others to use TCP. It is perfectly fine to just use UDP if TCP is not the right fit for your scenario, and build whatever semantics you need on top of that. QUIC did it. Why can't the author?


So... basically another reinvention of UDP? I'm not entirely sure what "datacenter computing" is supposed to mean, given that a datacenter can be hosting machines for a very wide array of applications with very different network requirements.



