We need a replacement for TCP in the datacenter [pdf] (stanford.edu)
478 points by kristianp on Oct 31, 2022 | 313 comments



Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

What's kind of hilarious about this paper is, these are just the network-layer problems! It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

And that's just TCP. We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.


> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is ridiculous.

Hyperscalers see an immediate ROI from efficiency/reliability improvements and actively invest in TCP alternatives all of the time. It's just really hard.

Networking companies see an ability to differentiate their products from their peers and work on this kind of thing as well. I did a 3 second google for "QUIC acceleration Mellanox" and got a hit on Nvidia's blog right away.

You just can't trivially replace something with an investment totaling 50 years of clock time and thousands of years of engineer time. It will either take a long time or a massive shift in needs/technology. FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.


> 50 years of clock time and thousands of years of engineer time

It's not just the size of the investment, it's that it's the protocol everyone uses to talk to other people's machines, and you can't upgrade or replace other people's machines.


In this case we're talking about within the Datacenter, and you could conceivably update every network device and system to talk the new thing if you wanted. This is more achievable at a hyperscaler, where there tend to be < 3 distinct protocols, proxies, etc.

TCP gives you three things:

1. Reasonable performance - This is hard but not impossible to replicate.

2. Reliability - This is very hard to replicate because networking edge cases are very hard to isolate.

3. Fairness - This one is roughly impossible, because the "fairness" is an artifact of the experimentation and tweaking of Congestion Control Algorithms.

To elaborate on fairness, dynamic traffic control of all flows within a DC while maintaining high utilization is roughly impossible. You can get really close to this by picking your battles wisely (i.e. solid demand control for data warehouse workloads), but you'll always end up counting on individual flows to react appropriately to loss. They need to back off enough to make room for others without tanking their own throughput.
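
To make "back off enough without tanking their own throughput" concrete, here's the textbook AIMD shape as a toy loop - illustrative only, the constants are made up, and real CCAs (CUBIC, BBR, etc.) are far more sophisticated:

    # Toy AIMD (additive increase, multiplicative decrease), one step per RTT.
    def aimd(loss_events, cwnd=10.0, alpha=1.0, beta=0.5):
        trace = []
        for lost in loss_events:
            if lost:
                cwnd = max(1.0, cwnd * beta)   # back off hard on loss
            else:
                cwnd += alpha                  # probe for more bandwidth
            trace.append(cwnd)
        return trace

    print(aimd([False] * 5 + [True] + [False] * 5))

The entire "fairness" property falls out of every flow following a rule like this; tuning alpha/beta (and everything else) against real traffic is the empirical part described below.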

The people who design and implement these algorithms are definitely geniuses, but even they rely on TONS of empirical evidence to narrow parameters to what's appropriate. Of the Kernel Networking people I've worked with, Lawrence Brakmo had the most sophisticated network testing harness I've seen. Even then, you don't really know if it works (and can't finish tuning it) until you run it in production.

Running novel congestion control algorithms in production at a sufficient scale to figure out whether or not they're working appropriately is a great way to kill your network, so we end up conducting the equivalent of CCA drug testing to roll it out slowly and safely.

The end result of all of this is that it's really hard to solve the "arbitrary connections sharing arbitrary network topologies with high utilization" problem quickly enough for it ever to look like a breakthrough rather than just steady progress.

It's also worth noting that it's usually easiest to prove performance, so you'll see a lot of excitement about performance benchmarks from people who don't yet know what they're about to learn about networking. We were very much in this camp at Facebook when we were all-in on memcache-over-udp, and we later abandoned it completely.


After having lived through Amazon's early (pre-2003ish) UDP-based networking I got a laugh around 2006-ish or so reading about how facebook was into UDP. I assume there are people who worked there who still have the scars.


Do you have any specific problems with UDP that you can elaborate on?

UDP is used successfully in many places.


I’m really looking forward to seeing the original commenter’s reply on this. But I’ll share my experience too.

I’ve found UDP to be great for latency but pretty awful for throughput. Especially over longer routes (i.e. inter-region transports). Also, if you fire UDP packets out of a machine in a tight loop then there is every chance you could overload various buffers and just lose them (depending on the networking hardware).

TCP is comparatively amazing for throughput, but you do take a latency hit (especially on the initial handshake, which doesn’t exist for UDP).

There are some very experienced people commenting here though, and I’d be happy to be corrected or expanded upon.


Anecdotal, but I've some experience in running both TCP- and UDP-based VPN over long-latency links (I worked from halfway around the globe for some years).

With OpenVPN it's easy enough to test - configure for UDP, or configure for TCP. With long latency and a tiny amount of packet loss, running TCP over TCP OpenVPN completely stalls, while TCP over UDP OpenVPN is excellent - it's around the same performance as running direct TCP, or sometimes actually better. At work we've also used other types of VPN setups (for engineers on the road), and the TCP-based ones (we've used several) work fine most of the time, but if you try that from far away it becomes nearly unusable while UDP OpenVPN continues to work basically just fine.

The TCP over TCP VPN performance problem (over long latency links) presumably has to do with windowing and ack/nak on top of windowing with ack/nak.


The TCP over TCP performance problem can be summarized as follows:

Because the underlay TCP is lossless (being TCP), every time the overlay TCP has to retransmit, it adds to the queue of things that the underlay TCP has to retransmit (and the need to retransmit happens more or less at the same time).

So instead of linear increase in the number of packets, you get ~quadratic.

This balloons the required throughput needed to “rectify” the issue from the protocol standpoint at both levels - usually precisely at the point when there’s not enough capacity in the first place (the packet loss is supposed to signal congestion).

If you are very lucky, the link recovers fast enough that this ballooning is small enough to be absorbed by the newly available capacity.

If the outage is long enough, the rate of build-up of retransmits exceeds the capacity of the network to send them out - so it never recovers.

Needless to say, the issue is worse with large window in overlay TCP session - e.g. a sudden connectivity blip in the middle of the file transfer.
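
A deliberately crude back-of-the-envelope version of that ballooning (treating each layer as independently retrying until success, which oversimplifies the timer interactions but shows the compounding):

    # Toy model, not a faithful TCP simulation. With per-packet loss
    # probability p on the real link, plain TCP needs ~1/(1-p) wire sends
    # per segment; stacking a second reliable layer on top roughly squares
    # that during a loss burst.
    for p in (0.01, 0.05, 0.20):
        single = 1 / (1 - p)       # plain TCP over the lossy link
        stacked = single ** 2      # TCP tunnelled over TCP, worst case
        print(f"p={p:.2f}  plain ~{single:.2f}x packets  stacked ~{stacked:.2f}x")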


> I’ve found UDP to be great for latency but pretty awful for throughput.

UDP/multicast can provide excellent throughput. It's the de facto standard for market data on all major financial exchanges. For example, the OPRA feed (which is a consolidated market data feed of all options trading) can easily burst to ~17Gbps. Typically there is a "A" feed and a "B" feed for redundancy. Now you're talking about ~34Gbps of data entering your network for this particular feed.

Also, when network engineers do stress testing with iperf we typically use UDP to avoid issues with TCP/overhead.
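
For anyone curious what subscribing to such a feed looks like at the socket level, a minimal multicast receiver is roughly the sketch below (group, port and buffer size are made up; real consumers also do sequence-number gap detection and A/B feed arbitration):

    import socket, struct

    GROUP, PORT = "239.1.2.3", 5000   # hypothetical feed group/port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Big receive buffer: bursts arrive faster than the app can drain them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 64 * 1024 * 1024)
    sock.bind(("", PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        # A real consumer would check the feed's sequence numbers here.
        print(len(data), "bytes from", addr)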


That’s interesting. And I’m sure they have some very knowledgeable people working for them who may(/will) know things I don’t.

That being said, it wouldn’t surprise me if they were pushing 17G of UDP on 100G transports. Probably with some pretty high-end/expensive network hardware with huge buffers. I.e. you can do it if you’ve got the money, but I bet TCP would still have better raw throughput.


Yep, 100G switches are common nowadays since the cost has come down so much, and you can easily carve a port to 4x10G, 4x25G, and 40G. In financial trading you tend to avoid switches with huge buffers as that comes at a huge cost in latency. For example, 2 megabytes of buffer is 1.68ms of latency on a 10G switch, which is an eon in trading. Most opt for cut-through switches with shallow buffers measured in 100s of nanoseconds. If you want to get really crazy there are L1 switches that can do 5ns.
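
The 1.68ms figure is just buffer size divided by line rate:

    buffer_bits = 2 * 1024 * 1024 * 8     # 2 MiB of buffer, in bits
    line_rate_bps = 10e9                  # 10 Gb/s
    print(buffer_bits / line_rate_bps * 1e3)   # ~1.68 ms of queuing if the buffer fills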


That is a really good point that I hadn’t considered. Presumably this comes at the risk of dropped packets if the upstream link becomes saturated? Does one just size the links accordingly to avoid that?


Basically yes, but the links themselves are controlled by the exchanges (and tied in to your general contact for market access).

In general UDP is not a problem in the space because of overprovisioning. Think "algorithms are for people who don't know how to buy more RAM", but with a financial industry budget behind it.


Do the vendors actually convince anyone to buy those hubs rebranded as "ultra high performance 5ns L1 switches"?


Multicast throughput is hard to measure because it is... well, multicast.

Depending on where your RP's are, and how you are transmitting multicast packets across a core, multicast performance can vary a lot.

The main advantage of multicast, however, is that throughput between RPs doesn't need to be very large.


It’s actually pretty easy to monitor the throughput with the right tools. The network capture appliance I use can measure microbursts at 1ms time intervals. With low latency/cut-through switches there are limited buffers by design. You are certain to drop packets if you are trying to subscribe to a feed that can burst to 17Gbps on a 10Gbps port.

Market data typically comes from the same RP per exchange in most cases. Some exchanges split them by product type. Typically there’s one or two ingress points (two for redundancy) into your network at a cross connect in a data center.


Have you tried to get inline-timestamping going on those fancy modern NICs that support PPT? Orders of magnitude cheaper than new ingress ports on that appliance whose name starts with a "C", also _really_ cool to have more perspectives on the network than switch monitor sessions.


UDP is little more than IP, so there isn't a technical reason why UDP couldn't be just as fast as TCP _per se_. But from when I was toying with writing a stream abstraction on top of UDP in Linux userspace, I came to the same conclusion, it's hard to achieve high throughput.

My guess is that this is in part because achieving high throughput on IP is hard, and in part because it's never going to be super efficient at this level (in userspace, on top of kernel infrastructure that might not be as optimized for throughput as it is in the case of TCP).
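
The depressing arithmetic: if your userspace "stream" does the obvious stop-and-wait thing (one datagram in flight, wait for its ACK), throughput is capped at payload/RTT no matter how fast the link is (numbers below are made up but representative):

    payload_bits = 1400 * 8            # one MTU-ish datagram
    for rtt_ms in (0.05, 0.5, 5.0):
        bps = payload_bits / (rtt_ms / 1e3)
        print(f"RTT {rtt_ms} ms -> {bps / 1e6:.1f} Mbit/s ceiling with one packet in flight")

Which is why you end up reinventing windows, cumulative ACKs, retransmit timers and pacing - i.e. most of TCP - before the throughput becomes respectable.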


You can use eBPF/DPDK these days for hardware offload.


What about QUIC? Do you think that HTTP/3 will suffer from throughput as well?


UDP is just a protocol. I’ve served millions - even billions - of people with UDP media delivery. I use it all the time for all my work communication (WireGuard)

I wouldn’t use it to ping my gateway though, or to join a multicast group, nor would I use it to establish my bgp session, I use icmp, igmp and tcp for that.


Amazon used UDP over multicast for request/response when sometimes the responses would be very large, and implemented reliability on top of that through fallback to UDP unicast. This was all using Tibco RVD (taken from Bezos' experience in finance on the East Coast before Amazon, I think).

The really key point there is probably the size of the responses, it wasn't just tiny atomic bits of stock information.

At one point as a system engineer I actually had to bump up the size of UDP socket data that the kernel would allow to be sent, across the entire production set of servers. SWEs were really hammering on UDP hard (the platform framework was sort of "sold" as being better than TCP, though TCP doesn't have those kinds of limits).

The result was that one Christmas the traffic scaled up to the point that the switch buffers were routinely overflowing all the time. There was no slow start in UDP so the large payloads the SWEs were sending would go out as fast as the NICs could send them, which resulted in filling up packet buffers in the 6509s (Sup 720s I think at the time? Whatever it was the network engineers had already upgraded to whatever was Cisco's latest and greatest at the time and had tuned the switch buffers).

What made it even more fun was that as packets were dropped on the multicast routes, the unicast replies created a bit of a bandwidth-amplification-attack. Then eventually the switch buffers started dropping IGMP packets, and if you drop enough of those in a row then IGMP snooping fails and the multicast routes themselves start getting torn down. Now one of the destination nodes gets "packet loss" that is complete (it receives nothing at all). Then when it eventually rejoins it has fallen far behind all the peers (causing a bunch of issues while it was out of sync) and then it requests more unicast messages to get caught up, creating even more of a flood of rapidly-sent UDP.

What I wound up doing is writing scripts to log into all the core switches and dump out the multicast tables and convert the IGMP snooped routes into static routes and reapply them. That let the multicast network grow as the site had to scale for Christmas, but kept all the routes in place and avoided the IGMP route flapping.

But even with that band-aid it still didn't work well and there was still high congestion and packet loss across the core switches. There were also problems with the CPU on the switches, and Amazon drafted an extension to how multicast routing was done and got Cisco to implement it ("S,* routing" IDK if that's right, it's been 20 years). And it was a good job that the Network Engineers had ripped out spanning tree and gone L3 entirely, since the packet loss and CPU congestion would have caused spanning tree to flap, which would have amplified all the congestion issues. Eventually Tibco RVD was ripped out and a TCP-based gossip-based-clustering protocol was put into place.

So if you use UDP based stuff the data packets need to be small, or else you need to throttle the senders somehow, and you need to not care about reliability. For stock ticker information it might work well, and for multimedia streaming where the protocol layer above it does slow start and congestion control. I suspect that if you dug up the network engineer responsible for those networks, though, they could tell you stories about packet loss. If UDP works well at your company my suspicion is that you've either got a protocol sitting on top of UDP which implements at least half of what TCP offers, and/or you've got an overworked network engineer trying to keep it all together, and/or you just haven't scaled enough yet. I also wouldn't be too surprised if some Wall Street firms have switched to RDMA-over-Infiniband or something like that with link-layer and end-to-end credit-based flow control[*] (as this paper points out, though, RDMA has issues itself and doesn't meet all the criteria for a TCP-replacement, but that would at least stop the packet loss issues due to buffers overflowing).

QUIC is a good example of what you need to do in order to use UDP (Section 4 of RFC 9000 is all about Flow Control to prevent fast senders from DoS'ing your network switches). But for the average HN/reddit reader who reads something about how TCP is awful and has the "showerthought" of wondering why everyone doesn't just switch to UDP in the datacenter, they're missing a massive problem: Ethernet has no flow control and just promiscuously drops packets everywhere, so if you thoughtlessly slap UDP on top of that your datacenter will absolutely have a meltdown. You need to use something like QUIC at a minimum.
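
To give a flavour of what "at least half of what TCP offers" means in practice, here's a toy version of the receiver-granted-credit idea (loosely in the spirit of RFC 9000's MAX_DATA limits, not actual QUIC; the names are made up):

    class Sender:
        """Connection-level credit: never have more bytes outstanding
        than the receiver has explicitly granted."""
        def __init__(self):
            self.bytes_sent = 0
            self.limit = 0

        def on_max_data(self, new_limit):
            self.limit = max(self.limit, new_limit)   # credits only grow

        def try_send(self, payload):
            if self.bytes_sent + len(payload) > self.limit:
                return False    # blocked until the receiver grants more credit
            self.bytes_sent += len(payload)
            return True

    s = Sender()
    s.on_max_data(4096)
    print(s.try_send(b"x" * 3000), s.try_send(b"x" * 3000))   # True False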

And buried in what I wrote above is an observation that UDP multicast doesn't really solve reliable delivery across multiple servers and failover of streams that you'd like to be able to see, that's another solution which is simple and wrong (and which it looks like Homa is trying to address).

[*] On second thought they probably massively overprovision their network since mostly they just care in the extreme about latency at the expense of everything else (which is a very unusual use case).


> Ethernet has no flow control

Isn't this what pause frames and PFC are for? (Honest question)


Multicast storms happened regularly back in 2004


True, there were tons of crappy hardware still in production at that time. The first job I had out of college consisted of crappy 3Com hubs (not switches), so something like Norton Ghost could take down the whole network since multicast would get flooded everywhere. Nowadays this is less of a problem as hubs are long gone and most switches have IGMP snooping by default and would only forward multicast frames that someone wants.

A bad client can still cause problems though, like sending a high rate of multicast packets with a TTL of 1.


Amazon was definitely not run off of crappy 3com hubs, not even back then.


> In this case we're talking about within the Datacenter

Oh gotcha. It's right there in the title, but I missed it somehow :p


Yes you can. Just offer a better product, and people will buy it instead of the old or bad product. Better yet, make the new product backwards compatible, and fewer people will have qualms about forking out for it. Better yet, do an aggressive takeover, like Microsoft did, and just force the entire industry to adopt your stuff...


Great! When do you think you'll have it done?


Done? What do you mean "done"? Consulting hours are much better on projects that cannot ever be finished!1


You mean like IPv6?


I think QUIC/http2 is a much better example.

Google made that happen almost unilaterally via their Chrome dominance.


I mean, this is how new features come about, for the most part (look at ajax, from Microsoft's IE dominance). The consortium allows anyone to contribute, not just the dominant browser, but the dominant browser will always be able to experiment with new web features without having to discuss it with anyone.


> FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.

Maybe we were just early in giving (HFT) customers RDMA back in ~2007[1][2] but I don't see it entering the mainstream anytime soon. And after a relatively short 20 years of adoption, the "next" thing for hyperscalers is not going to be the next thing for everyone else.

[1] https://downloads.openfabrics.org/Media/IB_LowLatencyForum_2...

[2] https://www.thetradenews.com/wombat-and-voltaire-break-milli...


HFT networks are also a lot smaller than hyperscaler datacenters, and designed with more cross-sectional bandwidth. A good chunk of the traffic (trading-related messages) also tends to not use congestion control.

In large web company datacenters, RDMA and RoCE have had a much "rockier" path forward.


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is severe bullshit on two fronts:

- there is an immediate return on value - Google was driving this a decade+ ago for improvements in the data center (things like doubled+cancelable rpc, tcp cubic, quic, etc)

- academia constantly attempts to make these improvements as well because researchers are super incentivized to dethrone tcp for the glory. There are constant attempts to re-invent various layers (IP, tcp, the non existent upper layers of the OSI, etc) that come out of academic conferences every year.

The reason we’re still here is because our current stacks have been heavily optimized and tooled for production workloads. NICs can transparently re-assemble TCP segments for the OS and they can segment before transmit. You have to have a damn good value prop to throw away everything from software and hardware to careers and curriculum. It has to be a shitload better than the security nightmare of “return back each layer of the stack”.


I don't think you realise why this is so hard.

The basic reason is that software at every level expects TCP/IP. And you can't drop in a translation layer because it will require at least the same amount of overhead as "real" TCP/IP.

It is not a local problem, it is a global problem that affects basically every single piece of non-trivial software in existence.

Even if you construct your datacenter with the new protocol you will run into problems that you can't run anything in it. Want Python? Sorry, have to rewrite it. And every Python library. And every Python application. Then you need to deal with problems that people who can run their scripts on their machines can't run them in datacenter. And so on.

The reason nobody wants to do this is that they would be investing huge amount of money to solve a problem for everybody else. Because the only way to make TCP/IP replacement work is to make it completely free and available to everybody.

There are much better ways to allocate your funds and precious top level engineers that let them distance themselves from competition temporarily.


> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

And yet every time hardware designers get the chance they redesign Ethernet and IPv4--poorly.

See: HDMI 2.0+, USB 3.0+, Thunderbolt 3.0+, etc.

My suspicion is that this paper works fine between pairs of peers and immediately goes straight to hell after that. It is extremely suspicious that there is zero mention of SCTP and that it only compares to TCP and not UDP.

The problem with RPC is that multiple organizations must agree on meaning. And that's just not going to fly. It is damn near a miracle that a huge number of institutions all agree on the Ethernet/IP command "Please take this bag of bytes closer to the machine named: <string of bytes>."


Why do you say that these protocols are worse than Ethernet/IPv4? I'm not intimately familiar with any at L2/L3, but I don't think any have hacks as bad as ARP. (USB does have some weirdness at L1 though I know.)


I’ve generally considered IPv6 neighbor discovery to be a worse hack than ARP. ARP is a straightforward, fairly clean hack to layer the IPv4 addressing scheme over Ethernet, and it doesn’t pollute IPv4 itself. Neighbor discovery layers IPv6 on top of pseudo-IPv6, where the latter operates without knowledge of MAC addresses but nonetheless hardcodes knowledge of Ethernet. But hey, it eliminated the use of Ethernet broadcast in favor of a more complex but functionally identical multicast scheme.


Oh sure, the point is more that Ethernet/IP has to coordinate two separate ID spaces at all, whereas AFAIK no other packet-based protocol like the ones mentioned does this, so in that sense those protocols are better.


> Why do you say that these protocols are worse than Ethernet/IPv4?

Here's an example: I connected my nice expensive audio interface to my Thunderbolt port. It worked great! Then I moved a window on my monitor and all hell broke loose. In spite of the fact that it had way more than enough bandwidth to handle everything.

See, Thunderbolt doesn't have the ability to say "This tiny packet going to there needs priority and you need to break up those giant display packets."

Ethernet has solved problems like these in standards. They're not always implemented on particular chipsets, but they exist, and you generally can buy a product that has them.

Everything Ethernet has done and standardized has generally been for a reason. If you don't implement Ethernet, then you are starting over from scratch and will have to reimplement all of that stuff.

And you're probably not smarter than the guys who did it for Ethernet.

(If I'm being charitable: what's happened is that lot of standards tried to be more cost optimized than Ethernet. The problem is that transistor prices keep coming down. Eventually the price delta between Ethernet and <whatever> becomes inconsequential and you're basically left with real Ethernet and "kinda crappy" Ethernet at almost the same price.)


Never thought about that before. Ethernet supports extremely high levels of data transmission. USB-C for integrated charging and data transfer makes sense, but why are there HDMI cables?


I think it would be overkill to use a networking protocol to connect exactly two devices. Plus if you have something specific to video stream transfer you could maybe do some optimization specific to that use case, although I can't think of any at the moment.


Excepting HDMI, the parent's examples are all networks with more than one peer. Thunderbolt and USB3 can both have arbitrary trees of nodes.


The parent comment was specifically wondering about HDMI; the other examples were given as having some reason to exist, while he didn't see any reason for having HDMI instead of recycling some other protocol.


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols

I think this is more of an artefact of horizontal scaling and port-contention. De-facto standard discovery mechanism DNS does not work with ports, so "well-known port" abstraction kinda fails. Http as tunnel mostly avoids/sidesteps this problem.

> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them.

This is a weird take, or I don't understand it. If you can communicate with an edge node in another network, but the edge node has issues communicating with some inner node (on your behalf), then, as a user, you have no hope of fixing that connectivity issue anyway, regardless of whether a layered approach is used or not. This may be related to the previous point about HTTP as universal tunnel. Yes, this is a problem, but in the sense that communications are effectively terminated at the edge node and a monstrosity of stuff happens behind the scenes.


> De-facto standard discovery mechanism DNS does not work with ports

Yes, it does, see SRV records.
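
For example, with dnspython (the service name here is hypothetical):

    import dns.resolver   # pip install dnspython

    # SRV records carry the target host AND port (plus priority/weight),
    # so discovery isn't limited to A/AAAA plus a well-known port.
    for rr in dns.resolver.resolve("_myservice._tcp.example.com", "SRV"):
        print(rr.priority, rr.weight, rr.port, rr.target)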


I meant DNS A/AAAA queries with preconfigured/well-known ports being the default. While some applications/protocols/services do use some port discovery mechanism, I would argue it is nowhere close to being de-facto standard.


So true, but how many developers know about them? The API situation does not help either.


I would say ports are mainly a problem on layers below transport even though some tech overuse ports.


> Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

If this were true, then how do you explain that the likes of AWS, the same company who ended up investing in developing their own processor line, doesn't seem to think that any of the pet peeves you mentioned are worth fixing?



It's not obvious to me that replacing TCP really is harder than designing your "own" chip. Scarequotes here because those graviton chips (that's what you're referring to, I think?) are of course ARM chips, so they're not designing something fresh; they're adapting a very mature design to their own needs. In terms of interoperability, a custom chip based on a standard design is probably a simpler, more locally addressable problem than new network protocols.

Isn't it plausible that graviton was designed yet TCP retained simply because graviton as a project is easier to complete successfully?


> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

I work in a company that builds network troubleshooting/observability tools and we have some pretty experienced analysts to tell you what's wrong with the network. With that context, your idea of having any tool automatically diagnosing network issues is a pipe dream.

The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners, failures aren't always easily reproducible, and with human configuration (and therefore errors) almost every step of the way. Even if a tool that "returns each layer of the stack" would be useful, it still would be far from enough to diagnose issues.


"The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners" Ah, how people forget the early days of networking. I remember vividly the early days of the Networld/Interop trade show - Interop was in the name because if, as a vendor, your equipment couldn't integrate with the show network they would throw your booth off the show floor.

That's how bad interoperability in the early days was!


Every major corporation has multiple research organizations doing nothing but invest in things that don't have immediate shareholder value.

What you're talking about though isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale. The scale at which these changes needs to happen to be practical are at the cluster level in a single data center. If you can propose something that fits that bill there are a few companies willing to pay you millions in salary as an engineering fellow to do it.


> What you're talking about though isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale.

I'd put it differently: it's paying up hundreds of billions in infrastructure to have some sort of gain.

And which gain is that exactly?

I see a lot of "the world is dumb but I am smart" comments in this thread but I saw no one presenting any clear advantage or performance improvement claim about hypothetical replacements. I see a lot of "we need to rewrite things" comments but not a single case being made with a clear tradeoff being presented. Every single criticism of TCP/IP in this thread sounds like change for the sake of change, and changes that aren't even presented with tangible improvements in mind or a clear performance gain.

Wouldn't that explain why TCP is so prevalent, and no one in their right mind thinks of replacing it?


I mean the goal is more performance, especially if you can get more performance out of the same hardware. Faster setup times, faster connections, more connections, maybe faster teardown. Lower contention on saturated links. Inside of the datacenter is a controlled environment where something like that could work. Replacing TCP over the Internet at large is going to be an uphill battle. Still, if we're replacing the whole thing, then simpler code on the client and server end would be nice.


> I mean the goal is more performance, especially if you can get more performance out of the same hardware.

Are there actual numbers demonstrating this?

I mean, people are advocating wasting billions revamping infrastructure. What kind of performance are you hoping to buy with those billions? And are those gains worth it, or is it just change for the sake of change?

Sometimes things are indeed good enough.


> Every single criticism of TCP/IP in this thread sounds like change for the sake of change, and changes that aren't even presented with tangible improvements in mind or a clear performance gain.

It amuses me that many of those saying "we need to change" are the same ones that bemoan it when car manufacturers remove buttons or make glove boxes operational from a touch screen because they can.


It's figure 1 in the paper. Homa is over 10x faster than TCP (presumably CUBIC).


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

The paper argues, in the '3.1 Stream orientation' section, that stream orientation is a problem for TCP, and says that most apps send messages instead, and that a better protocol should handle messages natively, etc. Which is a good point, I think.

But back to TCP. What do you do, if you need to send Messages between applications in TCP? Preferably those Messages would be encrypted also.

You could make up your own protocol, but you probably would rather not! So you use something that is readily available, and does messages, encryption, etc. Would be nice if there were also a ready to use load balancers, caches, tools to debug it, etc

Now, what would be such a protocol.

Why HTTPS, of course.

So I kind of think that the lack of a low level Message Protocol has led us, as an industry, to coalesce these features bit-by-bit on top of HTTP. It's not perfect by any means, but it does the job.


HTTPS adds a tremendous amount of overhead to give you messaging. It's a lot better from a hyperscaler's perspective to replace TCP and not use the byte stream abstraction. After all, networks send messages. It's silly to throw that away at one layer and try to get it back at the next layer.


Surely most of your ideas are already being deployed in QUIC/HTTP3. It just happens inside a UDP datagram, for compatibility. Really you're not going to see any new IP protocol layers, there's too much quirky hardware on the network that wouldn't be able to handle it. If we can't even get IPv6 to work all the way to the client, we're never seeing new values for the protocol byte.


Don’t the hyperscaled cloud providers run totally segmented networks? What’s stopping them from using something proprietary internally and just exposing TCP at the end for termination of client connections?


Google already does that.


I’m not aware of them using something other than TCP internally (I’m sure by now they’ve migrated to QUIC but I’m not sure that QUIC necessarily solves some of the scaling challenges / optimizes for gRPC and low latency).


Google is using remote memory accesses rather than TCP for at least some classes of traffic (e.g. a caching system). They've been publishing details about how it all works too.

Also, they have a transport (Pony express) developed specifically for RPCs, rather than byte streams or datagrams.

Links: https://research.google/pubs/pub51341/, https://research.google/pubs/pub50590/, https://research.google/pubs/pub48630/, more generally https://research.google/pubs/?area=networking


Can someone ELI5 how remote memory access works?


I could be wrong but I believe they have a unified address space. There’s dedicated hardware that then owns a given memory range. On an access it will fetch it from the remote location matching that address on demand and store it in real memory in space allocated to it. Presumably it evicts stuff if there’s insufficient memory. Once the memory is brought over either a virtual address range is remapped to point to main memory or the ASIC just has a TLB itself.

This is pure speculation based on seeing the word ASIC in one of the summaries but it seems like it could be reasonable.


I don't think Google-internal communications happen over gRPC. Maybe the protocol was designed with an ambition to replace their internal RPC system but it probably failed at that.

They have a new system called Snap, although judging from the paper I don't think it can completely replace TCP: https://research.google/pubs/pub48630/ My understanding is that Snap enables new use cases, including moving functionality previously done via RPCs to RDMA-like one-sided operations. I think it is a complement to RPCs but does not replace them.


They do, it's called DCTCP. Although it's actually an open standard.


Your ideas are interesting, can you link to or explain a concrete example though? The idea of everything magically debugging itself doesn't apply to a single piece of software I've ever seen, so I'm curious what kind of design would lead to that being possible.


Here's an example of an improvement to sending large files over long distances -- the Tsunami protocol. It tries to get a best of both worlds to limit the detrimental effect of synchronous roundtrips in the TCP protocol for file transfers:

https://tsunami-udp.sourceforge.net/


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP

If you have ever used multiple TCP or UDP connections in parallel on a single machine (doesn't matter if server or client) then you should realize that ports are required.

Apart from that, you can run HTTP on other ports than 80. You can also use HTTP to load balance or do service discovery by means of redirects. (Caveat, I don't work in this field and can't say how solid the approach works in practice).


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is just not true. Stuff needs to be funded and worth doing, and the internet, like almost everything, is built on making things worth paying for, but there are also loads of improvements being made everywhere.


What is QUIC in your book?

Given, say, $50 million of dev time, what would you go about fixing? And in what way?


In addition to QUIC, KCP [1] is another reliable low-latency protocol that sits on top of UDP that might be interesting. And unlike RFC 9000/9001 (QUIC), encryption is optional. I haven't really seen it mentioned much outside of primarily China-focused projects, like V2Ray [2], but there is also some English information in their Git repo [3].

[1]: <https://github.com/skywind3000/kcp>

[2]: <https://www.v2fly.org/en_US/>

[3]: <https://github.com/skywind3000/kcp/blob/master/README.en.md>


KCP uses a brute force congestion control algorithm that is unfair and inefficient. It is also poorly specified, which is probably why it is less commonly used outside circumvention circles.


KCP is notably used by the popular mobile game Genshin Impact.


Still, it looks interesting for some use cases, even though it's not fair if it's fully utilized on the internet.


IMHO QUIC is nice, but a disappointment, since it could have been so much more.

Does not handle unreliable messages, still only (multi)streaming, no direct support for multicast, 0-RTT which needs a lot of stuff to be manually done TheRightWay or risk amplification attacks, the (imho) under-researched (and removed) forward error correction, and more.

I just restarted working on what I consider to be the solution to this, federated authentication and a bit more, but $50M is too far to be even a dream since I am not google.


Doesn't QUIC still run over TCP? I thought it was a replacement for HTTP not TCP (Edit: looks like it replaces TCP and HTTP)


QUIC runs over UDP, and provides streams and encryption. HTTP/3 is designed to take advantage of QUIC streams (replacing HTTP/2 streams which were problematic due to TCP head of line blocking).

The RFCs are a bit elaborate so folks interested might want to look at this instead[1], which has one of the RFC authors explaining the basics of QUIC and HTTP/3.

[1] https://www.youtube.com/watch?v=cdb7M37o9sU


It replaces TCP+TLS, and runs multiple streams on the same conn, supports transition from e.g. wifi to ethernet on at least one of the nodes. And since it's over UDP, implementations are mostly in user space. Which is good if you want it now, but not great for performance. IP packets are very small so you gotta have either kernel support for QUIC or batch IO, otherwise it's often CPU limited (yes, really). In addition congestion control is wonky, unfortunately. In my experience (quic-go), it's too shy in the presence of TCP streams, which end up getting more bandwidth. But that depends on the algorithm used, implementation and God knows what else.


I guess you were thinking about another clever name protocol, SPDY :-)

SPDY → HTTP/2

QUIC → HTTP/3


Nope, UDP.


> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

I would affirm that. It's imho true for almost everything in IT tech.

How computers "work" today is just pure madness when you look at it any closer.

Everything's a result of some "historic accidents" back in the days, and from that the usual race to the bottom caused by market powers.

Nobody is willing to touch any of the lower layers no matter how crazy they are from today's viewpoint. We just shovel new layers on top to paper over the mistakes of the past. Nothing gets repaired, or, what would actually be more important, rethought from the ground up in light of new technological possibilities and changed requirements.

I understand from the economic standpoint how this comes. But I'm also quite sure we didn't make any fundamental improvements in the last 50 years of computing.

That's a very bad sign when everything in a field that's not even really 100 years old has been frozen in time for 50 years because everything's so fragile and complex that fundamental changes aren't possible. This looks like a textbook example of a house of cards…

Given how vital IT tech is to modern life I fear that this will crash at some point in the worst way possible.

And even if it won't crash, which I really strongly hope, we will never have nice things again as nothing of the old rotten things can be reasonably changed.


> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues

That's virtual networking, but that introduces latency if it's not well configured.

> But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf

Not really, assuming you have the right fabric, it's nowhere near as hard as that. Plus you seem to be forgetting that there is more to the network than TCP. There is a whole physical layer that has lots of semantics that greatly affect how easy it is to debug higher levels.


> We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

This violates the principle of encapsulation that the entire field of networking is based on, not to mention being a massive security hole.


I’ll defer to experts on the network-layer problems, but I’m not sure what you see as the problem with converging on HTTP. It’s awkward and inelegant, but as a backend application developer I never feel like it gets in my way.


> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols.

Nobody forces them though. It would be much easier to publish a standard port number mapping than to develop a (or multiple) new protocols. Now you just need to motivate people to use it.


At last hopefully there is light at the end of the tunnel. Big question for me is who is going to build it?


I get where this is coming from, but no. We don't need to replace TCP in the datacentre.

Why?

because for things that are low latency, need rigid flow control, or other 99.99% utilisation cases, one doesn't use TCP. (Storage, which is high throughput, low latency and has rigid flow control, doesn't [well, ignore NFS and iSCSI] use TCP)

Look, if it really was that much of a problem then everyone in datacentres would move to RDMA over Infiniband. For shared memory clusters, that's what's been done for years, but for general purpose computing it's pretty rare. Most of the time it's not worth the effort. Infiniband is cheap now, so it's not that hard to deploy RDMA[1] type interconnects. Having a reliable layer 2 with inbuilt flow control solves a number of issues, even if you are just slamming IP over the top.

shit, even 25/100gig is cheap now. so most of your problems can be solved by putting extra nics in your servers and have fancypants distributed LACP type setups on your top of rack/core network.

The biggest issue is that it's not the network that's constraining throughput, it's either processing or some other non-network IO.

[1]I mean it is hard, but not as hard as implementing a brand new protocol and expecting it to be usable and debuggable.


The reality of today's large datacenters is that almost all of them have almost all of their traffic on TCP unless the owners of the datacenter have made a conscious effort to not use TCP. The highest-traffic applications, usually databases and storage systems, pretty much all use TCP unless you are buying a purpose-built HPC scale-out storage system (like a Lustre cluster). Most people who build a datacenter today use databases or object stores for storage, not Lustre or dedicated fiber channel SANs. On top of that, pub/sub systems all use TCP today, logging tends to be TCP, etc.


Fibre channel is dead, long live fibre channel.

I agree a lot of things are on TCP, but I don't think its a massive problem, unless you are running close to the limit of your core network. And one solution to that is to upgrade your core network....

Failing that, implementing some load balancing/partitioning systems to make sure data-processing affinity is best matched. This is the better solution, because it yields other advantages as well. But it's not the easiest, unless you have a good scheduler.


I will also add that one of the big problems with TCP is that it is impossible to load balance without knowledge of the L4 protocol. You can't load balance a byte stream. That means writing your own load balancer unless you want to also accept http overhead.


What is supposed to be a good scheduler?

Genuine question as I (as a software dev) have no clue how modern DCs are built in detail.


Its very much down to your workload and how you want it to work.

Short answer: k8s/fargate/ECS/batch will do what most people want. Personally I'd steer clear of k8s until you 100% need that overhead. Managed services are ok.

Long answer:

K8s has a whole bunch of scheduling algorithms but it's a jack of all trades, and only really deals with very shallow dependencies (there are plugins but I've not used them). For < 100 machines it works well enough (by machine I mean either a big VM or physical machine). Once you get more than that, K8s is a bit of a liability: it's chatty, slow and noisy. It also is very opinionated. And let us not start on the networking.

Like most things there are tradeoffs. Do you want to prioritise resiliency/uptime over efficiency? Do you want to have batch processing? Do you want it to manage complex dependencies (i.e. service x needs 15 other services to run first, which then need 40 other services)? Are you running on unreliable infra (i.e. spot instances to save money)? Do you need to partition services based on security? Are you running over many datacentres and need the concept of affinity?

More detail:

The scheduler/dispatcher is almost always designed to run the specific types of workload that you as a company run. The caveat being that this only applies if you are running multiple datacentres/regions with thousands of machines. Google have a need to run both realtime and batch processing. But as they have millions of servers, making sure that all machines are running at 100% utilisation (or as close to it as practical) is worth hundreds of millions. It's the same with Facebook.

Netflix I guess has a slightly different setup as they are network IO orientated, so they are all about data affinity, and cache modelling. For them making sure that the edge serves as much as possible is a real cost saving, as bulk bandwidth transfer is expensive. The rest is all machine learning and transcoding I suspect (but that's a guess)


Thanks for the reply!

Not really an answer to the scheduler question, but at least it mirrors some of my experience.

That K8s is something to avoid, and that it does not scale, is a known (at least to me).

But that doesn't answer what people would put on the metal when building DCs…

I was not asking out of the perspective of an end-user. I was asking about (large) DC scale infra. (As dev I know the end-user stuff).

As I see it: You can build your own stuff from scratch (which is not realistic in most cases I guess), or you can use OpenStack or Mesos. There are no more alternatives at the moment I think, and it's unlikely that someone comes up with something new. OTOH that's OK. A lot of people will never need to build their own DC(s). For smaller setups (say one or two racks) there are more options of course. (You could run for example Proxmox and maybe something on top).


Sorry yeah, I didn't really answer your question.

Here is a non-exhaustive list of schedulers for differing use cases:

https://slurm.schedmd.com/documentation.html << mostly batch, not entirely convinced it actually scales that well

https://www.altair.com/grid-engine << originally it was sun's grid engine. There are a number of offshoots tuned for different scenarios.

https://rmanwiki.pixar.com/display/TRA/Tractor+2 << thats for the VFX crowd

https://www.opencue.io/ << open source which is vaguely related to the above

https://abelay.github.io/6828seminar/papers/hazelwood:ml.pdf << Facebook's version of airflow. It sits on top of two schedulers, so it's not really a good fit. I can't find what they publicly call their version of the Borg

I'm assuming you've read about borg

As you've pointed out mesos is there as well.


Cool! Thanks! That's a lot of stuff I didn't hear about until now.

It's really nice that one can meet experts here on HN and get free valuable answers from them. Thank you.


You're missing the fact that Stanford is the farm team for Google and Google is hyperscale. At scale, your "just spend more money" solutions are in fact more expensive than creating a new protocol. And like k8s, the new protocol can be sold to startups so they can "be like Google".


You're missing the point that maybe, just maybe, I'm part of a team that looks after >5 million servers.

You might also divine that while TCP can be a problem, a bigger problem is data affinity. Shuttling data from a next door rack costs less than one that's in the next door hall, and significantly less than the datacentre over. With each internal hop, the risk of congestion increases.

You might also divine that changing everything from TCP to a new, untested protocol across all services, with all that associated engineering effort, plus translation latency, might not be worth it. Especially as now all your observability and protocol routing tools don't work.

quick maths: a faster top of rack switch is possibly the same cost as 5 days engineering wage for a mid level google employee. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol, and have it stable and observable?

As a side note "oh but they are google" is not a selling point. Google has google problems half of which are things related to their performance/promotion system which penalises incremental changes in favour of $NEW_THING. HTTP2.0 was also a largely google effort designed to tackle latency over lossy network connections. which it fundamentally didn't do because a whole bunch of people didn't understand how TCP worked and were shocked to find out that mobile performance was shit.


> a bigger problem is data affinity

For future, please write about how typical cloud customers can design for better data affinity.

Or is it just handled by the provider?

FWIW, at a prev gig, knowing nothing about nothing, I finally persuaded our team to colocate a Redis process on each of our EC2 instances (alongside the http servers). Quick & dirty solution to meet our PHB's silly P99 requirements (for a bog standard ecommerce site).

Apologies for belated, noob question.


> quick maths: a faster top of rack switch is possibly the same cost as 5 days engineering wage for a mid level google employee. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol, and have it stable and observable?

So your 5M machines / 40 in the best case of all 1U boxes is 125K TOR-switch-SWE-week-equivalents / 52 weeks in a year which comes to 2K SWE-years to invest in new protocols, observability, and testing. Google got to the scale they are by explicitly spending on SWE-hours instead of Cisco.


> explicitly spending on SWE-hours instead of Cisco.

I strongly doubt that TOR switches are cisco


but to answer your further case. The point is you don't need to replace all the TOR switches. Only the ones that deal with high network IO.

to change protocol you need gateways/loadbalancers either at the edge of the DC just after the public end points, or in the "high speed" areas that are running high network IO. For that to work, you'll need to show it's worth the engineering effort/maintenance/latency.


Google does not use K8s internally.

They never did, they won't ever do that!

K8s does not scale. Especially not to "Google scale".

First step to "be like Google" would be to ditch all that (docker-like) "container" madness and just compile static binaries. Then use something like Mesos to distribute workloads. Build literally everything as custom made on purpose solutions, and avoid mostly anything off the shelf.

"Being like Google" means not using any third party cloud stuff, but build your own in-house.

But this advice wouldn't sell GCP accounts. So Google does not tell you that. They tell you instead some marketing balderdash about "how to be like Google".


AWS is True HyperScale. Even more so than Google. And yet their spend-more-money-on-hardware solution seems to work fine.


Do we know for a fact that AWS does or doesn't use TCP on their backend? https://news.ycombinator.com/item?id=33402364 leads me to believe Google doesn't.


God, I love it when the talk turns hyper-technical around here, and the Jedi masters turn up.


The paper explicitly addresses Infiniband.


Not really. They conflate InfiniBand with RoCE, which, given that they have different congestion-control semantics, I'd say is a bit of a whoopsie.

If they are using RoCE, are they using DCB to avoid loss (well, make it "lossless")? The paper implies otherwise.


For those who don’t know, RoCE is somewhat of a failure in the marketplace right now.


For those who don't know, RoCE = RDMA over Converged Ethernet.

* https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

> RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by the hardware which enables lower latency, higher throughput, and better performance compared to software-based protocols.

* https://docs.nvidia.com/networking/pages/viewpage.action?pag...


IB does not run TCP/IP by default. You can either run TCP over IB, which has a performance penalty, or you can run it directly in Ethernet mode, which is something completely different.


> The biggest issue is that its not the network that's constraining throughput, it either processing

To be fair the paper talks a bit about how TCP makes multithreading slower compared to a message based system.


> Storage, which is high throughput, low latency and has rigid flow control, doesn't [well ignore NFS and iscisi] use TCP)

So storage doesn't use TCP, except for the protocols that are actually used, which do use TCP?


Depends on what you are using. For connecting block stores, you'll use some sort of fabric: Fibre Channel, SAS, NVMe over something or other.

If you are using GPFS, then you can do stuff over IB, but I don't know how that works. Lustre I imagine does lustre things over RDMA.

For everything else, NFS all the things. pNFS means that you can just throw servers at the problem and let the network figure it out.

But again, if IO speed is critical, you move IO over to a dedicated fabric of some sort. For most things NFS is good enough. (Except databases; it's possible but not great. But then, depending on your Docker setup, you might be kneecapping your performance because overlayfs is causing IO amplification.)


> The data model for TCP is a stream of bytes. However, this is not the right data model for most datacenter applications. Datacenter applications typically exchange discrete messages to implement remote procedure calls

This isn't just a datacenter problem. Every single network protocol I've ever created or implemented is message based, not stream based. Every messaging system. Every video game. Every RPC transport.

But, because we can't have nice things, message framing has to be re-implemented on top of TCP in a different, custom way by every single protocol. I've basically got message framing-over-TCP in muscle memory at this point, in each of the variants you commonly see.
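For anyone who hasn't written it a dozen times: the version in my muscle memory is some variant of a length prefix, roughly this sketch in Python (the function names are mine):

    import struct

    def send_msg(sock, payload: bytes) -> None:
        # 4-byte big-endian length prefix, then the message body.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exact(sock, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def recv_msg(sock) -> bytes:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)

The other common variants are delimiter-based framing (newline/CRLF) and fixed-size headers with a type field; every protocol ends up picking one.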

The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

In my opinion, the real problem is the idea that "everything is a file". Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.


SCTP [1] is there to provide a reliable message based protocol. And it does work inside a datacenter. The issue is outside the datacenter: it doesn't work reliably across the Internet due to middle boxes dumping anything not TCP or UDP...

But inside a controlled environment like a datacenter, it works. It's been used in the telecommunication world to carry control messages in the radio access and core networks for example. So it's been tested at scale for critical applications.

[1] https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...


> The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

No, HTTP/2 and QUIC do not change the semantics of HTTP.

Also, you can have endless streams with HTTP/1.1: just use chunked encoding for POST/PUT request bodies, Range: bytes=0- plus chunked encoding for GET responses, and chunked encoding for POST response bodies. In HTTP/2 there's only the equivalent of chunked encoding -- there's no definite content length in HTTP/2.
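To make the framing point concrete, a chunked HTTP/1.1 body is literally a sequence of length-prefixed messages on the wire. A rough sketch (not a full HTTP implementation):

    def encode_chunk(data: bytes) -> bytes:
        # "<hex length>\r\n<data>\r\n"; a zero-length chunk terminates the body.
        return b"%X\r\n" % len(data) + data + b"\r\n"

    body = encode_chunk(b"hello ") + encode_chunk(b"world") + b"0\r\n\r\n"
    # b'6\r\nhello \r\n5\r\nworld\r\n0\r\n\r\n'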


Correct me if I’m wrong, but doesn’t h2 still break up requests and responses into smaller message frames in order to do multiplexing?

Those message frames are what I’m talking about - as I understand it, they are, yet again, a message oriented protocol layered on top of tcp.


You will always need _some_ framing if you're dealing with bulk data.

If it's not bulk, then you don't need framing if everything can fit into one datagram / whatever transfer unit provided by the transport, but the transport itself will need some framing, especially if it will need to support any kind of fragmentation.

If it's not bulk and it doesn't fit in a datagram / whatever transfer unit provided by the transport and the transport doesn't do fragmentation, then you have to do framing yourself, and then you have a sequencing problem, and so on, and you quickly re-invent parts of TCP but at the application layer.

Basically, it seems inescapable that the Internet is based on packets, and that packets are limited in size, and so application protocols have to be smeared onto packets.

Things are only ever trivial when you're doing request/response protocols with always- or mostly-small requests and responses. The moment you need anything that doesn't fit in the path MTU minus overhead, you need framing.

So I don't think that an octet stream abstraction is quaint and obsolete.


Yes, and even HTTP/1.1 does that. I forget if telnet does something similar, but I suspect it must because it can send control data. The FTP protocol and the BSD r-commands might be the only ones that truly do no additional framing (FTP for data connections, the r-commands post-login).


Once chunked encoding is in the picture, even HTTP/1.1 sends messages, not streams, under the hood.


Correct.


I don't think so, or we disagree on the meaning of words. Chunking is not message-oriented in the same sense that UDP is.

With chunking, you basically just insert markers into the stream; this does not imply by any means that the stream has been split into multiple messages - as a matter of fact this is taken care of by the lower levels of the client/server, and the middle/higher levels certainly don't want to deal with it. It is only a perverted solution to the problem of dynamically generated "messages" (mainly HTML pages), which has been further perverted to implement gruesomely message-oriented "protocols" (Comet and others, IIRC).

UDP, on the other hand is based on datagrams. They can be split into smaller packets on the wire but they are reassembled at the network stack level so no program can even see it happened unless they insist on it.

Websocket is much closer to a message oriented protocol over a streaming pipe than HTTP chunking is.


> I don't think so, or we disagree on the meaning of words.

Well, when there's four people in the conversation, that happens.

u/josephg's complaint at the top of this thread is that people use TCP but still have to add framing in their application protocols. u/josephg said something to the effect of how few protocols do no framing and mentioned HTTP/1, but even HTTP/1.1 w/ chunked transfer encoding adds framing, and even HTTP/1.0 w/ definite content-length also has framing (CRLFs) for the request headers themselves, and effectively frames bodies with CRLF at the start and EOF at the end.

Chunks are definitely not datagrams, just as TLS records aren't either, and just as TCP segments aren't either. But they have framing, which u/josephg complained about.

Framing of some sort is unavoidable. My point, besides the inevitability of application-layer framing, is that a datagram- or message-oriented transport won't make things trivial for apps anymore than TCP did.


> In my opinion, the real problem is the idea that "everything is a file".

Files are just an indexed list of bytes that can represent anything. Think of them as objects.

> Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.

I don't understand this. A stream of messages is a stream of bytes. It's bytes all the way down.


Because most protocols need to handle message loss, with retransmission and proper ordering? And we haven't started to talk about congestion yet… TCP is useful, and while I'd like to see a message-based protocol get rid of one (or more) of those constraints and go with a custom design, I feel like they'd be re-implementing the features in the end, because these are very useful properties to have…

Edit: the proposal in the article is actually quite sensible, but requires redesigning your apps… And I'd like to see how it performs: TCP is a hugely optimized beast (when it works well), with hardware offloads, kernel optimizations, etc.


[flagged]


Very peculiar spam. Does anyone have a theory what the motive is?


Modern equivalent of a Numbers Station https://en.wikipedia.org/wiki/Numbers_station


Even TCP itself uses discrete messages under the hood :-)


Hah yes; although TCP segments can be arbitrarily refragmented and rejoined as they travel through the network before reaching your destination.

If you ever play an indie game which seems unusually janky over wifi, it's probably because the code isn't correctly rejoining fragmented network packets at the application level. Ethernet is remarkably well behaved in this regard. Wifi is much better at shaking out buggy code.


Finding the right abstraction isn't easy in a network stack. TCP's is less useful for application logic, but it reflects the way the protocol works internally: a bunch of bytes goes in on one side, and bytes stream out on the other side in chunks whose size depends on congestion, the physical layer and other factors. A message-based API either hides these facts or leaks them, depending on how you look at it.
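You can see that tension with a loopback stream socket pair (same stream semantics as TCP): two sends on one side will typically come out as a single read on the other, and the boundaries are gone.

    import socket

    a, b = socket.socketpair()     # a connected stream pair, like a loopback TCP connection
    a.sendall(b"first message|")
    a.sendall(b"second message|")
    print(b.recv(4096))            # typically b'first message|second message|' in one read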


What’s the issue with fitting messages in streams?


One of the issues I can think of is head-of-line blocking [0].

If you're sending messages of different priorities over the same channel and a low-priority message is lost, high-priority messages behind it will have to wait until the low-priority message is properly re-transmitted.

[0] https://en.wikipedia.org/wiki/Head-of-line_blocking


Yes, but this need happen only when your data center network is congested, which is hopefully rare and is relatively cheap to fix by adding capacity. And in congestion cases you need TCP's back off ability.

Head-of-line blocking is also what keeps messages in order. Making messages (sometimes) arrive out of order would drastically increase implementation complexity for a lot of apps.


I don't bother implementing that crap anymore. I just use websockets, which are message oriented out of the box.


What do you build that requires implementing or designing network protocols?


Multiplayer video games, realtime updating webpages, database bindings, sync protocols (I play around with CRDTs a lot), p2p distributed systems, whatever really!

Implementing a wire protocol seems to come up about once every couple of years. And for context, I've been programming now for about 30 years.


A few years ago (when QUIC was coming out) I was developing the theory of a new transport/encryption/authentication protocol. The focus was as much on transport as on the built-in federated authentication.

There was not much interest in the field and I had a lot of the theory and formal proofs, but no implementation.

This month I found a cofounder and we are reordering a lot of the information and presentation, we should start asking for funds in more or less a month.

I still believe my solution to be much more complete than anything in use today (again: on paper), but since there seems to be some interest today, I'll ask here: can anyone suggest some seed funds to check out for a startup? We will be based half US, half EU.

For more details, fenrirproject.org (again: old stuff there, ignore the broken code)


Very interesting stuff, but not going to lie, I have trouble imagining how you'd build a viable company around this kind of thing.

Infrastructure/protocol companies are always going to be a very tough sell (Sandstorm) unless there's a compelling freemium model like with GitLab, Cloudbees, Sentry, etc.


The infrastructure/protocol will need to remain open, since this kind of thing works better the bigger the user base is. It will probably spin off into its own foundation as soon as it is viable.

The income will come from another project built directly on this, on managing the domain and its users/devices, plus other stuff, mainly for businesses.

I don't see much need to go into details right now, but we have a clear distinction in mind between what is infrastructure and what will be the product.

Again, still in the housekeeping phase, just looking for potential future funds once we finish this phase


Why don’t you actually build gasp a prototype before asking for money


Yeah, thank you for the kind comments about not needing money (aka: my time has no value) and for asking me, with irony, to build the prototype.

As I said, the project was started a few years back, and since I did not have the time to work on it, maybe it means my life does not give me the time and money to build this on the side.

But I'll always find it funny how half of the people go "you need to have solid theory proofs before" and the other half goes "where is the working code".

As I said, we just started some housekeeping and are not ready to start, and as many point out, it's hard to make money on infrastructure. I know; I did not ask how to make money on this. The idea is to keep the base as open as possible and make money on other services built on this. I was only asking for pointers to funds interested in tech loosely connected to this. And if they don't like our current state or something else, fine, no need for you to do their job, and in a witty way, too.


> yeah, thank you for the kind comments about not needing money (aka: my time as no value) and asking me to build the prototype with irony

The world (read: the investment community) doesn’t care about your time though. If you are going to pitch an infrastructure project that should work in software without a working demo, you’re going to have to take a huge haircut on valuation.

If you want to be successful with a theory and proofs, join academia and publish them. If you want to get investors for infrastructure, make something that works.

Academia churns out “protocols on paper” every year that go absolutely nowhere. You need to differentiate yourself if you’re looking for more than a research grant.


Doubtful anyone is going to throw money at you for some random idea, that you won’t even invest some of your free time on. You clearly don’t value it, so no one else will.


Depending on the scope, it's not always possible to self-fund while starting a project.


I'm just going to mention that NATS can be used as a general purpose transport, with encryption and a surprisingly capable authentication and authorization system. It also supports federating into clusters and superclusters. NATS has also come a long way in the last several years, in case anyone is thinking of some experiences they had years ago when it didn't have all these features.

The question would have to be "what does your idea/project offer that NATS doesn't already offer?"

I have no affiliation with NATS, but I wish that people were paying more attention to it. It solves a lot of problems people have.


the thing you want to make requires ZERO money.


"We hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of datacenter networks".

Flow-consistent routing is the constraint that packets for a given TCP 4-tuple get routed through the same network path, rather than balanced across all viable paths; locking a flow to a particular path makes it unlikely that segments will be received out of order at the destination, which TCP handles poorly.
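Conceptually the path selection is just a hash of the flow identifier, something like this sketch (real ECMP hashing happens in the switch ASIC, usually on the 5-tuple):

    import zlib

    def pick_uplink(src_ip, src_port, dst_ip, dst_port, num_uplinks=8):
        # Every packet of a given flow hashes to the same uplink, so two
        # elephant flows can land on (and congest) the same link while
        # other links sit idle.
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
        return zlib.crc32(key) % num_uplinks

    pick_uplink("10.0.0.1", 40123, "10.0.1.9", 443)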


Or, by sending the traffic over all routes, there is no way to keep one server from monopolizing all traffic, because each route is oblivious to the stress currently being experienced by all its peers. It has to set a policy using local data, not global data.

The usual failure mode for clever people thinking about software is taking their third-person omniscient view of the system status and thinking they can write software that replicates what a human would do in that situation. We are still so very far from human-level intuition and reasoning.


Ultimately one server cannot inject more than one link worth of traffic (e.g. 100 Gbps) into the network which is a tiny fraction of total capacity. Researchers have gotten really good results with "spray and pray" for sub-RTT flows combined with latency and queue depth feedback for multi-RTT flows.


Spray and pray sounds like a reasonable fit for UDP, no?

We've had these sorts of bottlenecks before, and they didn't last. It's always possible something fundamental changed, but it's also possible that we are doing something wrong at the motherboard or OS level, and adopting new solutions puts us right back in that space where a couple of servers can easily saturate a network.

If a network card can move data as fast or faster than the main memory bus on a computer then what are we even doing? Should we be treating each subsystem as a special purpose computer and turn the bus into a network switch?


You just described the motivation behind infiniband (and RDMA in general)


The network is the computer™


Well, I mean yeah, that silly slogan is definitely rattling around in my head.


And we could totally construct systems that feed some approximation of global internet state into local routing decisions. But that might devalue some incumbent player's position in the market (or create a new privileged set of players), so even if we made a POC, it wouldn't get adopted.


This is true, and the congestion mentioned here was subtle and not called out - typically flows are handled in a stateless manner by load balancers that hash on some set of MAC/IP/PORT features of the packet. This is where congestion occurs and the paper mentions it here:

    All that is needed for congestion is for two large flows
    to hash to the same intermediate link; this hot spot will persist 
    for the life of the flows and cause delays for any other
    messages that also pass over the affected link.
It makes logical sense, but I'd love to see the evidence for this.


“Elephant” flows are definitely a thing.

It all depends on the application and the overall use of the network.

With sufficient flows and a mix of sizes it'll still tend to even out. But if you've got significant high-throughput, long-lived flows this is definitely something you might hit.


I wrote a summary of one of the approaches for replacing TCP mentioned in the paper (called Homa) here: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...


Jumping to the end:

> TCP is the wrong protocol for datacenter computing.

> Every aspect of TCP’s design is wrong: there is no part worth keeping.

I cannot disagree and Ousterhout argues well.

> Homa offers an alternative that appears to solve all of TCP’s problems.

I'm well behind the curve on protocols and now I have something to learn more about.

> The best way to bring Homa into widespread usage is integrate it with the RPC frameworks that underly most large-scale datacenter applications.

More or less the case for whatever replaces TCP in a tight computing warehouse setup.


> Every aspect of TCP’s design is wrong

The driver of most of a global network of computers, one that has been wildly successful beyond the dreams anyone had before it was real, probably deserves a better deal than “every aspect is wrong”. It has worked fantastically well, and chasing the long tail of performance improvements isn't equivalent to determining that what has gotten us here is wrong.


You're cherry-picking an interpretation of a single sentence, when it should be read in the context of the preceding one: Ousterhout says every aspect of TCP's design is wrong for (modern) datacenter computing. He's not saying bad decisions were made at the time it was designed, nor even that it's badly designed for other use cases today.

The first few paragraphs of the article give even more context:

> The TCP transport protocol has proven to be phenomenally successful and adaptable. [...] It is an extraordinary engineering achievement to have designed a mechanism that could survive such radical changes in underlying technology.

> However, datacenter computing creates unprecedented challenges for TCP. [...] The datacenter environment, with millions of cores in close proximity and individual applications harnessing thousands of machines that interact on microsecond timescales, could not have been envisioned by the designers of TCP, and TCP does not perform well in this environment


My Distributed Computing professor said, “now we are going to discuss why Ethernet is a terrible protocol but we use it anyway.”

Like democracy, everything else we’ve tried is even worse.


"Specifically, Homa aims to replace TCP, which was designed in the era before modern data center environments existed. Consequently, TCP doesn’t take into account the unique properties of data center networks (like high-speed, high-reliability, and low-latency). Furthermore, the nature of RPC traffic is different - RPC communication in a data center often involve enormous amounts of small messages and communication between many different machines."[0]

0: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...


Everything about the protocol being wrong for the specific case of machines directly wired to one another over a high-speed, reliable network is not an admonishment of the protocol in general. And the protocol, being an abstract concept, doesn't have feelings to hurt.


"Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks."

I was about to rant that Prof. Ousterhout should just deploy some of his students and get that transport - RPC integration done and prove out his point. But, then I tried to look for it first and found this:

https://www.usenix.org/system/files/atc21-ousterhout.pdf

Has anybody tried it in an actual data-center?


> … it should be possible to bring it into widespread usage by integrating it with RPC frameworks.

That's a powerful word, “should”. Much of the software in the datacenter is almost as old as TCP itself, in whole or part. Difficult but working, it will continue to linger unless something more than six letters of aspiration is applied to reimagining and rebuilding that considerable bulk.


It's theoretically much easier to introduce a new transport inside of a DC, since you're inside the network perimeter and you'll generally have control over policy-based filtering decisions.


The context here is TFA advocating for use of higher-level message-oriented frameworks instead of raw socket APIs so that they can use a non-TCP transport without changing the application code.


Within a DC/cloud provider network. This isn’t about protocols you’d see traversing the public internet until the cloud providers see value in a protocol and start to push it out through IETF (eg see QUIC which was done by a company that owned both the browser and the data center). If there’s a “small” SW improvement that lets you use your HW 10x more efficiently that’s totally worth it given the end of scaling. You either invest in SW or pay for custom ASIC development. You’re not getting a free lunch anymore by just waiting a few years and getting that 10x gain for “free”.


Well, the key would be to develop and deploy Homa in a DC and test an implementation at scale. If it actually ameliorates the perceived shortcomings of TCP that make nothing in TCP worth keeping, as this author says, then cool. My only complaint with issues like this is the cost of implementation. Someone has to pay to build a DC around it, or increase the cost of maintenance for several years to a decade while supporting two completely different and incompatible networks.


The Homa protocol can be deployed on the basis of existing switches (not all of them, but some of them will play well). See also https://github.com/PlatformLab/HomaModule .

In addition, a closed environment and the popularity of SDN in large data centers are significant cost-reducing factors compared to a typical IPv6 deployment.


Dumb question, there's no way to talk to a PC over the Internet with Homa, right? Since our home ISPs + routers are all only doing UDP/TCP over IPv4/IPv6? Homa is mainly for "LAN"?


Yes.


Will we ever see the day where computers over the Internet talk to each other over something other than UDP/TCP over IPv4/IPv6?


Given Google and Apple's efforts to de-ossify the Internet, maybe. But no time soon.


Hmm I didn’t really read about it.

The proposal here is to replace IP as well as TCP?

Good luck pulling that off.


We have been testing out various protocols to overcome (in our case) TCP head-of-line blocking, using these protocols:

SRT: https://github.com/Haivision/srt (C++ wrapper https://github.com/andersc/cppSRTWrapper)

RIST: https://code.videolan.org/rist/rist-cpp

KCP: https://github.com/Unit-X/kcp-cpp

We wrap all data in a common container format https://github.com/agilecontent/efp to decouple the data from the transport.

Yes, the above solutions are media-centric, but they can be used for almost any arbitrary data.

The protocols are not 'fair', so starvation may happen and must be handled at the application level.

/A


For those unfamiliar with the author.

https://en.wikipedia.org/wiki/John_Ousterhout

He is probably most famous for having created the Tcl language and Tk GUI library. He also worked on the Sprite distributed operating system, the Magic VLSI design tool, and a bunch of other things.


Also, more recently, one of the inventors of the Raft consensus algorithm.


Hmm, haven't read the paper yet, but I immediately did "Ctrl+F sctp" and didn't find anything.

I know that SCTP was the next-generation stream-oriented protocol designed to fix the out-of-order message problem in communications, as well as a whole bunch of connection issues (4-way handshake instead of 3-way for better open/close; a datagram-oriented, in-order stream, so that every packet has a proper size involved; etc. etc.)

As far as I know, SCTP should solve all the requirements in section 2 of this paper (except "load balancing", which might be solved by lower-level protocol sharing of some kind?). So it's weird to not see SCTP discussed.

--------

Yeah, SCTP ain't popular, but these exact sets of problems/requirements and issues with TCP have been known for decades. SCTP is also a decades-old protocol (though not as old as TCP), and is the most obvious solution to the problem (and already supported by Linux).
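If anyone wants to kick the tires: on Linux the kernel SCTP stack sits behind the normal sockets API, so a one-to-one style SCTP server looks almost exactly like a TCP one. A rough sketch (assumes the sctp kernel module is loaded and your Python build exposes IPPROTO_SCTP):

    import socket

    # One-to-one style SCTP socket; SOCK_SEQPACKET + IPPROTO_SCTP would
    # give the one-to-many, message-oriented style instead.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
    srv.bind(("0.0.0.0", 9999))
    srv.listen(16)
    conn, addr = srv.accept()
    data = conn.recv(65536)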


23 years ago I sat in a meeting with Sun, Intel, Mellanox, and 1 or 2 others. In that meeting we discussed putting an RDMA interface on individual hard drives, trays of RAM, CPUs, and other more exotic devices (like battery backed RAM, no conventional SSD in those days of course). You’d install RAM 1 42U rack at a time, disks likewise, CPUs in another rack and so on. All partitioned, controlled, managed, and of course billed-for by a “data center OS”.


Disaggregation costs an absolute fortune. The network is 10% of the total datacenter cost and the network does not carry memory and PCIe traffic. If you make the network 10x faster to carry that, now it's 90% of the cost.


I was in the room for similar discussions as part of the OpenCompute project, which was around remote IO and resource disaggregation. There are systems like this, and they may make sense for certain use cases, but generally speaking hyperscalers today are built around virtualization which doesn't align well to this model.


The network is the computer....


Out of order delivery is fine in TCP within the window. It might be inefficient but it's not impossible; reassembly could be moved to userspace if a userspace TCP stack were used.

I have no problem with alternates to TCP in the DC with a crossbar fabric and far less loss, seems sensible.

I wonder how it would play with QUIC and the session like behaviours now emerging.


> • In-order packet delivery

This is a bit disingenuous, since it's not the wire protocol but the kernel API that maintains the in-order abstraction. With jumbo packets you can still push a mountain of data without tripping up on “in order packet delivery”.

As developers we like this in-order delivery to userspace because it vastly simplifies the code. We make up for the inefficiencies by processing dozens to hundreds of streams in parallel. We aren't going to give that up just because the wire protocol changes.


Is it not considered a protocol violation to deliver out of order segments to the upper layer? That seems the same to me as abusing it to not require retransmits either.

Remember, middle boxes can fully adhere to the TCP standards and terminate your TCP connection and enforce ordering. If you notice that, you’re not really following the protocol, you’re just using its header format.


Yes, but.

My read of the room is that he's conflating wire level and kernel level problems with userspace problems, which is a no-no because if Berkeley userspace has latency problems, we can deal with that separately from undoing 40 years of tribal knowledge in the process.

In the video he says that he was seeing 3x of theoretical latency to userspace that he fixed with Homa, but similar efforts to fix Berkeley Sockets saw 'almost a 2x' improvement which he deemed insufficient. A question I'd like to see answered over the next couple years is what IO APIs will be the most efficient in a world where io_uring is everywhere.


There was a time when out-of-order packets triggered congestion handling in TCP stacks, which drastically reduces performance. This is where the concern comes from. I think it's a bit outdated though; I think the newer schedulers ignore out-of-order delivery.

I've also seen problems on some embedded stacks, but that could easily be argued that the implementation is wrong. But I've seen things like credit card terminals break due to packet reordering.


I can't fault Ousterhout for writing in support of a new(ish) idea, but his language here went to "forbidden" when in fact it's just "strongly disliked".

TCP the protocol knows how to re-assemble out of order. What I think he's doing is making reassembly a higher-layer task, outside of the protocol, or else providing some mechanism in user process space, amenable to threading.

I can believe an async model of "tell me when this is complete" would work well with a bitmap/bloom filter type gate on what "has to be complete" to proceed.

I like his writing. I was a fan of tcl/tk and used expect heavily back in the past.


IBM AIX's TCP can either do selective ACKs or handle out-of-order packets (not both), which we were told when a firewall started reordering packets. It simply dropped out-of-order packets and therefore triggered congestion handling.


The assumptions here seem to be no speed of light lag, no packet loss, and no security. The only problem is congestion. That's more like the interconnection fabric of a single-purpose supercomputer than a general-purpose data center. Which is probably why they mention Infiniband, a hardware interconnect for supercomputers, so much.

Would this break down if you had to start talking to a remote machine in another data center? That's how outages and overloads are handled, after all.

This is an interesting idea, but it's for a relatively narrow use case.


It’s a narrow use case sure, but at the same time it’s an important one to a number of companies with significant engineering resources.


There's a little misleading point on page 2, or just a mistake. It says "Driving a 100 Gbps network at 80% utilization in both directions consumes 10–20 cores just in the networking stack" and it cites Google's Snap paper from 2019. But the Snap paper quite clearly says nothing like that. It says that Snap can drive a 100 Gbps NIC to 80 Gbps with just 1.05 cores (Table 1) and that the whole-machine CPU load at 80 Gbps in an RPC benchmark is 4 CPUs per side (Figure 6(b)).

Aside from that I completely agree that TCP is trash.


Yes that stat stood out for me too and I was wondering how to actually test this without breaking anything in the process.


> Yes that stat stood out for me too and I was wondering how to actually test this without breaking anything in the process.

DPDK has been doing just that for quite some time now. Perhaps you can try that and see?


Thanks, I will look into it, but at first glance their testing still seems like it's under lab conditions. Thanks for the tip!


At a glance, this sounds very similar to AWS' SRD protocol [1] - I'm curious how they compare but I see no mention of SRD in this paper.

[1] https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c...


Strange how this is so far down the list but a more recent "nobody in the industry wants to invest in making things better" rant is higher upvoted.

Clearly, AWS knows a thing or two about the datacenter industry, wanted to invest in making things better, and DID invest in making things better - and even published a paper on it.


Sounds very familiar to a lot of ideas that have been proposed as improvements to TCP over the years.


I am not really sure what Ousterhout means when talking "datacenter" but one of the key aspects of TCP networked applications is that it really doesn't matter if they are located in a datacenter, at the edge, in a mobile device, in a Raspberry at home or in your car and, better, they can be more or less moved from one hosting to another. Will this mean that application developers need to work with different network stacks at the same time?


Yes, you would have multiple protocols at some points. If you look at microservice architectures they already have an internal service mesh and external API gateways.


I believe something missed in these discussions is the part pertaining to "the data-center". A data-center is not a technology, it is a grouping of assets. Those assets and their associated services need to communicate not just with each other, but with other assets and services on the internet.

Regardless of what incredible technical solution one creates, it will have to allow for simultaneous existence of current IP protocols in parallel with whatever proposed replacements to exist seamlessly with one another or significant adoption would never occur. A data-center is not an isolated bubble, at least not any more unless one wants to translate said protocols through a single point of success gateway. Should such a replacement ever occur it will have to be done piece by piece until there is nothing left using current IP protocols.

So I believe people should all create their proposed protocols and give businesses a low-friction path to adoption one service at a time. As more applications adopt said protocols, the most popular, most reliable, most performant, least-friction path will likely win, and if successful then at some distant point in the future perhaps most existing IP protocols could be deprecated. As a reminder, each application will have to adopt libraries to speak this protocol and know how to utilize it. There will be a "battle hardening" period to work out the bugs and security controls. All of the network gear in the entire path between data-centers and clients will need OS/firmware/ASIC updates to understand this protocol. Given the transition speed to IPv6 as an example, this could be a very long road.

There is also some discussion of QUIC and SPDY. Those are not new IP protocols; they are new standards within an existing L7 application protocol, HTTP, that still utilize existing L3/L4 protocols. Replacing TCP (and UDP?) in the data-center means a new protocol in /etc/protocols, not one encapsulated in an existing protocol such as protocol 6 (TCP) or protocol 17 (UDP). The network gear and OS on every device in the path will need to understand this new protocol. Tools such as tcpdump and libraries such as libpcap would need to be updated to understand these new protocols before they could even be used in a development environment.

Could it be that I misunderstood the intent and perhaps we just want yet another new L7 application protocol on top of UDP?


> Every significant element of TCP, from its stream orientation to its requirement of in-order packet delivery, is wrong for the datacenter. It is time to recognize that TCP’s problems are too fundamental and interrelated to be fixed;

This seems like a pretty bad way to start a paper. It throws an extremely strong assumption into the room without backing it up with data.

Having worked in the Cloud/Datacenter space for lots of years, I really have a hard time describing any situations where TCP limited the performance of applications and not anything else. It doesn't matter that much if a slightly different networking stack could lower RTT from 50us to 10us if the P99.9 latency of the overall stack is determined by garbage collection, a process being CPU starved, or being blocked on disk IO or another major fault. Those things can all be in the 100ms to 1s region, and are the real common sources of latency in distributed systems.

The main TCP latency problem that I've experienced over the years is SYN or SYN-ACK packets getting dropped due to overloaded links or CPU starvation, and the retry from the client only happening after 1s. Annoying, but one can work around it a bit by racing multiple connections (roughly as in the sketch below). Besides the TCP handshake time there's also another round trip for setting up a TLS connection - sure. But both of those latencies are in practice worked around with connection pooling.
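The racing workaround is roughly this kind of thing (a sketch only; the attempt count and timeout are made up):

    import socket
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def racing_connect(host, port, attempts=2, timeout=2.0):
        # Start a few handshakes in parallel and return the first one that
        # succeeds; a lost SYN/SYN-ACK (and its ~1s retransmit timer) is
        # hidden behind the other attempt. Losers are closed as they finish.
        pool = ThreadPoolExecutor(max_workers=attempts)
        futures = [pool.submit(socket.create_connection, (host, port), timeout)
                   for _ in range(attempts)]
        for fut in as_completed(futures):
            if fut.exception() is not None:
                continue                      # this attempt failed; wait for the others
            winner = fut.result()
            for other in futures:
                if other is not fut:
                    other.add_done_callback(
                        lambda f: f.result().close() if f.exception() is None else None)
            pool.shutdown(wait=False)
            return winner
        pool.shutdown(wait=False)
        raise ConnectionError("all connection attempts failed")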

Speaking of TLS - I can't find a single reference to it in the paper. And talking about datacenter networking without mentioning TLS seems to miss something. Pretty much every security team in a bigger company will push for TLS by default - even if the datacenter is deemed a trusted zone. How does it matter if the TCP connection state is 2kB and the Homa state is less, if my real TCP connection state is anyway at least another 32kB for TLS buffers, plus probably megabytes for send and receive buffers, plus whatever the application needs to buffer?

Last thing I would like to mention is that datacenter workloads are not "just messaging", and the boundary from messaging to streaming is pretty fluid. What happens if the RPC call fetches the content of a 10MB file? Is that still a message? If we treat it as such, it means that the stack needs to buffer the full message in memory at once, whereas with TCP (and e.g. HTTP on top) it can be streamed, with only the send buffer sizes being in memory. What about if it is 1MB? We could certainly argue that some applications just transfer a few bytes here and there, but I'm seriously not sure if I would label those as the majority of datacenter applications. And with the typical practice of placing a lot of metadata into each RPC call (> 5kB auth headers, logging data, etc.) even the smallest RPC calls are not that small anymore.


At first glance I thought he was calling for the replacement of Tcl in the datacenter!

That's not what I'd Expect from John Ousterhout. ;)

https://wiki.tcl-lang.org/page/Expect


Why not, he's had a Raft of good ideas.


Google hasn't used TCP in the datacenter for years. What they use I don't know. But it's even custom switches with custom chips.

My son did work in graduate school on a clean-slate network implementation for the datacenter. Maybe for Google, I don't remember.

One issue I remember they addressed was scheduling bandwidth for VM migration within their datacenter cloud. See, some customer reserves a 'machine' for their services but really they get something like a VM slice of a ginormous machine (multiple TB of memory, 100 cores or whatnot). Each customer gets some of that and thinks it's a machine of their own.

That customer slice shares the larger machine with maybe 10-100 other customers. Then somebody's slice starts to use more resources and has to be moved to a machine with more 'room'. That wants to be fast and seamless. It can be maybe 1TB of stuff. Their slice doesn't want to be interrupted for long. So this machine needs bandwidth that isn't subscribed for the migration. So does the target machine. So does the cloud network. Then all the addresses have to be re-homed.

Another issue: those competing slices need a virtual network adapter. They each think they own one (each is running a copy of linux or whatnot), but it has to be a physically shared and rationed device. All while using the TCP abstraction on a network-adapter abstraction on a driver abstraction, but really on their new network hardware that's actually present on the ginormous machine. This includes all the TCP features plus the bandwidth reservations the cloud needs etc.

So yes it's abundantly obvious that the datacenter needs (has) a new network.


> Google hasn't used TCP in the datacenter for years.

That's absolutely false. I don't have any sources except for having worked at Google from 2013-2022, but it's not like you quoted any sources either, so...

There's a reason why Google is still releasing stuff like TCP BBR (2017).


Well, the first link when googling 'google datacenter hardware' is Google's article on how they don't use standard TCP hardware or software in their datacenters. But I guess that was too much to ask...


There's a big difference between "hasn't used TCP in the datacenter for years" and "don't use standard TCP hardware or software in their datacenters". Google uses TCP with non-standard configuration, but they still use TCP.


I can't seem to find the same search result with this query. In fact, searching for "google" "standard TCP" doesn't seem to find any such article (only Google's 2011 publication on TCP Fast-Open deployment, ironically), so it's going to be hard to find what you're talking about.

If you link to the article in question (and relevant quotes) I'm happy to try and clarify your misunderstanding.


Another good reason for VM migrations is cooling. Apparently certain cloud providers save a lot of money on cooling when they migrate vms across devices.


TCP basically exists to deal with an unsolvable problem, the Two Generals problem.

It's easy to look at it and say it could be better because you have tunnel vision for your use case. However it's also easy to forget that the protocol literally has a billion edge cases.

Why do you think your NIC has a new driver update every couple of weeks even though they have been running the same protocols for the last 50 years?


Often these “revolutionary” changes get deconstructed and co-opted. Not always a bad thing. Looking through a video he did about this paper, I see a few things that could be popped out.

SRPT (shortest remaining processing time) is there any reason this couldn’t be implemented as a LAN protocol?

Receiver-driven congestion control: isn't this what the advertised receive window does? Are we not just talking about setting a more aggressive starting value?

“sends packets in any order” and “can only send the first few packets without a grant (ack)” are fighting each other. Especially if you’re building a message oriented protocol, which tends to have shorter conversations. This reads as confused or schizophrenic.

I think what's being said here is that the Berkeley socket protocol sucks, and that a different one can get from user space to response faster. Great. But do you have to change the wire protocol for that, or just introduce a better system call library? This part in particular reads a lot like, “what if Erlang was right and we implemented it at the kernel level?” Which is not a bad question to ask.


If only we had a stream control transport protocol and were allowed to use it.


You can use SCTP in a datacenter. Ousterhout et al. are surely aware of SCTP, so I assume Homa is better in some way.


Forgive my ignorance, but why isn't SCTP more frequently used in DCs? I know it misbehaves with home routers etc., but that shouldn't be a factor here.


I suspect there are a couple of contributors.

TCP is prevalent on the internet, so you need a fairly strong motivation and clear benefits to adopt a second protocol. A lot of engineers also don't get the underlying networking, so one of the successes of TCP is that it's a file descriptor that you either write to or read from and magic makes it come out the other side. I've seen tech leadership on networking-centric products know nothing more than that you read and write and magic makes the data appear on the other side. Even on implementations that use SCTP, I've seen products that only use a single stream and mark every message as requiring in-order delivery. So it was effectively what TCP offers, using the SCTP protocol.

At the time TCP was also far higher performance than SCTP. This wasn't so much a protocol thing, but because TCP was getting more engineering attention, it got a lot more scheduler optimization, kernel optimization, and hardware offload support. So in many ways I think TCP scaled better due to these optimizations, which work both on the internet and internally. And then for multi-path, most data centers didn't get truly isolated networks. So if I'm running a mixture of TCP and SCTP, I still need L2 failover everywhere, which means my multi-homed SCTP connection isn't actually path diverse. And then, where beneficial over the internet, there are a few success cases of using multipath TCP extensions.

SCTP is still used quite a bit in the telco networks, but due to the above, it was quite a waste of time.


How does your theory that the failure of SCTP is because a) people don’t understand networking and b) tcp eats up all the development oxygen explain QUIC?

I'm also not sure what you mean; DCs within a major cloud provider are, AFAIK, mostly running truly isolated networks interconnected directly with fiber.

If you haven’t yet, I would recommend reading the very original QUIC paper. It was extremely astute and showed quite a deep understanding of what the problems were with TCP done by network engineers who really knew their shit (I got to interact with some of them when I was at Google). They talk about the failures of SCTP on technical levels and non-technical headwinds that weren’t accounted for like ossification. To my knowledge QUIC is SCTP 2.0 - it provides much of the same features and in a way that could actually leave the lab.


> How does your theory that the failure of SCTP is because a) people don’t understand networking and b) tcp eats up all the development oxygen explain QUIC?

I think this is the motivation side of the argument. SCTP doesn't provide any advantage internally for most use cases, as I outlined my thoughts on the basis above. QUIC on the other hand is an attempt to solve a completely different set of problems, and is getting the engineering dollars to deploy because where latency and internet comes into play, there is a strong motivation to be faster. And it also becomes more of an upgrade path.

> I’m also not sure what you mean but DCs within a major cloud provider are majority AFAIK running truly isolated networks interconnected directly with fiber.

Sorry about being unclear, I typed that out pretty quickly. One of the main factors that drove telecom to create and adopt SCTP is the way telecoms like to interconnect with each other. For signaling traffic (messages like "I want to set up a new phone call"), the telcos like to set up multiple independent connections. So with SCTP, they want multi-path support, where each server advertises a list of IP addresses for the connection. So between two telcos, you have a dedicated non-internet network A, and a diverse network B. Equipment that communicates on these networks is then physically plugged into both networks. This creates a need for a protocol that understands this, so that when a failure occurs in transmitting on the A network, retransmission occurs on the B network. The idea is these are diverse networks; nothing can really interact with both at the same time (that's the theory, in practice there be stories).

Where this maps to data center networks: to my knowledge most data center networks are not designed as an A and B network for diversity, where you would have to use multipath TCP or SCTP. And if you want to use both together, you're going to design the network to support all the failovers and redundancy needed to deliver TCP.

So that's what I was trying to get at: the big adoption driver and the protocol complexity are in the multi-path support, which, to fully utilize, requires additional engineering effort in the data center.


Same reason Homa isn't used: software isn't written for it.

With SCTP there's also a significant performance impact because many drivers for the protocol are far from optimised, because very few applications use it, because of its performance implications, because very few programs use it, etc. etc.

There are also firewall issues: big firewall vendors just don't play nice with anything that's not a variant of HTTP(S). You still need some kind of firewall in a datacenter, and it'd be foolish to set up two different ones for internal and external networking. Protocol ossification is real, and if you use any external piece of firewall kit, you're sure to run into problems if you try to use "novel" protocols like SCTP. Hell, you'll be lucky to get good IPv6 support.

You can write your own access control if you want but that's often perceived as more expensive than buying a box, especially if the box companies find their way into a meeting with management.

Lastly, there's education. A shocking number of developers have no idea how networking works. They probably know there are protocols like UDP and TCP, but their role and inner workings are often glossed over, in my experience. Practical networking courses seem to treat the network as some kind of black box where bytes and IP addresses go in and response data comes out. If developers do know their basic networking, that information is often out of date; people don't seem to realise how often TCP gets tweaked to behave slightly differently to improve performance. Ask your average dev something about IPv6 and I doubt they'll know much more than "it's IPv4 with more bits", because networking simply doesn't come up that often.

In the end, it comes down to tradeoffs, experience, and decisions. Feel free to write SCTP code for your server products where you can; the protocol definitely solves many issues people run into with TCP, but you'll probably have to defend your use of something unfamiliar to many developers every step along the way. The same is true for protocols like QUIC (outside the HTTP(S) environment), which tries to solve a whole lot of layer 3 to layer 5 problems in a single protocol that's designed to play nicely with shitty middleboxes through its basis in UDP.


Software -- legacy software, which is always all software currently in use, which is an enormous code base.

It would be easier to have a drop-in replacement for TCP such that, whenever it can work, connect() will use it, and listen()/accept() will accept it as well as TCP. Then all apps that can use TCP could use the new transport transparently.

Basically, we need a TCP++ that works with existing APIs but which can also provide new functionality via new APIs.

Of course, backwards-compatibility is very limiting, which sucks.

We can also have new transports that have new APIs, but we need a better TCP for backwards compatibility because legacy is forever.

Also, the focus on RPC is cool because any protocol where you typically have a library doing the I/O (and not too many such libraries) is amenable to using the new thing, and that includes HTTP (which isn't an RPC). But TFA really needs to mention HTTP in the same breath as RPC, because (sadly) way too many readers will just close the tab as soon as they see "RPC" and not "HTTP".


Could we get the title replaced to match the document, rather than the cliched, bombastic "It's time to ..." formulation?

Ousterhout actually wrote:

> We Need a Replacement for TCP in the Datacenter


One of the reasons we're 'stuck' with TCP and UDP on the Internet is because most middle-boxes (firewalls) don't really understand anything else.

If you're strictly operating inside a DC, presumably with minimal/fewer firewalls, could alternatives like SCTP and DCCP be an option?

* https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...

* https://en.wikipedia.org/wiki/Datagram_Congestion_Control_Pr...

* https://en.wikipedia.org/wiki/Transport_layer


Funny. I worked at AT&T Bell Labs in the 80s. All of these insights seem eerily familiar.


Sure; they're also similar to the SCTP insights, from the late 1990s.


Yup. Also QNX's native networking protocol. QNX's basic networking primitive is a remote procedure call, so there's a message-oriented network protocol underneath. It can be run either on top of UDP or directly at the IP level.


The original DARPA paper is what the TCP/IP stack is still based on, right? It feels like it was never even intended to run at the scale it's deployed at now. Which amazes me, to be honest, that people have somehow gotten it to work at this scale.


The congestion control algorithms have been significantly improved since then. But yes it’s a testament to how good the original design is that it’s still what we use today.

Protocols like SCTP and QUIC work similarly but can avoid head-of-line blocking.


FWIU, barring FTL / superluminal communication breakthroughs and controls, Deep Space Networking needs a new TCP as well:

From https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_t... :

    void tcp_retransmit_timer(struct sock *sk) {
        /* Increase the timeout each time we retransmit.  Note that
         * we do not increase the rtt estimate.  rto is initialized
         * from rtt, but increases here.  Jacobson (SIGCOMM 88) suggests
         * that doubling rto each time is the least we can get away with.
         * In KA9Q, Karn uses this for the first few times, and then
         * goes to quadratic.  netBSD doubles, but only goes up to *64,
         * and clamps at 1 to 64 sec afterwards.  Note that 120 sec is
         * defined in the protocol as the maximum possible RTT.  I guess
         * we'll have to use something other than TCP to talk to the
         * University of Mars.
         *
         * PAWS allows us longer timeouts and large windows, so once
         * implemented ftp to mars will work nicely. We will have to fix
         * the 120 second clamps though!
         */
/? "tp-planet" "tcp-planet" https://www.google.com/search?q=%22tp-planet%22+%22tcp-plane... https://scholar.google.com/scholar?q=%22tp-planet%22+%22tcp-...


The ultimate dream protocol is one in which a sender just encodes bits in a certain way such that the receiver will get them, and puts them on the line without any handshaking or synchronization. I don’t think this is impossible. The space of orthogonal codes across time and frequency could be chosen to be practically infinite, therefore, any random selection of two such codes would look like white noise to each other. The receiver would have to listen on a large subset of such channels all at once, which is not practical in real-time, but could be practical looking backward at the stored waveform from some carrier channel that all such possibilities have in common. It would commonly miss single bits and large chunks of data, so it would have to have FEC across multiple scales of code, frequency, and time. This works for large messages, but smaller messages would have to be sent over a channel with bandwidth narrowed to consume the time window of detection. Thus you should have a fair guarantee that either every message will be received during the detection window, or no message will be received, and this could be the job of network infrastructure to monitor and buffer as necessary. If that fails, then, well, maybe let the application layer deal with it.


> The ultimate dream protocol is one in which a sender just encodes bits in a certain way such that the receiver will get them, and puts them on the line without any handshaking or synchronization.

This is a recipe for DDoS.

Some handshaking is always necessary. You can minimize it, but you can't get rid of it.


> This is a recipe for DDoS.

Inside a datacenter?


Inside a datacenter it's called massive incast and it's still pretty bad.


Oh, right, in a datacenter probably not.

EDIT: But you know, UDP fits the bill.


Datagram is a layer-3 protocol. There’s a lot going on underneath that.


I think that was an interesting read. I have worked on a userspace implementation of TCP via DPDK, so I have sympathy for the limitations mentioned (the load-balancing and thread-scheduling arguments are very accurate).

However, I would have liked a section dedicated to why hardware-accelerated UDP wouldn't be an adequate solution versus a whole new protocol. It seems to me it provides a solid basis for achieving the results the author wants to bring about.
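
For what it's worth, "accelerated UDP" already has concrete kernel-level forms; here's a rough sketch of UDP GSO on Linux (assumes kernel >= 4.18; peer address, port, and sizes are illustrative; error handling omitted), where one send() hands the stack a buffer that it segments into wire-sized datagrams, possibly offloading the segmentation to the NIC:

    /* UDP GSO sketch: one syscall, many datagrams. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103   /* fallback for older userspace headers */
    #endif

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5000);                    /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* hypothetical */
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        int gso_size = 1400;  /* payload bytes per resulting datagram */
        setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

        char buf[45 * 1400];  /* ~63 KB: UDP GSO sends are capped near 64 KB */
        memset(buf, 'x', sizeof(buf));
        send(fd, buf, sizeof(buf), 0);  /* kernel/NIC splits into 45 datagrams */
        return 0;
    }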


I think protocol churn extends well past datacenters and Homa, with a need for 21st-century domestic protocols that remove all of the baggage of TCP/IP completely.

TCP/IP could be offshored to where the undersea fiber meets the shore, with perimeter data-edge servers set up as the only TCP/IP nodes on domestic boundaries.

For a domestic protocol, it does not need to be routable and saves even more overhead (I liked the tcp joke in the comments btw).

Encapsulation over a domestic protocol would still make subscriptions to a tcp/ip service available, but IMO, domestic boundaries would be better served with a modern domestic protocol.

Hypothetically speaking, if every TLD had been established on its own unique protocol in the original design, Internet 1.0 would have matured much differently. Every TLD is a .com nowadays anyway, too ambiguous to be of any use (and the .org debacle still makes me laugh. They forgot what an organization was and just blended into a .com reject. That mission statement was just toilet paper after the tug-of-war).

As it turns out, one size does not fit all with protocols. Multiprotocol networks within domestic borders without tcp/ip wouldn't change any of the benefits of a data network.


My take: this is the making of a great essay question for a graduate course in networking.

The chance of TCP being thrown out wholesale is zero. What might happen is some slow, incremental improvements in certain aspects of networking, like some other comments suggest.

Until eventually, a networking student picks up a dusty old TCP doc and says to his teacher, innocently, "But we're hardly doing any of this stuff anymore!"


Ousterhout's paper perhaps comes from his vision for RAMCloud as well ... in which he bet that network latency would go so low over time that accessing memory on another machine would be fast enough to enable whole new categories of applications.

https://dl.acm.org/doi/10.1145/2806887


Haven't data centers switched to RDMA already? Why are we still wasting time with this networking nonsense when most of the time we're just copying data from one memory or cache to another, over a private interconnect? ;-)

It seems that Ousterhout's complaints about RDMA are about current RDMA implementations. I expect many of them are fixable.

Ousterhout's complaints about TCP are all valid, though he doesn't mention my pet peeve with TCP, which is that connectivity breaks when your IP address changes. And requiring apps to deal with IP addresses in user-level APIs seems like a mistake.

Simple request-response RPC protocols were a good idea in the 1980s, and they're still a good idea today. I should probably read the Homa paper(s) regarding congestion control though, as it isn't covered in this PDF.


OK, I watched The Playlist, the movie about Spotify, and they mentioned that they forked TCP/IP and made it better. That was an incorrect statement, because under the hood everything on the internet still runs TCP/IP; they probably just improved their application layer or moved to UDP.


It's interesting that everyone (including the author) talks about UDP as a lossy protocol, but it doesn't seem that UDP drops actually occur on a routine basis anywhere. The UDP-based DDoS attacks seem to prove that; if UDP really were being dropped, those DDoS attacks wouldn't be so problematic.

That said, it's an interesting read. TCP is inefficient, but that inefficiency has been patched/masked by hardware solutions.

It's nice to see that someone's still thinking about this. I remember the days when there were tons of non-IP protocols floating around (IPX, DECnet, AppleTalk, etc.). TCP/IP won, which was not an obvious thing at the time.


I think the point re UDP being unreliable is that you have to design your applications to take the unreliability into account, because drops do happen even if they may be infrequent: assuming reliability when you can get unreliable behavior will result in correctness issues.


At which point most applications end up reinventing a good chunk of TCP.


I see UDP more like a low-level interface allowing you to build your own on top, where you decide which packets need to be received 100% and which ones can be dropped. Basically the foundation of your very own TCP, with hookers and blackjack.
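
As a toy illustration of how quickly "your very own TCP" accretes, here's a stop-and-wait sketch over a connect()ed UDP socket (the header, timeout, and ACK format are all invented for the example): even this minimal version already needs sequence numbers and retransmission, and a real design would go on to need sliding windows, RTT estimation, and congestion control.

    /* Toy stop-and-wait reliability over UDP (invented wire format). */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    struct hdr { uint32_t seq; uint8_t is_ack; };  /* hypothetical header */

    /* Retransmit the packet until the matching ACK arrives. */
    static void send_reliable(int fd, uint32_t seq, const char *data, size_t len) {
        char pkt[1500];
        struct hdr h = { htonl(seq), 0 };
        memcpy(pkt, &h, sizeof(h));
        memcpy(pkt + sizeof(h), data, len);

        struct timeval rto = { 0, 200000 };        /* fixed 200 ms "RTO" */
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &rto, sizeof(rto));

        for (;;) {
            send(fd, pkt, sizeof(h) + len, 0);     /* (re)transmit */
            struct hdr ack;
            ssize_t n = recv(fd, &ack, sizeof(ack), 0);
            if (n == (ssize_t)sizeof(ack) && ack.is_ack && ntohl(ack.seq) == seq)
                return;                            /* acknowledged */
            /* timeout, duplicate, or stray packet: loop and retransmit */
        }
    }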


The new digital television broadcasting system in the US (ATSC 3.0) is exactly this. It's all UDP, but wrapped in another layer which allows multiple virtual streams, and that's all encoded in an OFDM wireless protocol. It's bundled up at the broadcast center, sent out via the big towers, and then unwrapped and decoded on the receiver. The end result is that once the receiver chipset has stripped off the wrapper, the OS of whatever client device is consuming the broadcast just gets regular-looking UDP packets filled with MPEG-TS or DASH media streams, plus web pages, ads, games, or whatever. A.k.a. blackjack and hookers. Think of it as a giant one-way WiFi network using just UDP for the packets. It's honestly pretty cool.


> Basically the foundation of your very own TCP with hookers and blackjack.

That’s exactly how the early drafts of the QUIC RFC described it /s


Nothing deliberately drops UDP packets, but packets of all sorts get dropped when there's congestion.


Which is why protocols like TCP are built on backpressure - so you don't keep making the exact same mistake in a tight loop. Happy-path behavior doesn't matter when the worst case, or even the median case, is nonfunctional.


> but it doesn't seem that UDP drops actually occur on a routine basis anywhere.

I have seen code that didn't even handle out-of-order delivery work well for over a decade in local networks. Even when that broke down, it turned out that IP packet fragmentation just triggers a slow path in smart switches, so if your packets fit into the network's MTU (with some bytes to spare for VLAN tagging) you still might be able to avoid the problem.
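
If you're going to depend on staying under the MTU, Linux at least lets you enforce and query it per destination; a small sketch (hypothetical peer, no error handling) that sets DF so oversized sends fail with EMSGSIZE instead of fragmenting, then reads the current path MTU to size payloads:

    /* Avoid IP fragmentation: set DF and size datagrams under the path MTU. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port = htons(4000);                    /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &peer.sin_addr); /* hypothetical */
        connect(fd, (struct sockaddr *)&peer, sizeof(peer));

        /* DF on: oversized sends now fail instead of being fragmented. */
        int pmtud = IP_PMTUDISC_DO;
        setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));

        /* IP_MTU is only valid on a connected socket. */
        int mtu; socklen_t len = sizeof(mtu);
        getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len);
        printf("path MTU %d -> max UDP payload ~%d bytes\n", mtu, mtu - 28);
        return 0;
    }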


> It's interesting that everyone (including the author) talks about UDP as a lossy protocol, but it doesn't seem that UDP drops actually occur on a routine basis anywhere.

Just to clarify, you are referring to in the datacenter right? They occur in wireless all of the time.


Frame drops are endemic to cloud datacenters. High levels, all the time.


“Lossy” is not the word.

Reliable/unreliable are the words.

Packets can and do get dropped, TCP, UDP or otherwise. It’s just a question of how the protocol behaves when that happens.


Anyone for TIPC?

http://tipc.io/


Homa over EVPN-VXLAN, across multiple data centers - what could possibly go wrong? See [1] (4)

[1] https://www.rfc-editor.org/rfc/rfc1925


> Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks.

Not being sockets API compatible kinda sucks. Ok, we could use a new connect() variation that allows for earlier data send, but the API being mostly similar would help -- there's a ton of socket code out there!

As for RPC, well, RPC is mostly a thing of the past, with most everything today being HTTPS -- but usually through libraries that do the I/O, so it's possible to retrofit a new transport protocol into them. But why not just use QUIC within the datacenter?


I don't think they mean RPC in the sense of rpc(3), but rather things like gRPC or protobuf. The point is that if both sides really just want to talk gRPC to each other (and never want to talk plain TCP to each other), then that use case is "easy" to meet by implementing gRPC on top of Homa.



I think the main thing working against any alternative is that it is easier to keep consistency at all levels, instead of trying to use TCP for external connections and something else internally.



I get the issue with TCP, but I'm not sure about Homa... e.g., why not UDP with some Homa-like semantics on top? That might really ease the "Getting there from here" issue.

In fact, what's actually being suggested is for applications to replace calls to TCP-based APIs with calls to gRPC APIs (or other high-level RPC APIs), where the transport layer becomes an implementation detail. Fair enough, but this is a very roundabout way to go about it.


It would be interesting to see a write-up from them on Homa's benefits over UDP. UDP has been used to work around issues with TCP both in the datacenter and in unreliable WANs.

Skimming over the paper, I think the magic of Homa is in its RPC calls and its short-lived connections. When they're handled at layer 3 and 4, they can provide a significant hint to switches, routers and hosts regarding prioritization and congestion control. If you agree upon the prioritization algorithms as part of the protocol, then both sending and receiving hardware can coordinate much more easily.

If we instead just implemented something like Homa on top of UDP, it would basically mean that the top of the OSI stack would have to somehow inform the lower layers of the stack about these "sessions". You'd also have to hope that 3rd party peers decide to implement the hints in the same way. This would result in much more complexity.


Here's what I'm thinking:

You add a Homa-like header inside a UDP packet. Inside the datacenter you use switches, NICs, etc. that know and understand the Homa-like protocol and can implement Homa-like behavior... as needed. Anywhere else, you can fall back to plain UDP, due to its ubiquity.

Yes, various things would have to somehow know the Homa-like protocol was being used... just like various things would have to somehow know the Homa protocol was being used. Yes, different vendors would have to have compatible Homa-like implementations... just like different vendors would have to have compatible Homa implementations.

I think the complexity you mention is inherent in anything that actually gets more widely deployed and used.
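
To make that concrete, a purely hypothetical shim might look like the following (the fields and layout are invented for illustration and are not Homa's actual wire format): in-DC switches and NICs that understand the shim can act on the message length and priority, while everything else just forwards ordinary UDP.

    /* Hypothetical "Homa-like" shim carried inside each UDP payload. */
    #include <stdint.h>

    struct homa_like_hdr {
        uint64_t rpc_id;    /* identifies the request/response pair       */
        uint32_t msg_len;   /* total message length, known up front       */
        uint32_t offset;    /* byte offset of this packet within the msg  */
        uint8_t  priority;  /* receiver-granted priority class            */
        uint8_t  flags;     /* e.g. REQUEST / RESPONSE / GRANT            */
        uint16_t reserved;
    } __attribute__((packed));

    /* Each datagram on the wire:
     *   [ IP | UDP | struct homa_like_hdr | payload ]
     * Devices unaware of the shim treat it as plain UDP traffic. */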


I wrote a blog post that might be interesting to those wanting an introduction into some of the basics of the problems called "02-FEB-2011: Why is there packet loss ?" [0]

[0] https://rkeene.org/projects/info/wiki/176


Great post. I like the closing:

    Ultimately, we decided the best thing to do was to do nothing and hope for the best.
All of the experienced network engineers I've worked with who have run into issues that feel like they could be improved by tweaking QoS always end up saying to me, "Nah. Just get bigger/more pipes." I've never been in a position like yours to make a cogent argument as to why they were wrong, or to lay out the details as well as you did there.


Why not just use UDP instead?

I feel TCP is just designed to send files and data segments, but it doesn't work well for other things.


I mean, part of the Envoy/gRPC thing is to paper over some of the shortcomings of TCP, so that yes, it's still TCP underneath, but you're not setting up connections the same way. Furthermore, for any improvements in the space, the Envoy sidecar is well positioned to do that upgrade.


1 security vendor flagged this URL as malicious

https://www.virustotal.com/gui/url/43f33fe70cb4ef9fcc2370460...


> It uses several techniques for this, of which the most notable is that it takes advantage of the priority queues provided by modern switches.

I'd understand if they said routers, but switches? Do L2 switches have any notion of priority and if yes how does it work?


Typically you map a VLAN to a priority via PFC (Priority Flow Control). You can do it with vconfig on Linux. Switch OSes have their own CLI for this.

Some switches can do PFC for untagged packets. They classify based on DSCP, and map that to a PFC priority. I've never used that though.
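
On the host side, the marking itself is simple; here's a minimal Linux sketch (the DSCP value and priority are arbitrary choices, and whether a switch honors them depends entirely on its classification config) that sets the DSCP bits via IP_TOS and a local queueing priority via SO_PRIORITY, which the VLAN egress QoS map can translate into 802.1p bits:

    /* Mark a socket's traffic for priority classification. */
    #include <netinet/in.h>
    #include <sys/socket.h>

    int mark_socket(int fd) {
        int tos = 46 << 2;   /* DSCP 46 (EF) lives in the upper 6 TOS bits */
        if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
            return -1;

        int prio = 6;        /* skb->priority; the VLAN egress QoS map can
                                turn this into an 802.1p PCP value */
        return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
    }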


I always assumed google datacenter has their own protocol already


Until this replacement is as stable over time and as simple to implement, no worries. But if it's a stability joke or needs 748937493874943 devs to code an alternative...


"Homa demonstrates that it is possible to create a transport protocol that avoids all of TCP’s problems." - This is a huge statement... It assumes the author knows all of TCP’s problems...


SPDY for local RPC. Interesting idea. Seems the easiest way to get it out there would be support for it in gRPC?


Do we know where it has been submitted?


Time to return to the IPX protocol?


UDP?


I'd tell you a UDP joke, but you probably wouldn't get it.

So here's a TCP joke:

Hello, would you like to hear a TCP joke?

Yes, I'd like to hear a TCP joke.

OK, I'll tell you a TCP joke.

OK, I'll hear a TCP joke.

Are you ready to hear a TCP joke?

Yes, I am ready to hear a TCP joke.

OK, I'm about to send the TCP joke. It will last 10 seconds, it has two characters, it does not have a setting, it ends with a punchline.

OK, I'm ready to hear the TCP joke that will last 10 seconds, has two characters, does not have a setting and will end with a punchline.

I'm sorry, your connection has timed out... ...Hello, would you like to hear a TCP joke?


The handshake sequence is exaggerated. It's usually just 3 messages.

The 3 initial messages establish a connection.

    A: I would like to tell you something. (SYN)

    B: I acknowledge you want to tell me something. (SYN-ACK)

    A: I received your acknowledgement. (ACK)
After the handshake sequence is done, data transfer begins.

Only then, it is possible to know that the "something" was a joke.

If B answers with RST instead of SYN-ACK the connection is refused. If B doesn't answer, A will interpret this as a connection timeout.


Not even "after", you'd generally put the message in the same packet as the second ACK.


Only with specialized client software. The connect() system call doesn't return to the caller until the three-way handshake is complete, so you need a fourth packet to send useful data to the server socket.

In the other direction, where you have datacenter hardware and custom kernels, it's common to see the stack cheat and start blasting packets back to the client as soon as it gets the initial SYN, just expecting that the ACK will arrive normally.
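
Related aside: on Linux the "specialized client software" can be as small as using TCP Fast Open, which puts data in the SYN itself rather than waiting for connect() to return. A minimal client sketch (hypothetical address/port; assumes net.ipv4.tcp_fastopen is enabled on both ends and a TFO cookie has been cached from a prior connection; no error handling):

    /* TCP Fast Open client: payload rides in the SYN. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000   /* fallback for older headers */
    #endif

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(8080);                     /* hypothetical */
        inet_pton(AF_INET, "10.0.0.2", &srv.sin_addr);  /* hypothetical */

        /* Connects and queues the payload for the SYN in one call;
         * falls back to a normal handshake if no cookie is cached. */
        const char req[] = "PING";
        sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
               (struct sockaddr *)&srv, sizeof(srv));
        return 0;
    }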


Well, even if you wait for the kernel that's only microseconds before you can go from ACK to data. No round trips necessary.


You're missing the frame-size negotiation that then occurs following this.


A classic, but it's really far from how TCP actually works.


It needs something like -- has anyone told a TCP joke in this vicinity in the last 10 seconds? If so I will tell the joke at a slower rate


Did

Did you

Did you hear

Did you hear the

Did you hear the one

Did you hear the one about

Did you hear the one about traceroute?


Hello, would you like to hear a UDP joke?

No

Knock, knock!


Wait, two characters, as in, like, characters in a string, or characters in a story? Is a punchline like a newline?


Ack.


Fin!


Rst


Ack


There's no ACK for RST.

RST is the equivalent of hanging up.


RST can come at any time and be sent at any time.


Also, if an ACK packet takes a really long time for some reason, it may be duplicated and arrive after the connection is shut down.



They're not trying to make a universal standard. People are way too quick to dig out this xkcd.


So, you are suggesting a different standard response.

I think this cognitive trap neglects the lessons of NetBEUI.

Enhance your calm.


Cute, but I am suggesting not having a standard response.


And "not having a standard response" is the antithesis of global communication.

I appreciate your perspective though.

Have a wonderful day.


I usually find that standard responses impede real communication.


The ISO OSI model stratifies communication into layers, with the lower layers tasked with encapsulating meaningful human content for delivery across networks. The protocol itself is unaware of its payload content in most circumstances, which is why most network hardware has minimal complexity yet remains interoperable.

I assure you every packet of content is encapsulated in several standard responses.

Have a nice day =)


I don't really get the hate for TCP here. No one is forcing others to use TCP. It is perfectly fine to just use UDP if TCP is not the right fit for your scenario, and build whatever semantics you need on top of that. QUIC did it. Why can't the author?


So... basically another reinvention of UDP? I'm not entirely sure what "datacenter computing" is supposed to mean, given that a datacenter can be hosting machines for a very wide array of applications with very different network requirements.



