From a software point of view, it has always seemed to me that typical MTUs are an order of magnitude too small:
Disk operations as well as CPU operations (authentication, encryption, e.g. TLS) all benefit from using larger block sizes of 65536 bytes and up.
This amortizes system call overhead, enables larger sequential operations, enables more use of SIMD instructions, reduces instruction cache misses (working on one thing for longer), reduces context switches, and serves to increase the leverage that the control plane has on the data plane.
Just one specific example: something like TCP needs to keep track of packets (for retransmission, ACKs, etc.). When you're sending Gbps, small packets mean much more bookkeeping overhead (more method calls, more bits in ACK packets, more hash table lookups) compared to large packets. Throughput should be much better with an MTU of 65536 (and latency-sensitive applications don't have to send that much).
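Some back-of-envelope arithmetic (my numbers, not from the comment; the 10 Gbps figure is just an assumed example rate) shows how the per-packet bookkeeping rate scales inversely with MTU:

```python
# Illustrative arithmetic: packets per second needed to sustain a given
# throughput at different MTUs, assuming every packet carries a full
# MTU-sized payload (headers ignored for simplicity).

def packets_per_second(throughput_bps: float, mtu_bytes: int) -> float:
    """Packets/sec needed if every packet carries an MTU-sized payload."""
    return throughput_bps / (mtu_bytes * 8)

rate = 10e9  # 10 Gbps, an assumed example rate
small = packets_per_second(rate, 1500)    # ~833,000 packets/sec
large = packets_per_second(rate, 65536)   # ~19,000 packets/sec
print(f"1500-byte MTU:  {small:,.0f} packets/sec")
print(f"65536-byte MTU: {large:,.0f} packets/sec")
print(f"bookkeeping ratio: ~{small / large:.0f}x")
```

Every one of those per-packet events carries the fixed costs the thread describes (lookups, function calls, locking), so the ~44x ratio is roughly the ratio of fixed overhead too.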
I can imagine that in the past the cost of memory was radically different, implying a smaller MTU, but it seems like sticking with 1500 for IPv6 was an opportunity missed.
TCP overhead does indeed increase as MTU decreases, but not because "TCP needs to keep track of packets" -- TCP in fact keeps track of bytes, not packets. The reason that overhead is larger for smaller MTUs is simply because of the extra per-packet overheads (looking up connections, allocating memory for ethernet/ip packet headers, (uncontended) locking, repeated function calls to L3/L2 network layers and network drivers, etc).
The fact that TCP tracks bytes rather than packets enables hardware offloads like TSO/LSO (TCP Segmentation Offload, known to some as TCP Large Send) and LRO (Large Receive Offload). TSO lets TCP treat the effective MTU as large on the transmit side: it sends a single large packet (up to 64K everywhere but Windows LSOv2) down to the network card, which then segments it into 1500-byte chunks.
Similarly, on the receive side, LRO can make a small MTU appear large to TCP by transparently merging adjacent TCP segments into one much larger packet which is then passed to TCP. This can be done in hardware or software. Linux calls this GRO.
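A toy sketch of the coalescing idea behind LRO/GRO (assumed simplified logic, nothing like real kernel or NIC code): adjacent, in-order segments of the same flow get merged into one larger payload before the TCP stack sees them.

```python
# Toy illustration of receive-side coalescing: merge consecutive TCP
# segments whose sequence numbers line up exactly, per flow.

from dataclasses import dataclass

@dataclass
class Segment:
    flow: tuple          # (src_ip, src_port, dst_ip, dst_port), simplified
    seq: int             # TCP sequence number of first payload byte
    payload: bytes

def coalesce(segments: list[Segment]) -> list[Segment]:
    """Merge runs of contiguous same-flow segments into larger ones."""
    merged: list[Segment] = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (last is not None and last.flow == seg.flow
                and last.seq + len(last.payload) == seg.seq):
            last.payload += seg.payload   # contiguous: extend in place
        else:
            merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged

flow = ("10.0.0.1", 1234, "10.0.0.2", 80)
segs = [Segment(flow, 1000, b"a" * 1448),
        Segment(flow, 2448, b"b" * 1448),   # 1000 + 1448 = 2448: contiguous
        Segment(flow, 5000, b"c" * 1448)]   # gap: starts a new merged segment
out = coalesce(segs)
print([(s.seq, len(s.payload)) for s in out])  # [(1000, 2896), (5000, 1448)]
```

The first two segments merge into one 2896-byte payload; the third has a sequence gap, so it stays separate -- exactly the "adjacent segments only" restriction the comment describes.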
In the early internet there was a clash (IMO never settled) between fixed-packet-length and variable-packet-length networks. Compare ethernet to ATM.
The benefit of ATM is bandwidth guarantees. You can precisely allocate bandwidth to each tenant.
With ethernet you don't have such guarantees. A single tenant can potentially use all the resources. If you had allowed jumbo frames on early slow ethernet, you could easily imagine one tenant blocking the line for a-very-long-time and starving everyone else.
Of course, times have changed, bandwidth shot up, and 1500 is indeed way too small. But there is still a limit: you probably don't want to allow gigantic 1.5 MiB frames. Normal jumbo frames of 9 KiB are fine. It's just a historical accident that they never became more popular.
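To put numbers on the starvation worry (illustrative arithmetic only): the time one frame occupies a shared link is just frame size divided by link rate.

```python
# Serialization delay: how long a single frame monopolizes the wire.
# On early 10 Mbps ethernet this scales directly with frame size.

def serialization_ms(frame_bytes: int, link_bps: float) -> float:
    """Milliseconds a frame of this size occupies a link of this rate."""
    return frame_bytes * 8 / link_bps * 1000

for size, label in [(1500, "standard"), (9000, "jumbo"), (1_572_864, "1.5 MiB")]:
    print(f"{label:>8} frame on 10 Mbps: {serialization_ms(size, 10e6):8.2f} ms")
```

A 1500-byte frame ties up a 10 Mbps link for 1.2 ms and a 9 KiB jumbo for 7.2 ms, but a 1.5 MiB frame would block everyone for over a second -- which is the "a-very-long-time" problem above.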
The problem here is ethernet. Even the latest, greatest many-gigabit-per-second ethernet can be bridged to plain old coax-based 10 Mbps ethernet.
Within the ethernet protocol there is no way to negotiate the maximum frame length between stations. So IEEE standard still has a maximum of 1500 octets for the payload.
There are some RFCs that go beyond 1500, but mostly the IETF follows the IEEE specification.
Note that the 1500 link MTU applies to ethernet (and similar link technologies). For anything else, you can have larger (or smaller) MTUs.
Most network devices support jumbo frames¹, which are an IEEE standard, but they aren't the default. The point is not the IEEE standard, but what most devices are configured for.
Transit operators and IXPs just don't want to configure their equipment to use jumbo frames. It's probably just a coordination problem.²
Jumbo frames mean lots of reassembly on the most common end link, Wi-Fi (which cannot support larger datagrams without huge drop rates, due to the error-prone nature of the medium). Uncorrectable errors are pretty rare on wired links, but regardless, as a general rule the higher the MTU, the higher the ratio of bits that need to be retransmitted.
I don't have anything concrete here. I sort of wonder if we misunderstand the problem and conflate Wi-Fi noise/size issues with 4G noise/size issues. There are times I think bursty RF signals with big packets might be better than pessimal small packets in chains. At some level, the quoted 5 GHz Wi-Fi speeds have to be taken with a pinch of salt, if what you say is true.
I recently spent some time with IPv4 MTU. I don't know how much it applies to IPv6, but my TLDR is that MSS (+ size of minimum header) is usually a good indicator of achievable MTU, but some small fraction of people are misconfigured. A lot of that is SYNs indicating 1500 when the actual MTU is 1492 (PPPoE is a scourge). But there are some other issues too; they're hard to track down. Some client OSes are capable of probing for MTU black holes, but Android doesn't do it (thanks, Google), even though the kernel has it available, and there's no way an app could turn it on (thanks, Google). If you want to avoid this problem, you can send back MTU - X, depending on how conservative you want to be. 8, 20, and 50 were good values of X, but you're adding some overhead too.
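The "MSS + minimum header" heuristic above, sketched with IPv4 numbers (the 20-byte IP and TCP header sizes are the protocol minimums, without options):

```python
# Infer the MTU a client likely believes in from the MSS it advertises
# in its SYN. Minimum IPv4 header = 20 bytes, minimum TCP header = 20.

IPV4_HDR = 20
TCP_HDR = 20

def mtu_from_mss(mss: int) -> int:
    """MSS plus minimum IP+TCP headers ~= the client's idea of its MTU."""
    return mss + IPV4_HDR + TCP_HDR

print(mtu_from_mss(1460))  # 1500: plain ethernet
print(mtu_from_mss(1452))  # 1492: ethernet behind PPPoE (8-byte overhead)
```

The misconfiguration case from the comment is a client whose SYN implies 1500 while its actual path only passes 1492-byte packets.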
PPPoE drives me crazy because it's all on your ISP. Would it have been impossible to bump the MTU of the cable/DSL/fiber to 1508? The ISP owns all of that equipment, it seems like something that should be easy, and yet instead there are networks all over the place with a 1492 MTU instead.
MTU issues can be pernicious too, since a lot of stuff will still work while other stuff fails mysteriously. The webpage loads, but only some of the images, for example. Or the insecure page loads fine, but the secure one times out (the cert exchange sends out a full-size packet). Or you can SSH in fine until you cat a file and it suddenly freezes. Network nerds will know exactly what the problem is, but to regular people the problems are mystifying.
Also, if your security guys ever tell you that they need to disable all ICMP network wide for security reasons, tell them that it will have unintended consequences.
I had to buy a new router since for whatever reason the Apple Airport Extreme wouldn't correctly discover the MTU for PPPoE over my new fiber line (1454), and Apple in all their wisdom are the one router manufacturer that doesn't let you configure it manually...
> but my TLDR is MSS (+ size of minimum header) is usually a good indicator of achievable MTU, but some small fraction of people are misconfigured.
Just to make this clear: there is nothing misconfigured. The MSS is not "what is guaranteed to fit through all links between you and me" (obviously--how could your computer possibly know what the lowest MTU between you and some random server somewhere on the planet is?). It is simply "the largest TCP segment you are allowed to send me", i.e., anything larger is essentially guaranteed to fail. It can be used to indicate the largest segment size a low-memory implementation is able to process (which would have nothing to do with MTUs), and nowadays it is most commonly used to signal the TCP stack's best guess at an upper limit on the path MTU--which in practice means the MTU of the link the connection is being established through, i.e., usually some local ethernet or WiFi.
Using the MSS this way is purely a performance optimization, in that it provides a basis for fast discovery of the path MTU (because, obviously, the last hop MTU is known by the client and it's pointless to try sending packets that don't fit through the last hop link). Some home routers also use MSS clamping to work around broken server setups, based on the assumption that they are the only path between the client and the internet--but that's an ugly hack that is in no way required by standards and in any case doesn't guarantee success.
What would you call a client that is unable (or unwilling) to probe for path mtu, and has an mtu set higher than the next hop and doesn't receive ICMP needs frag packets?
I call it misconfigured. Perhaps it's the network that's misconfigured instead. Either way, when these clients send packets of a certain size, the packets never arrive at the server. Having the server send a smaller MSS makes it possible to communicate, but also invokes sadness.
The only thing there that sounds like a misconfiguration is whatever is dropping ICMP frag needed messages, and that is a misconfiguration no matter what.
My point is that no matter what MSS you send, that's never what's at fault for PMTU problems, that's not what MSS is for nor could it possibly do the job in all cases, whereas correct delivery and processing of frag needed/packet too big will always work.
As a quick summary: first of all, the host at each end of the connection knows its own MTU and advertises it as the MSS in the TCP handshake, so that in theory the connection is established at the right MTU--but this is never tested until a large packet is sent (which may or may not arrive). (The MSS is actually smaller than the MTU, since it's the data size without headers, and even on a given link it can change, e.g. if IP options are in use... but in any case it's derived from the MTU; RTFM for the gory details.)
Most consumer routers with an MTU of 1492 due to PPPoE will modify the MSS of a TCP connection in flight to the MSS they know for the next hop. This is not a standard feature in other routers, though, and obviously wouldn't really work on the internet at large, where paths can change mid-connection. So it's best used at endpoint routers only.
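A toy sketch of that clamping idea (assumed simplified logic, not real router code): the router rewrites the MSS option in transiting SYNs down to what fits through its own uplink.

```python
# MSS clamping sketch: lower a SYN's advertised MSS to match the router's
# next-hop (PPPoE) link. 8 bytes of PPPoE overhead eat into the 1500-byte
# ethernet MTU; minimum IPv4 + TCP headers are 20 bytes each.

PPPOE_OVERHEAD = 8
IPV4_HDR = 20
TCP_HDR = 20

def clamp_syn_mss(advertised_mss: int,
                  uplink_mtu: int = 1500 - PPPOE_OVERHEAD) -> int:
    """Rewrite the MSS option so segments fit the router's uplink MTU."""
    uplink_mss = uplink_mtu - IPV4_HDR - TCP_HDR   # 1492 - 40 = 1452
    return min(advertised_mss, uplink_mss)

print(clamp_syn_mss(1460))  # 1452: clamped for the PPPoE link
print(clamp_syn_mss(1400))  # 1400: already small enough, left untouched
```

Note the limitation from the comment: this only works if the router really is the bottleneck, which holds for a home endpoint router but not for arbitrary internet paths.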
Ideally such routers might advertise the MTU using DHCP or similar, and it is possible to do this now (it is commonly used in e.g. OpenStack clouds); however, it then unnecessarily limits the MTU of the local network. In theory it could send one MTU for the default route and another for the local subnet, but in practice basically no one does this, AFAIK.
Originally, a router along the path that knew the MTU was smaller on the subsequent link would fragment the packet. For various reasons this is an expensive operation for all involved, so we decided instead to take advantage of the "DF" (Don't Fragment) flag and generate an ICMP message back to the source, asking it to please adjust its MTU/MSS and retransmit. The flag specifically prevents the router from fragmenting the packet (that will never happen now) and is the default behavior at least on Linux and, I assume, on most OSes now.
This can go wrong in two ways. Firstly, the ICMP message may be blocked somewhere, e.g. by some over-zealous firewall, so it never arrives at the end host. Secondly, the MTU may be misconfigured somewhere at layer 2, so the packet is silently dropped rather than having an ICMP message generated: generating that message requires that a router knows the MTU of the complete L2 path to the next hop. If it gets that wrong, it will just send the packet and assume it is OK, when in reality the packet is dropped silently. (The same problem can be seen on a purely local network: if you configure e.g. a 9000 MTU on two hosts but an MTU of 1500 on your switch, your connections will break. It's not purely a routing issue.)
There is a second mechanism designed to discover the path MTU "passively" to solve this problem, called PLPMTUD (Packetization Layer Path MTU Discovery). It uses heuristics to detect that packets are being dropped and then probes the real path MTU, doing something like a binary search between some base MTU (e.g. 512) and the known max MTU (e.g. 1500). It tries various values until it settles on the actual MTU. The problem, of course, is that packets may be dropped by ordinary packet loss rather than MTU issues, so there is a bit of luck and strategy to getting this right--and it also introduces some extra latency while packets are being dropped.
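A toy model of that probing search (deliberately ignoring the hard part the comment mentions -- telling MTU drops apart from ordinary loss):

```python
# Toy sketch of the binary-search idea behind PLPMTUD: probe packet sizes
# between a base MTU known to work and the local maximum, treating a
# delivered probe as "fits" and a lost one as "too big". Real stacks must
# also distinguish MTU drops from ordinary packet loss; this toy does not.

def discover_path_mtu(probe_fits, base: int = 512, ceiling: int = 1500) -> int:
    """probe_fits(size) -> bool: did a probe packet of this size get acked?"""
    lo, hi = base, ceiling           # invariant: lo fits, anything > hi doesn't
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so the search terminates
        if probe_fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Simulate a path whose true MTU is 1492 (e.g. PPPoE somewhere en route):
true_mtu = 1492
print(discover_path_mtu(lambda size: size <= true_mtu))  # 1492
```

Each failed probe in a real stack means a dropped (and later retransmitted) packet, which is where the extra latency during discovery comes from.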
For whatever reason this option (sysctl net.ipv4.tcp_mtu_probing) is currently off by default on Linux (e.g. Ubuntu); if it were on, it would silently fix the MTU on a lot of connections that are currently broken. It has been enabled by default since Windows Vista, which explains why we still see people with a misconfigured MSS: Windows Vista onwards will just fix it. On Linux it also has two modes: one where it waits to detect an ICMP black hole, and another where it always uses probing from the start. The former results in a 1-2 s delay before it kicks in to fix a connection, so it's not ideal if that's happening on every connection, but OK for the odd broken one.
There is of course, like all topics, more to it than this.. but hopefully that helps.
Thanks for the summary. I'm aware of all of this; but from the server side, when clients mysteriously don't send packets of a certain size, it's hard to know why --- because I'm only on the server side, and I haven't had any luck getting knowledgeable people on the other end.
I can tell it's an MTU issue, because I get later packets, but I'm missing a block that happens to be an exact multiple of the MSS. It could be simple packet loss, except this pattern comes up too often: the client will sometimes send probing ACKs, but never the missing packets, and it never retries with smaller packets. This is especially bad when you're communicating with default Linux servers, which always send SYN+ACK with their local MSS, because so many clients think they can send 1500-byte packets but can't. It's less bad with most versions of FreeBSD [1], where the server returns whichever is smaller of the client-side MSS or the server MSS. There are a number of systems/networks that clamp outgoing SYNs but not incoming SYNs, so downloads mostly work, but uploads are broken. Mirroring the sent MSS isn't enough to make every connection work though, because of the exciting things you mentioned -- somewhere in the middle, something is dropping large packets, and ICMPs aren't sent or otherwise don't make it to the client, so it just kind of sits around waiting forever.
Edit to add: I wish the codified behavior was simply to truncate IP packets to the size available, rather than either fragment or alert. TCP peers would be able to notice that only shorter packets get acked, and adjust, and in the meantime some data would have gotten through. Not terribly great for UDP, but I dunno.
[1] there's a change in -CURRENT to use the Linux behavior, unfortunately
> I wish the codified behavior was simply to truncate IP packets to the size available
If a router could replace the packet with a "truncation sentinel" that's compatible with most firewalls, detectable by new implementations, and doesn't cause data corruption for old implementations, then perhaps truncation could be incrementally deployed? Like a "packet too big" error, but in the forward direction.
However, it might be impossible to construct a packet satisfying all of those conditions.
It's worth noting the big downside to enabling PLPMTUD: broken paths are then unlikely to ever actually get fixed. Instead they cause extra latency or lower initial connection performance, which is hard to diagnose--as opposed to being just plain broken and then fixed by the administrator/network operator/user/etc. in the first place.
Most of the time MTU issues are relatively local to the user--e.g. their router isn't configured with the actual MTU of their internet connection--and can be fixed by them (as opposed to by the ISP, etc.).
https://news.ycombinator.com/item?id=9001576