Maximum Transmission Unit (MTU)
The Maximum Transmission Unit (MTU) is the size of the largest Protocol Data Unit (PDU), i.e. the IP packet carrying a TCP segment or UDP datagram, that can be communicated in a single network layer transaction.
Larger MTUs:
- are more bandwidth efficient (more user data for the same amount of header overhead; ditto for per-packet processing)
- can have higher latency and jitter (ref. TCP_NODELAY, Nagle’s algorithm)
- are more sensitive to bit errors and faulty devices/configuration
The link MTU is essentially constrained by the maximum data link (and thence physical) layer PDUs; typically 1500 bytes for Ethernet (1501 to 9200+ bytes for “jumbo frames”). For IPv4, every host must accept datagrams of at least 576 bytes; for IPv6 the minimum link MTU is 1280 bytes.
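As a point of reference, the link MTU in play on a local interface is easy to inspect; a quick sketch (the interface names eth0 / en0 are assumptions for a typical Linux / macOS box):
```
# Linux: show an interface's MTU, and (if the NIC and switch path support jumbo frames) raise it
ip link show dev eth0            # "... mtu 1500 ..."
ip link set dev eth0 mtu 9000
# macOS: the interface MTU appears in the ifconfig output
ifconfig en0
```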
The Path MTU is the smallest link MTU along an end-to-end path.
When an application on an IPv4 end-host attempts to send a datagram larger than the outgoing link’s MTU, the OS socket interface may: segment it directly; and/or pass it on to the network layer for fragmentation; or raise an error; all depending on the OS and socket options in use. RTFM.
When an IPv4 intermediary device (e.g. a router) attempts to forward a packet that is larger than the outgoing link’s MTU, it may either fragment it or discard it. If the ‘Don’t Fragment’ (DF) IP flag is set it must not fragment: it discards the packet and should notify the sender with an ICMP ‘datagram too big’ message (aka ‘fragmentation required’); without DF, a discard is silent. IPv6 routers are never permitted to fragment packets and must send back an ICMPv6 ‘Packet Too Big’ message.
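To check whether those “too big” notifications are actually coming back, a packet capture along these lines can help (a sketch; the interface name eth0 is an assumption, and the IPv6 filter assumes no extension headers):
```
# IPv4: ICMP Destination Unreachable (type 3), Fragmentation Needed (code 4)
tcpdump -ni eth0 'icmp[0] = 3 and icmp[1] = 4'
# IPv6: ICMPv6 Packet Too Big (type 2); ip6[40] is the ICMPv6 type byte when no extension headers are present
tcpdump -ni eth0 'icmp6 and ip6[40] = 2'
```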
IP fragmentation is an evil best avoided (https://tools.ietf.org/id/draft-ietf-intarea-frag-fragile-14.html; Digital, 1987)
- when one fragment is lost, the entire transport layer datagram (segment) is lost
- many firewalls/etc will drop fragmented traffic as it opens myriad attack vectors
- forwarding performance can tank (The Drawbacks of IP Fragmentation « ipSpace.net blog)
- misguided network operators can block the necessary ICMP traffic
- load balancers often don’t correctly associate ICMP traffic with the right TCP sessions, e.g. Path MTU discovery in practice
  - not so much a fragmentation problem, rather an indirect ICMP blackhole
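Spotting whether fragments are actually on the wire is straightforward with a capture filter; a sketch (eth0 is an assumption):
```
# IPv4 packets that are fragments: More Fragments flag set, or a non-zero fragment offset
tcpdump -ni eth0 'ip[6:2] & 0x3fff != 0'
```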
Path MTU Discovery (PMTUD)
- to avoid IP fragmentation the application should send messages (TCP segments / UDP datagrams) that when packetised are no larger than the Path MTU (the maximum size of a packet that can make it across without fragmentation)
- Path MTU Discovery « ipSpace.net blog
- IP Fragmentation in Detail
- https://theojulienne.io/2020/08/21/mtu-mss-why-connections-stall.html (hands-on labs, if you want to play)
- normally the OS will probe and keep track of path MTUs per destination host for connection-oriented sockets (i.e. TCP); this is Path MTU Discovery
  - RFC 1191 describes the process, but it’s basically: “for packets larger than the current PMTU estimate, set the DF flag; if you get an ICMP ‘fragmentation needed’ back, reduce the estimate”
  - RFC 1191 has the initial estimate be the MTU of the first-hop link, which could be quite large for some media (e.g. Ethernet jumbo frames)
- for Linux ref. ip(7): IPv4 protocol implementation - Linux man page
- it’s entirely up to the OS when to do PMTUD: Exactly when is PMTUD performed?
- TCP applications use the PMTU transparently - the socket interface breaks the TCP stream from the application into appropriately sized segments to be passed to the IP layer, and the OS handles it from there
- the TCP Maximum Segment Size (MSS) is basically the transport-layer counterpart of the IP layer MTU (roughly the MTU minus the IP and TCP headers, e.g. 1460 bytes for a 1500-byte MTU), and is (usually) derived from the PMTU; if PMTUD is broken for a given path (e.g. because ICMP messages are blocked) then setting an interface’s MSS can be used to artificially restrict segment sizes
- TCP will establish a reasonable MSS for each direction during the initial TCP handshake; from then on it’s fixed for that connection; some hosts and routers can modify the MSS field during the handshake (“MSS clamping”, e.g. Cisco’s `ip tcp adjust-mss`; a Linux sketch follows this list)
- UDP applications are on their own: if they send excessively large datagrams without DF set, they will get fragmented; if the DF bit is set they will get dropped
- tunnelling/VPNs can exacerbate the problem by hiding the real MTU and amplifying fragmentation
- e.g. out of the box OpenVPN has a number of configuration settings that essentially attempt to “outwit” broken PMTUD environments, for both TCP and UDP; OpenVPN also does its own fragmentation that, when configured incorrectly, will lead to further fragmentation at the IP layer
- applications can set socket options to control things including fragmentation and PMTUD (e.g. Linux’s IP_MTU_DISCOVER, per the ip(7) reference above); YMMV across OSes
- inspecting socket options:
  - macOS: `netstat -va` can report the socket options in play for a given socket, however it does so in an undocumented form (a bit mask); `lsof` can report socket options on some platforms, but Linux and macOS aren’t two of them
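The Linux flavour of MSS clamping mentioned above is typically done on the forwarding box with iptables’ TCPMSS target; a minimal sketch, assuming the box routes the traffic in question (chain/table placement is illustrative, not prescriptive):
```
# rewrite the MSS option in forwarded TCP SYNs down to what the outgoing route's PMTU allows
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# or pin an explicit value, e.g. 1452 for a PPPoE link with a 1492-byte MTU (1492 - 20 IP - 20 TCP)
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1452
```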
Is PMTUD working?
- PMTUD will not work where the necessary ICMP traffic is dropped
  - there’s a reason for the ‘C’ in ICMP; blocking ICMP is bad form
- symptoms include
  - some traffic gets through yet some is dropped, e.g.
    - connections start out fine but then stop working (e.g. initial ssh is OK, then a hang)
    - web pages load, but not the images (or only some of them if they’re served by functioning sites, e.g. ad servers!)
    - small “control plane” data is fine, but bulk data doesn’t get through
    - TCP traffic mostly works fine, but (larger sized) UDP traffic tanks
  - note that the issue may affect all traffic (where the broken PMTUD is near the user, e.g. a PPPoE link behind a misconfigured gateway router), or only certain destinations (where the broken PMTUD is near the server)
- recall the PMTU is associated with a specific remote host and not any local interface - i.e. it’s part of a route cache
- you can ‘DIY’ probe the path MTU by using ping (or other user-space programs) to basically sweep packet sizes with the DF bit set, and see when you start getting ICMP ‘frag needed’ messages back (or, in the case of ping, stop getting replies!); remember ping’s payload size excludes the 20-byte IPv4 header and 8-byte ICMP header, so a 1472-byte payload probes a 1500-byte MTU
  - this should usually, but is not guaranteed to, correspond to the PMTU the kernel arrives at through its own PMTUD process (e.g. if your probes take a different path due to path updates or protocol differences)
Windows (untested)
```
netsh interface ipv4 show destinationcache
```
ping: `-l` length; `-f` don’t fragment
```
ping -l 2000 1.1.1.1
ping -l 1472 -f 1.1.1.1
```
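The per-interface MTU can be listed too (equally untested here; assumed to work on reasonably recent Windows versions):
```
netsh interface ipv4 show subinterfaces
```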
Linux
- the MSS used to be stored in the route cache; that was removed in the ~3.6 kernel and is not readily viewable
- `ss` (ss(8) - Linux manual page) shows all kinds of socket info, including PMTU, tx/rx MSS, etc.
```
# ss --tcp --info
... mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 ...
```
- DIY probe: `tracepath` (tracepath(8) - Linux man page)
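The kernel’s learned per-destination PMTU can also be poked at with iproute2; a sketch (1.1.1.1 is just an example destination, and the exact output wording varies across kernel versions):
```
# show the route for a destination, including any cached PMTU exception
ip route get 1.1.1.1
#   after a 'frag needed' has been processed, the output typically includes something like:
#   ... cache expires 597sec mtu 1400
# discard the cached exceptions (handy when re-testing)
ip route flush cache
```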
macOS
- no known way to query the kernel for its cached path MTUs
- DIY probe: ping sweep (`-D`: don’t fragment; `-g`: start, `-G`: end, `-h`: increment); the normal case is where you eventually try to send something that is just too big to be stuffed into the local interface:
```
% ping -D -g 1470 -G 1474 -h 1 10.13.1.2
PING 10.13.1.2 (10.13.1.2): (1470 ... 1474) data bytes
1478 bytes from 10.13.1.2: icmp_seq=0 ttl=62 time=15.219 ms
1479 bytes from 10.13.1.2: icmp_seq=1 ttl=62 time=16.507 ms
1480 bytes from 10.13.1.2: icmp_seq=2 ttl=62 time=17.100 ms
ping: sendto: Message too long
ping: sendto: Message too long
Request timeout for icmp_seq 3
```
and here’s where an intermediate router is the bottleneck; in this case the local gateway reports an MTU of 1460 bytes:
```
% ping -D -g 1430 -G 1474 -h 1 1.1.1.1
PING 1.1.1.1 (1.1.1.1): (1430 ... 1474) data bytes
1438 bytes from 1.1.1.1: icmp_seq=0 ttl=53 time=12.386 ms
1439 bytes from 1.1.1.1: icmp_seq=1 ttl=53 time=12.962 ms
1440 bytes from 1.1.1.1: icmp_seq=2 ttl=53 time=14.523 ms
556 bytes from 192.168.118.1: frag needed and DF set (MTU 1460)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src            Dst
 4  5  00 b505 0543   0 0000  40  01 f744 192.168.118.22  1.1.1.1
...
```
- ping as an MTU test
- note that the “Message too long” in the macOS example above is a socket error, not a returned ICMP message; i.e. in that case the local link MTU is the (equal) smallest path MTU
- where an intermediate router is the bottleneck, an ICMP fragmentation needed error is returned
- note that on some OSes (e.g. Linux) you’ll see only one such ICMP message: the OS takes note of the reduced PMTU and updates its cached PMTU accordingly; if you carry on trying to send overly big datagrams you’ll then get local socket errors, not ICMP responses, e.g. where some intermediate hop (10.11.12.13) has a small MTU:
```
PING 8.8.4.4 (8.8.4.4) 1472(1500) bytes of data.
From 10.11.12.13 icmp_seq=1 Frag needed and DF set (mtu = 1492)
ping: local error: Message too long, mtu=1492
ping: local error: Message too long, mtu=1492
...
```
ICMP Blackhole Check
- PMTUD is broken when misbehaving routers or firewalls drop the necessary ICMP traffic - aka an ICMP black hole
- a ping sweep is often an easy way of checking where any bottlenecks are (see above)
- command-line-free test: ICMP blackhole check
- verifies ICMP path MTU packet delivery and IP fragmented packet delivery between your host and the above server (alas, doesn’t seem to provide a means to run your own server instance)
- MTU: artificially responds to traffic from the client with “laughably small” MTUs via ICMP frag needed
```
% curl -v -s http://icmpcheck.popcount.org/icmp --data @<(base64 -w 0 /dev/urandom | dd bs=1k count=8)
...
{"msg1": "Upload complete", "mtu":1500, "lost_segs":0, "retrans_segs":0, "total_retrans_segs":0, "reord_segs":3, "snd_mss":853, "rcv_mss":853}
```
- fragmented delivery: sends artificially fragmented segments to client