Dropping the last packet of an HTTP transfer.


[Image: Network cables connected to switches.]

Article Style.

This is presented as a stream-of-consciousness ramble, written as I explored answering the questions asked.

Packet Loss.

As I wrote about in Simulating Poor Network Conditions, packet loss is a complicated thing, and network performance is still a hot topic on tech twitter and in general.

I recently got into an argument with an experienced engineer, where I made the case that packet loss is particularly deadly for the SYN, but not really for the last data-carrying packet of a TCP flow. He argued losing the last packet is equally deadly.

My argument was that the last “meaningful” packet is always followed by the 4-way FIN teardown exchange, so unless you lose both the last packet and the FIN, the sequence number on the FIN will tell you there is a gap in the flow, and the receiver will be able to send a fresh duplicate ACK to trigger a TCP fast retransmit.

So this article is a dive into some of the weeds of packet loss and Linux kernel behaviour.

  • Does losing the initial SYN have an outsize impact on small HTTP/1.1 transfers?
  • Does losing the initial SYN+ACK have an outsize impact on small HTTP/1.1 transfers?
  • Does losing the final packet with reply content have an outsize impact on small HTTP/1.1 transfers?

Getting Set Up to Test.

To start, I wrote a very simple and very dumb Rust program which listens for HTTP/1.1 requests and returns about 2.8 packets’ worth of response, which is what the (1400 / 16) * 3 is all about: 87 repeats of a 16-byte string, times 3, is 4176 bytes, a little under three full-size segments. Obviously this application is as dumb as it comes, but that is the point: we want simple and dumb.

use std::io::prelude::*;
use std::net::TcpListener;
use std::net::TcpStream;

fn main() {
    const HOST: &str = "0.0.0.0";
    const PORT: &str = "8888";
    let endpoint: String = HOST.to_owned() + ":" + PORT;
    let listener = TcpListener::bind(endpoint).unwrap();
    println!("Web server is listening at {}:{}", HOST, PORT);

    for stream in listener.incoming() {
        let stream = stream.unwrap();
        handle_connection(stream);
    }
}

fn handle_connection(mut stream: TcpStream) {
    // Make more than 2, but less than 3 "normal" packets.
    let contents: String = "0123456789abcdef".repeat((1400 / 16) * 3);
    let get = b"GET / HTTP/1.1\r\n";
    let mut buffer = [0; 1024];
    stream.read(&mut buffer).unwrap();

    if buffer.starts_with(get) {
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
            contents.len(),
            contents
        );
        // write_all, so a short write can't silently truncate the response.
        stream.write_all(response.as_bytes()).unwrap();
        stream.flush().unwrap();
    }
}

Then we test this with curl.

$ curl -v --no-keepalive http://127.0.0.1:8888
*   Trying 127.0.0.1:8888...
* Connected to 127.0.0.1 (127.0.0.1) port 8888
> GET / HTTP/1.1
> Host: 127.0.0.1:8888
> User-Agent: curl/8.8.0
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 200 OK
< Content-Length: 4176
< 
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef...

If we watch this with tcpdump, we see:

$ tcpdump -n -i lo tcp port 8888
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:27:04.135888 IP 127.0.0.1.50006 > 127.0.0.1.8888: Flags [S], seq 806780181, win 65495, options [mss 65495,sackOK,TS val 1691776208 ecr 0,nop,wscale 7], length 0
23:27:04.135916 IP 127.0.0.1.8888 > 127.0.0.1.50006: Flags [S.], seq 2720493902, ack 806780182, win 65483, options [mss 65495,sackOK,TS val 1691776208 ecr 1691776208,nop,wscale 7], length 0
23:27:04.135934 IP 127.0.0.1.50006 > 127.0.0.1.8888: Flags [.], ack 1, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0
23:27:04.135998 IP 127.0.0.1.50006 > 127.0.0.1.8888: Flags [P.], seq 1:78, ack 1, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 77
23:27:04.136004 IP 127.0.0.1.8888 > 127.0.0.1.50006: Flags [.], ack 78, win 511, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0
23:27:04.136108 IP 127.0.0.1.8888 > 127.0.0.1.50006: Flags [P.], seq 1:4218, ack 78, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 4217
23:27:04.136135 IP 127.0.0.1.50006 > 127.0.0.1.8888: Flags [.], ack 4218, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0
23:27:04.136159 IP 127.0.0.1.8888 > 127.0.0.1.50006: Flags [F.], seq 4218, ack 78, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0
23:27:04.136279 IP 127.0.0.1.50006 > 127.0.0.1.8888: Flags [F.], seq 78, ack 4219, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0
23:27:04.136294 IP 127.0.0.1.8888 > 127.0.0.1.50006: Flags [.], ack 79, win 512, options [nop,nop,TS val 1691776208 ecr 1691776208], length 0

And we’ve found our first problem: my assumption that this transaction would be broken up into 3 packets is broken, because although the MTU (Maximum Transmission Unit) of most internet connections and many Ethernet connections is 1500 bytes, localhost has no need for such limitations, and it has an MTU of 65536.

$ ip addr | grep "lo:" | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000

It actually turns out the loopback interface cheats its way around a bunch of rules, and even manually setting a lower MTU on it won’t change this. You might think the fix is easy, just get my local LAN address on my Ethernet interface and use that instead, but nope!

$ ip route get to 192.168.0.18
local 192.168.0.18 dev lo src 192.168.0.18 uid 1000

Linux is helpful, and goes and makes this also run over the loopback interface. Most of the time this is super helpful and will speed up performance when talking to yourself, but here it is actually a pain.
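
As an aside, if you want to check MTUs programmatically while debugging this sort of thing, Linux exposes them through sysfs. A minimal sketch (the interface names are the ones on my machines, swap in your own):

use std::fs;

// Read an interface's MTU from /sys/class/net/<iface>/mtu.
fn mtu(iface: &str) -> std::io::Result<u32> {
    let raw = fs::read_to_string(format!("/sys/class/net/{iface}/mtu"))?;
    Ok(raw.trim().parse().expect("mtu is always an integer"))
}

fn main() {
    for iface in ["lo", "enp7s0"] {
        match mtu(iface) {
            Ok(v) => println!("{iface}: mtu {v}"),
            Err(e) => println!("{iface}: {e}"),
        }
    }
}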

So let’s break out an actual client and server. I’ll get my laptop out. Now we are using wired Ethernet, with the laptop as the client and my desktop as the server. On the desktop, let’s capture this and see what we get. Surely now we are going to see individual packets, right?

laptop$ curl -v http://dwg-desktop:8888
...

desktop$ sudo tcpdump -i enp7s0 -n tcp port 8888
...
00:22:23.480387 IP 192.168.0.18.8888 > 192.168.0.56.43260: Flags [P.], seq 1:4218, ack 80, win 509, options [nop,nop,TS val 2152084486 ecr 2388013842], length 4217

Dammit. What is going on here?

Well, this is a quirk of the Linux kernel and my networking hardware. I’ve got a Z690 motherboard, with support for some hardware acceleration features for Ethernet. This means the kernel can hand the NIC much larger chunks of data than fit in a single packet, and the “packetisation” of that traffic is actually handled by the NIC hardware.

We can confirm this with:

$ sudo ethtool -k enp7s0 | grep tcp-segmentation-offload
tcp-segmentation-offload: on

This shows the NIC is doing the work for us. Thankfully we can disable it:

$ sudo ethtool -K enp7s0 tso off

$ sudo ethtool -k enp7s0 | grep tcp-segmentation-offload
tcp-segmentation-offload: off

Now we are doing segmentation in software, which will obviously slightly slow performance, but will allow us to actually see each packet at the kernel level, which is what these experiments need.

If you put a network tap, SPAN port, or similar between the client and the server, you would have seen normal-sized packets all along, but disabling TSO is what lets us do this easily with what we have to hand.
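
We can also predict what we should now see before capturing it. The negotiated MSS is 1460 (1500 MTU minus 20 bytes of IP header and 20 of TCP), and the TCP timestamp option costs another 12 bytes per segment, leaving 1448 bytes of payload. A quick sanity-check sketch of that arithmetic:

fn main() {
    // 1500 MTU - 20 IP header - 20 TCP header - 12 timestamp option.
    let payload_per_segment = 1500 - 20 - 20 - 12;
    // 41 bytes of HTTP headers plus the 4176 byte body.
    let mut remaining = 4217;
    while remaining > 0 {
        let segment = remaining.min(payload_per_segment);
        println!("segment carrying {segment} bytes");
        remaining -= segment;
    }
}

That predicts segments of 1448, 1448, and 1321 bytes.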

And now we can finally confirm:

$ sudo tcpdump -i enp7s0 -n tcp port 8888
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp7s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
00:37:03.124196 IP 192.168.0.56.33180 > 192.168.0.18.8888: Flags [S], seq 892057927, win 32120, options [mss 1460,sackOK,TS val 2388642832 ecr 0,nop,wscale 7], length 0
00:37:03.124227 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [S.], seq 1926036559, ack 892057928, win 65160, options [mss 1460,sackOK,TS val 2152964130 ecr 2388642832,nop,wscale 7], length 0
00:37:03.125628 IP 192.168.0.56.33180 > 192.168.0.18.8888: Flags [.], ack 1, win 251, options [nop,nop,TS val 2388642834 ecr 2152964130], length 0
00:37:03.126097 IP 192.168.0.56.33180 > 192.168.0.18.8888: Flags [P.], seq 1:80, ack 1, win 251, options [nop,nop,TS val 2388642834 ecr 2152964130], length 79
00:37:03.126103 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [.], ack 80, win 509, options [nop,nop,TS val 2152964132 ecr 2388642834], length 0
00:37:03.126205 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [.], seq 1:1449, ack 80, win 509, options [nop,nop,TS val 2152964132 ecr 2388642834], length 1448
00:37:03.126212 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [.], seq 1449:2897, ack 80, win 509, options [nop,nop,TS val 2152964132 ecr 2388642834], length 1448
00:37:03.126213 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [P.], seq 2897:4218, ack 80, win 509, options [nop,nop,TS val 2152964132 ecr 2388642834], length 1321
00:37:03.126225 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [F.], seq 4218, ack 80, win 509, options [nop,nop,TS val 2152964132 ecr 2388642834], length 0
00:37:03.128380 IP 192.168.0.56.33180 > 192.168.0.18.8888: Flags [.], ack 4218, win 249, options [nop,nop,TS val 2388642837 ecr 2152964132], length 0
00:37:03.128823 IP 192.168.0.56.33180 > 192.168.0.18.8888: Flags [F.], seq 80, ack 4219, win 249, options [nop,nop,TS val 2388642837 ecr 2152964132], length 0
00:37:03.128827 IP 192.168.0.18.8888 > 192.168.0.56.33180: Flags [.], ack 81, win 509, options [nop,nop,TS val 2152964135 ecr 2388642837], length 0

Yup, now we have 2 packets with length 1448, followed by one with length 1321. These are our data packets. So we can finally confirm what the code comment said: the “payload” for this traffic comes in 3 main packets. Let’s make EXTRA sure of that by having a peek at the packet contents, or at least the start of them.

desktop$ sudo tcpdump -i enp7s0 -n -X -s 160 tcp port 8888
<snipped>

00:42:11.463756 IP 192.168.0.18.8888 > 192.168.0.56.40178: Flags [.], seq 1:1449, ack 80, win 509, options [nop,nop,TS val 2153272470 ecr 2388951169], length 1448
	0x0000:  4500 05dc abb8 4000 4006 07c9 c0a8 0012  E.....@.@.......
	0x0010:  c0a8 0038 22b8 9cf2 a0e9 86b8 1869 600b  ...8"........i`.
	0x0020:  8010 01fd 8769 0000 0101 080a 8058 5496  .....i.......XT.
	0x0030:  8e64 8081 4854 5450 2f31 2e31 2032 3030  .d..HTTP/1.1.200
	0x0040:  204f 4b0d 0a43 6f6e 7465 6e74 2d4c 656e  .OK..Content-Len
	0x0050:  6774 683a 2034 3137 360d 0a0d 0a30 3132  gth:.4176....012
	0x0060:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0070:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0080:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0090:  3334                                     34
00:42:11.463761 IP 192.168.0.18.8888 > 192.168.0.56.40178: Flags [.], seq 1449:2897, ack 80, win 509, options [nop,nop,TS val 2153272470 ecr 2388951169], length 1448
	0x0000:  4500 05dc abb9 4000 4006 07c8 c0a8 0012  E.....@.@.......
	0x0010:  c0a8 0038 22b8 9cf2 a0e9 8c60 1869 600b  ...8"......`.i`.
	0x0020:  8010 01fd 8769 0000 0101 080a 8058 5496  .....i.......XT.
	0x0030:  8e64 8081 6630 3132 3334 3536 3738 3961  .d..f0123456789a
	0x0040:  6263 6465 6630 3132 3334 3536 3738 3961  bcdef0123456789a
	0x0050:  6263 6465 6630 3132 3334 3536 3738 3961  bcdef0123456789a
	0x0060:  6263 6465 6630 3132 3334 3536 3738 3961  bcdef0123456789a
	0x0070:  6263 6465 6630 3132 3334 3536 3738 3961  bcdef0123456789a
	0x0080:  6263 6465 6630 3132 3334 3536 3738 3961  bcdef0123456789a
	0x0090:  6263                                     bc
00:42:11.463762 IP 192.168.0.18.8888 > 192.168.0.56.40178: Flags [P.], seq 2897:4218, ack 80, win 509, options [nop,nop,TS val 2153272470 ecr 2388951169], length 1321
	0x0000:  4500 055d abba 4000 4006 0846 c0a8 0012  E..]..@.@..F....
	0x0010:  c0a8 0038 22b8 9cf2 a0e9 9208 1869 600b  ...8"........i`.
	0x0020:  8018 01fd 86ea 0000 0101 080a 8058 5496  .............XT.
	0x0030:  8e64 8081 3738 3961 6263 6465 6630 3132  .d..789abcdef012
	0x0040:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0050:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0060:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0070:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0080:  3334 3536 3738 3961 6263 6465 6630 3132  3456789abcdef012
	0x0090:  3334                                     34
00:42:11.463763 IP 192.168.0.18.8888 > 192.168.0.56.47184: Flags [F.], seq 4218, ack 80, win 509, options [nop,nop,TS val 2154175382 ecr 2389737516], length 0

These are the 3 data packets. The first contains the HTTP/1.1 200 OK status line, the Content-Length: 4176 header, and then the 0x0d0a0d0a sequence for \r\n\r\n, defined in the HTTP/1.1 RFC to separate header from body.

The next two packets consist entirely of IP and TCP header information, followed by the payload of repeating “0123456789abcdef” strings.

This is followed by the length-0 packet with the TCP FIN flag set, indicating we want to tear down the connection with the 4-way termination handshake.

Testing multiple times.

So we want to run a battery of tests in this very controlled environment, which produces basically the exact same packets on each run. This will let us introduce random failures (and some less-than-random failures) as in my previous tests, and see the results.

Let’s make a hacky shell script to run this test multiple times, and collect and examine the results. To avoid complicating things further, you are going to need a real program for time, not the bash keyword time. On Debian-based systems, you may need to install it with sudo apt install time, and when using it, refer to it directly as /usr/bin/time.

Here is our script.

#!/usr/bin/env bash

let i=100

rm -f timings.txt

while [ $i -gt 0 ]; do
    /usr/bin/time -a -o timings.txt -f "%e" curl -s http://dwg-desktop:8888 > /dev/null
    i=$((i-1))
done

sort -n timings.txt | uniq -c

And as expected, running it shows that all these curl connections are currently quite quick.

$ ./multitest.sh
     94 0.02
      6 0.03

Testing with 50% packet loss.

Cribbing from my last blog, we can prove that we can really mess with this connection by adding a lot of random packet loss. Let’s add 50% packet loss to what the client is sending (so the request itself, the ACKs, and the FIN in the teardown) and see what happens to the performance.

laptop$ sudo iptables -A OUTPUT -p tcp --dport 8888 -m statistic --mode random --probability 0.5 -j DROP

laptop$ ./multitest.sh
     14 0.02
      3 0.03
      3 0.22
     12 0.23
      1 0.29
      3 0.64
      2 1.04
      9 1.05
      1 1.06
      1 1.22
      1 1.23
      1 1.24
      3 1.25
      1 1.26
      1 1.46
      1 1.47
      2 1.48
      1 1.49
      1 1.50
      1 1.64
      2 1.65
      2 1.69
      1 2.05
      4 2.06
      3 2.07
      1 2.08
      1 2.26
      1 2.28
      1 2.48
      1 3.09
      3 3.10
      1 3.12
      1 3.29
      1 3.30
      1 3.31
      1 3.52
      3 4.11
      1 4.12
      1 4.30
      1 4.33
      1 4.34
      4 5.01
      1 5.59

Yup, that sure looks AWFUL, which with 50% packet loss isn’t exactly a surprise. One thing that is interesting here though is we have around 17/100 connections that were totally unimpacted (0.02 or 0.03 seconds), and then 15/100 more connections that only lost around 200ms (the ones with 0.22 and 0.23).

Then we have a lot more, but they are all clustered roughly 200ms apart, which smells a lot like the kernel’s 200ms minimum retransmission timeout.

Let’s bucket these into 20ms buckets and throw them into a histogram.

import matplotlib.pyplot as plt

def load_file(filename):
    # Each line of the `uniq -c` output looks like: "     12 0.23",
    # i.e. a count followed by a time in seconds.
    times = []
    with open(filename, 'r') as f:
        for line in f:
            count, time = line.split()
            times.extend([float(time)] * int(count))
    return times

def create_histogram(times):
    # 20ms wide buckets, from 0 up to the slowest observed time.
    bucket = 0.02
    buckets = int(max(times) / bucket) + 2
    plt.hist(times, bins=[i * bucket for i in range(buckets)])
    plt.xlabel('Time (seconds)')
    plt.ylabel('Frequency')
    plt.title('Connection Times')
    plt.show()

if __name__ == "__main__":
    filename = 'events.txt'  # Replace with your file name
    times = load_file(filename)
    create_histogram(times)

Now let’s check what happens when we bin packets going in the other direction, from the server to the client. That will be the SYN+ACK, the ACKs, the 3 data packets we looked at earlier, the FIN, and the final ACK.

We will clear the rule from iptables on the laptop, and add one on the server.

laptop$ sudo iptables -D OUTPUT -p tcp --dport 8888 -m statistic --mode random --probability 0.5 -j DROP

desktop$ sudo iptables -A OUTPUT -p tcp --sport 8888 -m statistic --mode random --probability 0.5 -j DROP

laptop$ ./multitest.sh
     27 0.02
      3 0.03
      2 0.04
      4 0.23
      1 0.24
      1 0.25
      1 0.26
      1 0.30
      1 0.69
      1 0.70
      1 0.71
      2 1.02
      6 1.03
      8 1.04
      8 1.05
      7 1.06
      1 1.27
      1 1.30
      1 1.31
      2 1.37
      1 1.55
      1 1.56
      1 1.74
      1 2.05
      1 2.06
      2 2.07
      1 2.24
      1 2.35
      1 2.37
      1 2.71
      2 3.10
      1 3.18
      2 3.41
      1 3.42
      1 3.68
      1 5.01
      1 8.71

Well, this data is different! We can now tell that not all loss is equal, though you would likely have assumed that before any of this testing began. 50% is also quite a brutal amount of loss; realistically, 50% loss is going to make any internet connection totally and utterly unusable.

Applying the same methods, we will go back and do this test again, but with a more realistic amount of packet loss, 3%, and we will run the test 10,000 times instead of 100. I won’t walk through each individual step as I both hope it is obvious and I’ve been writing this post for long enough already!

# 3% client -> server loss.
     11 0.01
   8732 0.02
    840 0.03
     60 0.04
      9 0.05
      7 0.06
      1 0.07
      1 0.09
     12 0.13
     26 0.14
      4 0.15
      2 0.23
      2 0.24
      1 0.25
      1 0.27
      2 0.29
      1 0.30
      1 0.32
     34 1.02
     65 1.03
     93 1.04
     70 1.05
     16 1.06
      2 1.07
      1 1.08
      1 1.10
      1 1.12
      1 1.30
      1 1.35
      1 1.38
      1 1.52
      1 2.07

We would expect 97% of our connections to be unaffected, and (11 + 8732 + 840 + 60) / 10000 * 100 = 96.43% do seem pretty much unimpacted, and the 0.05’s and 0.06’s could be “normals” as well. This is certainly close enough that I’m happy with it.

But then we notice a cluster of events around the 1 second mark. In fact, that is where almost all our other timings are, and at roughly 280 of the 10,000 runs it is suspiciously close to our 3% loss rate. Why?

Well, a reasonable theory would ask what happens when the very first packet (the TCP SYN) gets lost. Or what happens when the GET request gets dropped?

So we can test for that. We just need filters that drop those things specifically. Let’s start with the easy one: make the SYNs always fail the first time, and work the second. We can go back to 100 trials now we are being more destructive in our tests.

Testing dropping the SYN’s.

# Cleanup the last test.
desktop$ sudo iptables -D OUTPUT -p tcp --sport 8888 -m statistic --mode random --probability 0.5 -j DROP

# Match every other SYN packet.
# The "SYN SYN" part says to examine the SYN flag, and it must be set.
# Check the iptables-extensions(8) man page for more detail.
laptop$ sudo iptables -A OUTPUT -p tcp --tcp-flags SYN SYN --dport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP

laptop$ ./multitest.sh 
      2 1.03
     14 1.04
     66 1.05
     19 1.06

Interesting, and kind of obvious. If our TCP SYN fails to send, we have no way to know anything went wrong other than to sit and wait, and when we get no SYN+ACK back, try again. We should be able to modify the drop to every 3rd packet and get a different result. If our theory is correct, we should expect 50% of connections to work fine, and 50% of the connections to have the 1 second delay (remember the retransmitted SYN also passes through the filter, so it advances the nth counter too). Let’s confirm.

laptop$ sudo iptables -D OUTPUT -p tcp --tcp-flags SYN SYN --dport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP
laptop$ sudo iptables -A OUTPUT -p tcp --tcp-flags SYN SYN --dport 8888 -m statistic --mode nth --every 3 --packet 0 -j DROP
laptop$ ./multitest.sh 
     40 0.02
      6 0.03
      4 0.04
      7 1.02
     10 1.03
     17 1.04
     16 1.05
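
Why 50% and not one-in-three? Because, as mentioned above, the retransmitted SYN goes back through the same filter and advances the nth counter. A tiny simulation of that counter (a sketch of the statistic match’s documented behaviour, not the kernel’s actual code) makes the pattern clear:

fn main() {
    for every in [2, 3] {
        // One shared counter, advanced by every packet the rule examines.
        let mut counter = 0;
        let mut delayed = 0;
        let total = 100;
        for _conn in 0..total {
            let mut attempts = 0;
            loop {
                attempts += 1;
                let dropped = counter == 0; // --packet 0 drops this slot
                counter = (counter + 1) % every;
                if !dropped {
                    break; // this SYN made it through
                }
            }
            if attempts > 1 {
                delayed += 1; // this connection ate a 1 second retry
            }
        }
        println!("--every {every}: {delayed}/{total} connections delayed");
    }
}

It prints 100/100 delayed for --every 2 and 50/100 for --every 3, matching both experiments.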

Looks like the theory is confirmed. Is there anything we can actually do to fix this? Where does this one second delay actually come from? It turns out Linux has a retransmission timer for SYNs, and again we can confirm this with some experimentation. Let’s just drop 100% of the SYNs, and watch what happens.

laptop$ sudo iptables -D OUTPUT -p tcp --tcp-flags SYN SYN --dport 8888 -m statistic --mode nth --every 3 --packet 0 -j DROP
desktop$ sudo iptables -A INPUT -p tcp --tcp-flags SYN SYN --dport 8888 -j DROP

laptop$ curl -v -s http://192.168.0.18:8888
*   Trying 192.168.0.18:8888...
* ipv4 connect timeout after 5000ms, move on!
* Failed to connect to 192.168.0.18 port 8888 after 5002 ms: Timeout was reached
* Closing connection

Okay, so curl told us it gave up after 5 seconds, but what did it do during those 5 seconds? Let’s look at the tcpdump on the laptop to see what we are actually sending.

laptop$ sudo tcpdump -i wlp2s0 -n tcp port 8888
09:18:20.263094 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397837211 ecr 0,nop,wscale 7], length 0
09:18:21.280085 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397838228 ecr 0,nop,wscale 7], length 0
09:18:22.308096 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397839256 ecr 0,nop,wscale 7], length 0
09:18:23.328095 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397840276 ecr 0,nop,wscale 7], length 0
09:18:24.352108 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397841300 ecr 0,nop,wscale 7], length 0
09:18:25.352118 IP 192.168.0.56.59442 > 192.168.0.18.8888: Flags [S], seq 627248382, win 32120, options [mss 1460,sackOK,TS val 2397841300 ecr 0,nop,wscale 7], length 0

Okay, so that is easy to figure out: we have 6 goes at sending SYNs, with 1 second between each.

We send 6 SYN’s because

laptop$ sudo sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6

Super confusingly, this variable is also used for IPv6; there is no address-family-neutral or IPv6-specific option. The IPv6 code just reads the ipv4 sysctl.

If we tweak that option, we can send more or fewer SYN retries, but realistically if you send 6 and they all fail, you are going to have a bad time.
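
As an aside, you don’t have to change the system-wide sysctl: Linux also exposes this per socket as the TCP_SYNCNT socket option (see tcp(7)). A minimal sketch using the libc crate, minus proper error handling:

use std::io;
use std::mem;

fn main() -> io::Result<()> {
    unsafe {
        let fd = libc::socket(libc::AF_INET, libc::SOCK_STREAM, 0);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        // Give up after 2 SYN retransmissions instead of the default 6.
        let retries: libc::c_int = 2;
        let rc = libc::setsockopt(
            fd,
            libc::IPPROTO_TCP,
            libc::TCP_SYNCNT,
            &retries as *const _ as *const libc::c_void,
            mem::size_of::<libc::c_int>() as libc::socklen_t,
        );
        if rc != 0 {
            return Err(io::Error::last_os_error());
        }
        // A connect(2) on this fd would now give up much sooner.
        libc::close(fd);
    }
    Ok(())
}

Note this caps the retry count, not the 1 second initial timeout itself, which is the bit that needs the kernel surgery below.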

Interestingly, if you go back to pre-2023 kernels, you will get a different behaviour: exponential backoff, so usually a 1 second wait, then 2 seconds, then 4, 8, 16 etc …

But this patch added a sysctl (net.ipv4.tcp_syn_linear_timeouts), based on a 2009 IETF presentation, to implement linear timeouts for at least the first N retransmissions.

Sidetracking into Initial SYN Timeouts.

What you might hope is that the 1 second timeout would be configurable. It … is not. Of course Linux is open source, so it actually IS, but you likely don’t want to do this. This isn’t a blog on building your own kernel, but you just do …

test-vm# cd /usr/src/linux-source-6.8
test-vm# vi include/net/tcp.h
:: 
   change #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	
   to     #define TCP_TIMEOUT_INIT ((unsigned)(0.1*HZ))	
::

test-vm# make
test-vm# make modules
test-vm# make modules_install
test-vm# make install
test-vm# reboot

And test.

test-vm$ curl http://192.168.0.18:8888

test-vm# tcpdump -n -i enp1s0 tcp port 8888
22:33:58.573685 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734875974 ecr 0,nop,wscale 7], length 0
22:33:58.681080 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876082 ecr 0,nop,wscale 7], length 0
22:33:58.785089 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876186 ecr 0,nop,wscale 7], length 0
22:33:58.889106 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876290 ecr 0,nop,wscale 7], length 0
22:33:58.993153 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876394 ecr 0,nop,wscale 7], length 0
22:33:59.097038 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876498 ecr 0,nop,wscale 7], length 0
22:33:59.305043 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734876706 ecr 0,nop,wscale 7], length 0
22:33:59.713199 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734877114 ecr 0,nop,wscale 7], length 0
22:34:00.569003 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734877970 ecr 0,nop,wscale 7], length 0
22:34:02.233100 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734879634 ecr 0,nop,wscale 7], length 0
22:34:05.497085 IP 192.168.122.136.51320 > 192.168.122.1.8888: Flags [S], seq 67404136, win 64240, options [mss 1460,sackOK,TS val 734882898 ecr 0,nop,wscale 7], length 0

Well, that works. Here we can see the first 6 SYNs are all now ~100ms apart, and then we start the exponential backoff. Such is the glory of open source: if you REALLY want something done, go do it.

Okay, that is enough sidetracking, I think it is time for a goal review.

Further Sidetracking: I lied, here is another sidetrack. I made a very hacky bit of BPF code to do this, but it is not very robust. It is very much a proof of concept hack, not something to deploy in production.

Goal 1:

  • Does losing the initial SYN have an outsize impact on small HTTP/1.1 transfers?
    • Yes, it really does. A lost SYN with a Linux client will add 1 second to any connection that loses one. Tweaking this is only possible by recompiling the kernel with new options.

Loss of SYN+ACK.

Is losing the SYN+ACK (the 2nd packet of the handshake) going to have the same impact? Let’s drop the SYN filter, and add a SYN+ACK filter.

desktop# iptables -D INPUT -p tcp --tcp-flags SYN SYN --dport 8888 -j DROP
desktop# iptables -A OUTPUT -p tcp --tcp-flags SYN,ACK SYN,ACK --sport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP

laptop$ ./multitest.sh 
      1 0.02
     97 1.05
      2 1.06

Okay, so these results. 1 connection freakishly just worked; no idea, I’m just going to blame a glitch. If Meat Loaf said 2 out of 3 ain’t bad, 99 out of 100 must be just fine, right? The other 99 connections take that one second penalty for the SYN+ACK being dropped, and so need to wait out that TCP_TIMEOUT_INIT.

Does lowering TCP_TIMEOUT_INIT help here? We didn’t test it with our new kernel, as we only made the change on the client and not on the server, and here we are now dependent on the server’s TCP stack.

So let’s swap roles: run the server on the VM, and the multitest script on my desktop. Thankfully, as the program is in Rust, I can just copy the binary, nice! A quick edit of multitest.sh to point at the VM and we are good to go.

desktop$ sudo iptables -D OUTPUT -p tcp --tcp-flags SYN,ACK SYN,ACK --sport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP
desktop$ sudo iptables -A INPUT -p tcp --tcp-flags SYN,ACK SYN,ACK --sport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP

desktop$ ./multitest.sh 
      5 0.10
     69 0.11
     26 0.12

Bingo, mucking around with the kernel gets that time down to 1/10th of a second.

Final Goal: Loss of final data packet.

So far we have been matching on TCP flags, but that obviously won’t work for the final test. What will work is matching on the packet size. One wrinkle: iptables’ length module matches the total IP packet length, so our final 1321 bytes of payload plus 32 bytes of TCP header (including the timestamp option) plus a 20-byte IP header comes to 1373, which you can also see as 0x055d in the hex dump above.

desktop$ sudo iptables -D INPUT -p tcp --tcp-flags SYN,ACK SYN,ACK --sport 8888 -m statistic --mode nth --every 2 --packet 0 -j DROP
desktop$ sudo iptables -A OUTPUT -p tcp --sport 8888 -m length --length 1373 -m statistic --mode nth --every 2 --packet 0 -j DROP
laptop$ ./multitest.sh 
      1 0.24
     15 0.25
      2 0.26
      1 0.28
      5 0.29
     38 0.30
      6 0.31
      1 0.32
      2 0.34
      4 0.35
      2 0.36
      3 0.45
      1 0.46
      1 0.49
      4 0.50
      9 0.51
      1 0.55
      2 0.56
      2 0.61

So this is interesting. We see two main spikes, at 0.25 and 0.30, but with a large smattering of times trailing higher. At first pass it is tricky to figure out what is going on.

Well, what is happening here is that Linux actually maintains a cache of how well it can talk to particular hosts, saving data including the estimated latency and jitter to a specific host, and those values get refined the more it talks to that host.

So here, it started out taking around 0.5 seconds to resend the missing packet, but as it realised the host could respond faster than that, it came down to the value it eventually settled on, around 0.25 seconds. This explains the tail.

You can view details of this cache with:

$ ip tcp_metrics show 192.168.0.18
192.168.0.18 age 263.708sec cwnd 10 rtt 7390us rttvar 2809us source 192.168.0.56

We can test this further by looking at what stats we get for a distant host. Here I’m going to download a file from a host on the other side of the world, and then observe the resulting metrics.

$ wget -4 http://nz.archive.ubuntu.com/ubuntu-releases/noble/ubuntu-24.04-netboot-amd64.tar.gz

$ ip tcp_metrics show | grep 163.7.134.112
163.7.134.112 age 0.768sec cwnd 10 rtt 299178us rttvar 205225us source 192.168.0.56

Here we get to see we ended up with a saved cache entry of 299ms, and a variability of 205ms. I’m not exactly sure what statistics are used to generate that rttvar, so I’m going to give it a good ignoring for now.

HOWEVER, there is a major gotcha here. We are doing something atypical: we are using the HTTP/1.1 protocol like you would have in the 1990s, not the 2020s. Most importantly, we aren’t setting Connection: keep-alive with curl, for obvious reasons: the connection is going to be torn down after curl finishes running and the process terminates. We can demonstrate this with tshark.

$ curl http://192.168.0.56:8888

# tshark -O http -f "tcp port 8888" -Y "http.request"
Hypertext Transfer Protocol
    GET / HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): GET / HTTP/1.1\r\n]
            [GET / HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /
        Request Version: HTTP/1.1
    Host: 192.168.0.56:8888\r\n
    User-Agent: curl/8.8.0\r\n

But if I do the same with a real web browser …

Hypertext Transfer Protocol
    GET / HTTP/1.1\r\n
        [Expert Info (Chat/Sequence): GET / HTTP/1.1\r\n]
            [GET / HTTP/1.1\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Request Method: GET
        Request URI: /
        Request Version: HTTP/1.1
    Host: 192.168.0.56:8888\r\n
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0\r\n
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8\r\n
    Accept-Language: en-US,en;q=0.5\r\n
    Accept-Encoding: gzip, deflate\r\n
    Connection: keep-alive\r\n

So, does this impact things, and how can we hook up a test apparatus to usefully test it and collect time-to-complete metrics? Our old hack of running curl and timing how long the process takes won’t work if we only run curl once for all 100 requests.

Time for some more code.

// Hacky Test Rust Client.
use reqwest::blocking::Client;
use std::time::Instant;
use std::time::Duration;
use std::thread::sleep;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new();
    let url = "http://192.168.0.56:8888/";
    let fiftyms = Duration::from_millis(50);

    for _i in 0..10 {
        let now = Instant::now();
        // Read the whole body, so the timing covers the full transfer,
        // not just the arrival of the response headers.
        let _body = client.get(url).send()?.bytes()?;
        let elapsed = now.elapsed();
        println!("{}", elapsed.as_millis());
        sleep(fiftyms);
    }
    Ok(())
}

This runs 10 requests against the server code, but does them all from a single client, which will maintain the same session to the server. The first request will require the SYN / SYN+ACK / ACK three-way handshake, but the rest can just send a GET /. Later requests will also benefit from the network stack learning the RTO and scaling the window, but we don’t send enough data in these tests for that to be particularly significant.

The results.

$ cargo run
5
2
2
3
3
2
2
3
3

As expected, the first request takes longer than the others.

So let’s add back in that filter on packet size, so we are “dropping the last useful packet”, and see what happens.

$ cargo run
7
4
4
3
4
3
5
3
4
5

And here we can see almost no impact at all. Everything got slightly slower because of the loss and the need to signal for a retransmission and then receive it, but there is no long pause waiting for a timeout.

If instead the final data packet had also carried the “FIN” flag, there would be a timeout pause. There wouldn’t be a packet with a higher sequence number arriving to trigger the fast retransmission, so the receiver would never ask for it. It would be up to the sender to time out waiting for the ACK and then retransmit the packet.
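
We can make the receiver’s side of this concrete with a toy model of cumulative ACKing (nothing like real kernel code), using the relative sequence numbers from our capture, with the 1321-byte packet at sequence 2897 dropped:

fn main() {
    // (seq, len) of the segments that still arrive; the final data
    // packet (seq 2897, len 1321) was dropped. The trailing FIN
    // carries no data but sits above the hole.
    let arrived = [(1u32, 1448u32), (1449, 1448), (4218, 0)];
    let mut next_expected = 1;
    for (seq, len) in arrived {
        if seq == next_expected {
            // In order: advance the cumulative ACK point.
            next_expected = seq + len.max(1); // a FIN consumes one sequence number
        } else {
            // Out of order: re-ACK the edge of the contiguous data,
            // which tells the sender exactly where the hole starts.
            println!("dup ACK {next_expected}, gap {next_expected}..{seq}");
        }
    }
}

The separate FIN is exactly what produces that duplicate ACK; glue the FIN onto the last data packet instead, and nothing arrives above the hole, so only the sender’s timer can save us.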

Note: It is an implementation QUIRK, not a spec or RFC requirement, that results in the last packet with data being separate from the empty packet with the FIN flag set. In different circumstances, or with different operating systems, you might see the final datastream content and the FIN colocated.

So what would that timeout be?

We saw earlier how the timeout policy for a missing initial SYN is set in Linux, along with some hacks to change it.

What about unacknowledged datastream packets that can’t be handled via Fast Retransmission and Selective Acknowledgement, like here, where there are no subsequent duplicate ACKs or SACKs to work with?

Well, that timeout is dynamic. This ss command will show you a lot of detail about your existing TCP connections.

$ ss -n -6 -t -e -i
ESTAB 0 0 [2a06:5902:aaaa:bbbb:cccc:dddd:eeee:ffff]:40516 [2a00:1450:4009:81f::200e]:443 uid:1000 ino:133282636 sk:1030 
     cgroup:/user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-firefox-5135.scope <-> ESTAB
     0 ts sack cubic wscale:8,7 rto:220 rtt:18.673/1.891 ato:40 mss:1380 pmtu:1500 rcvmss:1380 advmss:1428 cwnd:10 
     bytes_sent:1395 bytes_acked:1396 bytes_received:95829 segs_out:55 segs_in:85 data_segs_out:8 data_segs_in:80 
     send 5912280bps lastsnd:13552 lastrcv:13536 lastack:13536 pacing_rate 11824000bps delivery_rate 1559024bps 
     delivered:9 app_limited busy:144ms rcv_rtt:16.676 rcv_space:48300 rcv_ssthresh:201196 minrtt:14.55 
     snd_wnd:69632 rcv_wnd:201216

In the example above you can see:

  1. I have a connection from me (with the leftmost redacted IPv6 address and the dynamic port).
  2. This connection is HTTPS to a Google server (with the rightmost redacted IPv6 address and the static port of 443).
  3. The app for this connection is firefox.
  4. The measured Round Trip Times of the connection are 18.673/1.891 ms. The first value is the smoothed average round trip time, and the second is a measure of how variable the round trip time is. I don’t know the exact formula for how this variance is calculated, but a higher number implies a more variable round trip time.
  5. rcv_rtt is the receive-side estimate of the round trip time.
  6. minrtt is the minimum round trip time seen for a packet.
  7. And a lot more data about the connection I won’t go into here (and in some cases I’m ignorant of what it means).

I know at one point in the history of the Linux kernel, it would wait for (RTT average * 2) + 20ms. I went looking for this code for this post so I could confirm it, but I couldn’t find it in the latest 6.10-rc3 kernel. I’m not sure if this is still the case, or if it has been changed.
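
For reference, the textbook formula from RFC 6298 is RTO = SRTT + max(G, 4 × RTTVAR), and Linux also enforces a 200ms minimum retransmission timeout (TCP_RTO_MIN is HZ/5). Fitting that to the ss output above lands suspiciously close to the reported rto:220. A rough check, very much an approximation and not the kernel’s actual fixed-point arithmetic:

fn main() {
    let srtt_ms: f64 = 18.673; // the rtt field from ss
    let rttvar_ms: f64 = 1.891; // the rttvar field from ss
    // Plain RFC 6298: far too aggressive for Linux's tastes.
    let rfc6298 = srtt_ms + 4.0 * rttvar_ms;
    println!("raw RFC 6298 RTO: {rfc6298:.1}ms"); // ~26.2ms
    // With the variance term floored at 200ms we land where ss does.
    let floored = srtt_ms + (4.0 * rttvar_ms).max(200.0);
    println!("with 200ms floor: {floored:.1}ms"); // ~218.7ms, reported as rto:220
}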

Finally, it is important to note everything here is implementation dependent. Just because I’ve shown or suggested something works a particular way doesn’t mean it will on your system. As shown by the exponential backoff vs 1 second linear backoff change, this isn’t even universal from one Linux kernel to another, let alone on OSX, Windows, or embedded systems.

Takeaways.

  • 99% of the time, you can ignore everything written here. Just use TCP or some other abstraction (websockets/QUIC) and let it handle the details for you.
  • About 0.9% of the time, knowledge like this will help you understand weird edge cases, or consider tuning options.
  • About 0.1% of the time, you will need to deep dive something I’ve written about here to get something done.
  • Some really tiny amount of the time, you are writing your own implementations of everything, either directly or, like QUIC does, as a layer over UDP.