Websocket setup time (for the relay) can significantly influence the first RTT sample for the relay path. This affects both smoothed_rtt and rttvar, leading to inflated PTO calculation and slow teardown (possibly related to #3875).
The symptom I'm seeing is ~5s teardown on a short-lived, relatively high latency (100ms rtt) connection. Adding lots of print statements helped me realize that QUIC packets have send_time set at poll_transmit time, but this might not correspond with when the packet is actually sent, depending on the transport. However, the original (incorrect) send_time gets used to measure RTT.
For naked UDP, obviously, writes to the socket are going to happen pretty soon after the packet is created. But with Iroh's websocket transport, the initial packets seem to end up queued and therefore unacked for a long time (~200ms) before they actually get sent. Near as I can tell, this is because the websocket setup (TLS handshake) is still happening.
Here's an unscientific sample of my println debugging from today:
[pkt_sent] cid=e386b8ae47eb214f space=Initial pn=0 now=Instant { tv_sec: 1489788, tv_nsec: 391310583 } size=1200
[relay_send] dst=6200475a87aa253ccee51ff70b231fee890a3c00f59a30179152a9ae770047d9 len=1200 at Instant { tv_sec: 1489788, tv_nsec: 559251166 }
[rtt_sample] cid=e386b8ae47eb214f space=Initial path=PathId(0) sent_at=Instant { tv_sec: 1489788, tv_nsec: 391310583 } rtt=459.655458ms
<snip>
[set_close_timer] cid=e386b8ae47eb214f 3*pto=4.445353683s rtt=RttEstimator { latest:
108.060125ms, smoothed: Some(415.706041ms), var: 260.26963ms, min: 108.060125ms }
You can see a ~170ms delay in between the first handshake packet and the first actual write over the websocket, and that delay gets added to the rtt measurement. Shortly afterwards, on disconnect, rttvar is very high because of that first sample, and rttvar ends up added 12x to the drain timeout.
From RFC 9002 6.2.1:
PTO = smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay
And draining takes 3x PTO. All together, that looks suspiciously like the ~5s of draining that I'm seeing.
Iroh
Version:
0.97 (3d6e26b1182ecf63ce273793d44b3908117211e1)
Websocket setup time (for the relay) can significantly influence the first RTT sample for the relay path. This affects both
smoothed_rttandrttvar, leading to inflated PTO calculation and slow teardown (possibly related to #3875).The symptom I'm seeing is ~5s teardown on a short-lived, relatively high latency (100ms rtt) connection. Adding lots of print statements helped me realize that QUIC packets have
send_timeset atpoll_transmittime, but this might not correspond with when the packet is actually sent, depending on the transport. However, the original (incorrect)send_timegets used to measure RTT.For naked UDP, obviously, writes to the socket are going to happen pretty soon after the packet is created. But with Iroh's websocket transport, the initial packets seem to end up queued and therefore unacked for a long time (~200ms) before they actually get sent. Near as I can tell, this is because the websocket setup (TLS handshake) is still happening.
Here's an unscientific sample of my println debugging from today:
You can see a ~170ms delay in between the first handshake packet and the first actual write over the websocket, and that delay gets added to the rtt measurement. Shortly afterwards, on disconnect,
rttvaris very high because of that first sample, andrttvarends up added 12x to the drain timeout.From RFC 9002 6.2.1:
And draining takes 3x PTO. All together, that looks suspiciously like the ~5s of draining that I'm seeing.
Iroh
Version: