Congestion Control Overhaul#1149
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1149 +/- ##
==========================================
+ Coverage 81.48% 81.51% +0.02%
==========================================
Files 81 81
Lines 25007 25040 +33
==========================================
+ Hits 20378 20412 +34
+ Misses 4629 4628 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
1031a6b to
7021ab6
Compare
|
Have overhauled Reno (RFC 5681) implementation and now believe it works exactly as intended. Next push will be a bunch of tests for the flow from startup to steady state congestion avoidance.
|
7021ab6 to
5937fe3
Compare
- operates in terms of congestion events - takes unACKed byte count - fast-retransmit operates correctly
…FC compliant: - enter slow start (and exit fast recovery) on RTO - prevent multiple `on_loss()` calls triggering window reductions - deflate `cwnd` when leaving fast recovery - cap slow start `cwnd` increment to 1MSS per ACK - use new `in_flight` to calculate `ssthresh`
5937fe3 to
c54a7f3
Compare
|
Reno tests added. Pretty happy with the Reno implementation. However, there's now a big performance degradation in the netsim test. This is due to some changes on fast-retransmit handling. Previously, on loss, all data would be resent rather than just the first segment (as per the RFC). Ontop of this netsim degredation for no CC, when using Reno in the netsim the results are even worse. So... I've created a multi-flow netsim that better shows the benefits of congestion control. These are the initial Reno results as a percentage change from the new no-CC baseline. The new test simulates "realistic" router packet loss (rather than straight randomization) and has per flow RTTs and traffic. We see better fairness, throughput and less overwhelming of the router when multiple parallel flows enter the picture. Next commit (and then probably immediately as a fresh PR) will be the multi-flow netsim stuff. |
|
this can be closed right? the other PRs contain all the changes here |
Edited. Again.
This has turned into a much bigger thing than I thought.
In short, the congestion implementations have many bugs and I'm now trying to fix them. This comes in three parts. Changes to the
Controller, changes toRenoand changes toCubic.The Controller
The controller doesn't understand congestion events and this is a source of multiple bugs.
Controller::retransmitis used to notify of both an RTO and the fast retransmit timer, making it hard for the congestion control implementations to decide between entering slow start or entering fast recovery.Controller::on_duplicate_ackis seemingly treated as notification of a single duplicated ACK (as the name suggests) insocket/tcp.rsbut as a notification of congestion in the congestion control implementations.In CUBIC, after 3 consecutive duplicate ACKs this means
w_maxcould end up more than half what it should be (0.3 * w_maxvs0.7 * w_max). In Reno, after 3 consecutive duplicate ACKs this meansssthreshcould end up three times smaller than it should be (cwnd / 6vscwnd / 2).My fix here has been to distinguish between congestion events (RTO and repeated duplicate ACKs).
I've added
Controller::on_rtoandController::on_loss. These can be used by the congestion algorithms to decide between entering slow start and fast recovery.I've also introduced the
bytes_in_flightparameter to these methods to give the congestion controllers more information (NewReno would like it for example) and addedlentoController::on_dup_ackfor when SACK and D-SACK come about (within a month given the time I've been allocated to all this).Reno
Beyond the bugs that came from the inability to distinguish between loss events and whatnot, there were a number of bugs in the Reno implementation. Here's some:
Exiting fast recovery (receiving a non-duplicated ACK) should deflate the
cwndback to thessthresh(as thecwndis artificially inflated from all the duplicate ACKs we advanced it by). This implementation however had it the wrong way round and was settingssthreshequal to thecwnd. This would have increased the chance of running into more packetloss.Doing fast recovery involves incrementing the
cwndfor each duplicate ACK received. This implementation wasn't doing anything with duplicated ACKs.Not setting
cwndto the correct value after an RTO and entering slow start.CUBIC
Beyond the bugs that came from the inability to distinguish between events, CUBIC had other bugs too. Here's some:
cwndand growth rate for no reason.Next
Treating this as a scratchpad PR. Feedback appreciated. Once happy will break into three PRs. Controller changes, Reno changes, Cubic changes.