Congestion Control Overhaul#1149

Open
JPDye wants to merge 4 commits into smoltcp-rs:main from JPDye:cubic-reduction-fix

Conversation

JPDye commented May 5, 2026

Contributor

Edited. Again.

TL;DR: refactors the Controller trait to distinguish RTO / 3-dup-ACK / per-dup-ACK events and fixes several Reno and CUBIC bugs, with more fixes and tests to come.

This has turned into a much bigger thing than I thought.

In short, the congestion control implementations have many bugs and I'm now trying to fix them. This comes in three parts: changes to the Controller, changes to Reno and changes to CUBIC.

The Controller

The controller doesn't understand congestion events and this is a source of multiple bugs.

Controller::retransmit is used to notify of both an RTO and the fast retransmit timer, making it hard for the congestion control implementations to decide between entering slow start or entering fast recovery.

Controller::on_duplicate_ack is seemingly treated as notification of a single duplicated ACK (as the name suggests) in socket/tcp.rs but as a notification of congestion in the congestion control implementations.

In CUBIC, after 3 consecutive duplicate ACKs this means w_max could end up less than half what it should be (0.3 * w_max vs 0.7 * w_max). In Reno, after 3 consecutive duplicate ACKs this means ssthresh could end up three times smaller than it should be (cwnd / 6 vs cwnd / 2).
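
The CUBIC figure can be checked with a little arithmetic. This is a minimal sketch, assuming the usual CUBIC multiplicative decrease factor of 0.7 and assuming the reduction was applied once per duplicate ACK rather than once per loss event. `reduce_n_times` is a hypothetical helper for illustration, not smoltcp code.

```rust
/// Applies a multiplicative decrease `factor` to `window`, `times` times in a row.
/// Hypothetical helper used only to illustrate the compounding effect.
fn reduce_n_times(window: f64, factor: f64, times: u32) -> f64 {
    window * factor.powi(times as i32)
}

/// Ratio between the window after three compounded reductions and the
/// window after the single reduction that one loss event should cause.
fn compounded_vs_single(factor: f64) -> f64 {
    reduce_n_times(1.0, factor, 3) / reduce_n_times(1.0, factor, 1)
}
```

With a factor of 0.7, three reductions leave about 0.34 of the original value, which is a little under half of the intended 0.7.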

My fix here has been to distinguish between congestion events (RTO and repeated duplicate ACKs).

I've added Controller::on_rto and Controller::on_loss. These can be used by the congestion algorithms to decide between entering slow start and fast recovery.

I've also introduced a bytes_in_flight parameter on these methods to give the congestion controllers more information (NewReno would like it, for example). I've also added len to Controller::on_dup_ack for when SACK and D-SACK come about (within a month given the time I've been allocated to all this).
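
As a rough sketch of the split described above: the method names `on_rto`, `on_loss` and `on_dup_ack` and the `bytes_in_flight` and `len` parameters come from this description, but the exact signatures, the `EventLog` recorder and the `deliver_dup_acks` helper are assumptions made for illustration and may not match the actual code.

```rust
/// Hypothetical shape of the reworked trait. Signatures are assumed.
trait Controller {
    /// Retransmission timeout fired: a controller would typically restart slow start.
    fn on_rto(&mut self, bytes_in_flight: usize);
    /// Loss inferred from repeated duplicate ACKs: a controller would enter fast recovery.
    fn on_loss(&mut self, bytes_in_flight: usize);
    /// A single duplicate ACK arrived. `len` is reserved for future SACK / D-SACK data.
    fn on_dup_ack(&mut self, len: usize);
}

/// Minimal recorder that only logs which events it was told about.
#[derive(Default)]
struct EventLog {
    events: Vec<&'static str>,
}

impl Controller for EventLog {
    fn on_rto(&mut self, _bytes_in_flight: usize) {
        self.events.push("rto");
    }
    fn on_loss(&mut self, _bytes_in_flight: usize) {
        self.events.push("loss");
    }
    fn on_dup_ack(&mut self, _len: usize) {
        self.events.push("dup_ack");
    }
}

/// Sketch of the socket side: every duplicate ACK is reported,
/// but loss is signalled only once, on the third one.
fn deliver_dup_acks<C: Controller>(cc: &mut C, count: usize, bytes_in_flight: usize) {
    for n in 1..=count {
        cc.on_dup_ack(0);
        if n == 3 {
            cc.on_loss(bytes_in_flight);
        }
    }
}
```

The point of the split is that a controller no longer has to guess: five duplicate ACKs produce five `on_dup_ack` calls but a single `on_loss`, and a timeout arrives separately through `on_rto`.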

Reno

Beyond the bugs that came from the inability to distinguish between loss events, there were a number of other bugs in the Reno implementation. Here are some:

  • Exiting fast recovery (receiving a non-duplicated ACK) should deflate the cwnd back to the ssthresh (as the cwnd is artificially inflated by all the duplicate ACKs we advanced it for). This implementation however had it the wrong way round and was setting ssthresh equal to the cwnd. This would have increased the chance of running into more packet loss.

  • Doing fast recovery involves incrementing the cwnd for each duplicate ACK received. This implementation wasn't doing anything with duplicated ACKs.

  • Not setting cwnd to the correct value after an RTO and entering slow start.
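
The behaviour these bullets describe can be sketched as follows. This is an illustrative model of RFC 5681 Reno, not the code in this PR: the `Reno` struct, its fields, the initial window of 2 MSS and the method signatures are all assumptions.

```rust
/// Illustrative Reno state (RFC 5681). Names and layout are assumed.
struct Reno {
    mss: usize,
    cwnd: usize,
    ssthresh: usize,
    in_recovery: bool,
}

impl Reno {
    fn new(mss: usize) -> Self {
        // Initial window of 2 MSS is an arbitrary choice for this sketch.
        Reno { mss, cwnd: 2 * mss, ssthresh: usize::MAX, in_recovery: false }
    }

    /// ssthresh = max(FlightSize / 2, 2 * SMSS), as in RFC 5681.
    fn reduced_ssthresh(&self, bytes_in_flight: usize) -> usize {
        (bytes_in_flight / 2).max(2 * self.mss)
    }

    /// Third duplicate ACK: enter fast recovery, at most once per loss event.
    fn on_loss(&mut self, bytes_in_flight: usize) {
        if self.in_recovery {
            return;
        }
        self.ssthresh = self.reduced_ssthresh(bytes_in_flight);
        self.cwnd = self.ssthresh + 3 * self.mss;
        self.in_recovery = true;
    }

    /// Each further duplicate ACK during recovery inflates cwnd by one segment.
    fn on_dup_ack(&mut self) {
        if self.in_recovery {
            self.cwnd += self.mss;
        }
    }

    /// An ACK for new data arrived.
    fn on_ack(&mut self, acked: usize) {
        if self.in_recovery {
            // Leaving fast recovery: deflate cwnd back to ssthresh.
            self.cwnd = self.ssthresh;
            self.in_recovery = false;
        } else if self.cwnd < self.ssthresh {
            // Slow start: grow by at most one MSS per ACK.
            self.cwnd += acked.min(self.mss);
        } else {
            // Congestion avoidance: roughly one MSS per round trip.
            self.cwnd += (self.mss * self.mss / self.cwnd).max(1);
        }
    }

    /// Retransmission timeout: shrink ssthresh and restart slow start from one segment.
    fn on_rto(&mut self, bytes_in_flight: usize) {
        self.ssthresh = self.reduced_ssthresh(bytes_in_flight);
        self.cwnd = self.mss;
        self.in_recovery = false;
    }
}
```

Note that the deflation on exit assigns `cwnd = ssthresh`, not the other way round, which is the inversion described in the first bullet.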

CUBIC

Beyond the bugs that came from the inability to distinguish between events, CUBIC had other bugs too. Here are some:

  • Entering fast recovery on startup without any packet loss occurring, significantly reducing cwnd and growth rate for no reason.

Next

Treating this as a scratchpad PR. Feedback appreciated. Once happy, will break into three PRs: Controller changes, Reno changes, CUBIC changes.

JPDye changed the title from "test for dup-ack cwnd reduction (+ discovered early recovery bug)" to "Fix CUBIC congestion window bugs" May 5, 2026

codecov Bot commented May 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.51%. Comparing base (ffeaf62) to head (c54a7f3).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1149      +/-   ##
==========================================
+ Coverage   81.48%   81.51%   +0.02%     
==========================================
  Files          81       81              
  Lines       25007    25040      +33     
==========================================
+ Hits        20378    20412      +34     
+ Misses       4629     4628       -1     

JPDye force-pushed the cubic-reduction-fix branch 5 times, most recently from 1031a6b to 7021ab6 May 7, 2026 11:12

JPDye commented May 7, 2026

Contributor Author

Have overhauled the Reno (RFC 5681) implementation and now believe it works exactly as intended. Next push will be a bunch of tests for the flow from startup to steady-state congestion avoidance:

  1. slow start -> RTO -> slow start -> congestion avoidance
  2. slow start -> triple dup ACK -> fast recovery -> congestion avoidance

JPDye changed the title from "Fix CUBIC congestion window bugs" to "Congestion Control Overhaul" May 7, 2026
JPDye force-pushed the cubic-reduction-fix branch from 7021ab6 to 5937fe3 May 20, 2026 14:40
Dirbaio and others added 4 commits May 20, 2026 15:46
- operates in terms of congestion events
- takes unACKed byte count
- fast-retransmit operates correctly
…FC compliant:

- enter slow start (and exit fast recovery) on RTO
- prevent multiple `on_loss()` calls triggering window reductions
- deflate `cwnd` when leaving fast recovery
- cap slow start `cwnd` increment to 1MSS per ACK
- use new `in_flight` to calculate `ssthresh`
JPDye force-pushed the cubic-reduction-fix branch from 5937fe3 to c54a7f3 May 20, 2026 14:46

JPDye commented May 21, 2026

Contributor Author

Reno tests added. Pretty happy with the Reno implementation.

However, there's now a big performance degradation in the netsim test. This is due to some changes to fast-retransmit handling. Previously, on loss, all data would be resent rather than just the first segment (which is what the RFC specifies).

On top of this netsim degradation for no CC, when using Reno in the netsim the results are even worse. So... I've created a multi-flow netsim that better shows the benefits of congestion control.

These are the initial Reno results as a percentage change from the new no-CC baseline.

╭───┬───────┬──────────┬──────────┬──────────┬──────────┬────────╮
│ # │ flows │ agg_thru │ min_thru │ max_thru │ fairness │ drops  │
├───┼───────┼──────────┼──────────┼──────────┼──────────┼────────┤
│ 0 │     1 │ -6.6%    │ -6.6%    │ -6.6%    │ 0%       │ 0%     │
│ 1 │     2 │ -6.2%    │ -6.4%    │ -6%      │ 0%       │ 0%     │
│ 2 │     4 │ -3.4%    │ -5.2%    │ -2.2%    │ -0.4%    │ 0%     │
│ 3 │    16 │ +13.2%   │ +70.3%   │ -0.2%    │ +4.5%    │ -81.3% │
│ 4 │    32 │ +16.1%   │ +93%     │ -15.2%   │ +8.1%    │ -78.7% │
│ 5 │    64 │ +16.8%   │ +149.3%  │ -19.6%   │ +8.1%    │ -67.2% │
╰───┴───────┴──────────┴──────────┴──────────┴──────────┴────────╯

The new test simulates "realistic" router packet loss (rather than straight randomization) and has per-flow RTTs and traffic. Once many parallel flows enter the picture (16 and up in the table), we see better fairness, better aggregate throughput and far fewer drops at the router.
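
For readers wondering how columns like these might be derived, here is a minimal sketch. It assumes the fairness column is based on Jain's fairness index, which the comment above does not confirm, and that each cell is a plain percentage change against the no-CC baseline. `jain_index` and `percent_change` are hypothetical helpers, not part of the netsim.

```rust
/// Jain's fairness index: (sum x)^2 / (n * sum x^2).
/// 1.0 means every flow gets the same throughput; 1/n means one flow gets everything.
/// Returns None for an empty slice or when all throughputs are zero.
fn jain_index(throughputs: &[f64]) -> Option<f64> {
    let n = throughputs.len() as f64;
    let sum: f64 = throughputs.iter().sum();
    let sum_sq: f64 = throughputs.iter().map(|x| x * x).sum();
    if throughputs.is_empty() || sum_sq == 0.0 {
        return None;
    }
    Some(sum * sum / (n * sum_sq))
}

/// Percentage change of `value` relative to `baseline`.
fn percent_change(baseline: f64, value: f64) -> f64 {
    (value - baseline) / baseline * 100.0
}
```

Under this assumption, a run where the slowest flows speed up a lot while the fastest flow slows slightly would raise the index, which is consistent with the direction of the min_thru, max_thru and fairness columns at 16 flows and above.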

Next commit (and then probably immediately as a fresh PR) will be the multi-flow netsim stuff.

JPDye mentioned this pull request May 26, 2026

Dirbaio commented Jun 14, 2026

Member

this can be closed right? the other PRs contain all the changes here

2 participants