|
| 1 | +# HTTP/2 Client Coverage Round 4 Diagnosis |
| 2 | + |
| 3 | +> **Issue**: [#1110](https://github.com/kcenon/network_system/issues/1110) |
| 4 | +> **Parent**: [#1106](https://github.com/kcenon/network_system/issues/1106) |
| 5 | +> **Epic**: [#953](https://github.com/kcenon/network_system/issues/953) |
| 6 | +> **Reference run**: [25464342500](https://github.com/kcenon/network_system/actions/runs/25464342500) at sha `d5378644` |
| 7 | +
|
| 8 | +## Question |
| 9 | + |
| 10 | +Why did Round 1 (PR #994 / #1062), Round 2 (PR #1108 — 6 TEST_F), and Round 3 |
| 11 | +(PR #1109 — 7 TEST_F) all produce **0pp delta** on per-file coverage of |
| 12 | +`src/protocols/http2/http2_client.cpp` (still 18.75% line / 9.91% branch, |
| 13 | +108/576 LH/LF, 97/979 BRH/BRF) despite landing 13 substrate-driven TEST_F? |
| 14 | + |
| 15 | +## Answer (one line) |
| 16 | + |
| 17 | +The 7 Round-3 dispatcher TEST_F **did execute** under coverage instrumentation |
| 18 | +but **all 7 FAILED** at the SETTINGS-handshake `wait_for` gate, never reaching |
| 19 | +the dispatcher branches they were designed to cover. The same coverage-only |
| 20 | +timing trap that motivated the `NETWORK_COVERAGE_BUILD` guard for the two |
| 21 | +Round-2 positive-path tests applies to the entire `Http2ClientHermeticTransport` |
| 22 | +fixture, not just the two tests originally guarded. |
| 23 | + |
| 24 | +## Evidence |
| 25 | + |
| 26 | +Direct grep of the `Run tests` step stdout in run [25464342500] shows every |
| 27 | +Round-3 dispatcher test on the `(Failed)` summary list: |
| 28 | + |
| 29 | +| ctest # | TEST_F | Outcome under coverage | |
| 30 | +|---------|-----------------------------------------------------------------|------------------------| |
| 31 | +| 5886 | `ServerPingFrameDrivesHandlePingAndKeepsConnectionAlive` | FAILED (5.26 s) | |
| 32 | +| 5887 | `ServerPingAckFrameIsAbsorbedSilently` | FAILED (5.30 s) | |
| 33 | +| 5888 | `ServerGoawayFrameFlipsConnectionStateToDisconnected` | FAILED (5.35 s) | |
| 34 | +| 5889 | `ServerWindowUpdateOnConnectionStreamExpandsWindow` | FAILED (5.38 s) | |
| 35 | +| 5890 | `ServerWindowUpdateOnUnknownStreamIsSilentlyIgnored` | FAILED (5.34 s) | |
| 36 | +| 5891 | `ServerRstStreamOnUnknownStreamIsSilentlyIgnored` | FAILED (5.28 s) | |
| 37 | +| 5892 | `ServerUnknownFrameTypeIsHandledWithoutCrashing` | FAILED (5.12 s) | |
| 38 | + |
| 39 | +The baseline fixture sentinel test (#5868, `ConnectCompletesSettingsExchangeWithMockPeer`) |
| 40 | +also FAILED in the same coverage run, along with most other |
| 41 | +`Http2ClientHermeticTransportTest.*` cases (#5870–#5880). This is not a |
| 42 | +new-test regression — the entire hermetic-transport fixture cannot complete |
| 43 | +the SETTINGS handshake within the 3-second `wait_for` budget under |
| 44 | +`-fprofile-arcs -ftest-coverage` instrumentation on the GitHub Actions |
| 45 | +ubuntu-latest runner. |
| 46 | + |
| 47 | +## Why ctest reported PASS for the Coverage Analysis workflow |
| 48 | + |
| 49 | +The `Run tests` step uses `ctest` invoked with permissive flags |
| 50 | +(no `--output-on-failure` strict gate), and the workflow defines the |
| 51 | +"Coverage Analysis" job as PASS as long as the workflow steps that produce |
| 52 | +`coverage.info` succeed. Per-test failure in the test executor does not fail |
| 53 | +the workflow — only an `lcov` capture failure does. So the workflow shows |
| 54 | +green checkmarks even though 7 of the 7 new tests (and ~30 other |
| 55 | +hermetic-transport tests) produced no coverage data because they aborted at |
| 56 | +the handshake. |
| 57 | + |
| 58 | +## Why hypotheses 1 and 3 are ruled out |
| 59 | + |
| 60 | +**Hypothesis 1 (CMake `NETWORK_COVERAGE_BUILD` over-scoping)** — refuted by |
| 61 | +direct inspection of `tests/unit/http2_client_branch_test.cpp`: |
| 62 | + |
| 63 | +- The `#ifndef NETWORK_COVERAGE_BUILD` guard wraps lines 1257–1328 only, which |
| 64 | + contain exactly the two flaky positive-path tests that motivated the guard |
| 65 | + (`SlowWriteServerSettingsStillCompletesHandshake` and |
| 66 | + `EchoOneTruncateAtNineDropsResponsePayloadFailingGet`). |
| 67 | +- The 7 Round-3 dispatcher TEST_F are at lines 1361–1580, **outside** the |
| 68 | + guard, so they are compiled into the coverage binary. |
| 69 | +- The grep above confirms all 7 are scheduled and run by ctest under coverage |
| 70 | + (their names appear in the per-test `[ RUN ]` markers and in the |
| 71 | + `Failed` summary). |
| 72 | + |
| 73 | +The macro is correctly scoped. The dispatcher tests reach the binary; they |
| 74 | +just cannot complete the handshake under instrumentation. |
| 75 | + |
| 76 | +**Hypothesis 3 (lcov filter dropping a second production SF)** — refuted by |
| 77 | +the SF-record cross-check already in the #1106 Round-3 post-merge comment: |
| 78 | +`coverage_filtered.info` exposes 1 SF for `http2_client.cpp`, `coverage.info` |
| 79 | +exposes 2 (the second is `tests/test_http2_client.cpp`), and the `--remove` |
| 80 | +rule strips only the test file. The 18.75% / 9.91% numbers are the true |
| 81 | +production-source measurement, no second production compile unit is being |
| 82 | +filtered out. |
| 83 | + |
| 84 | +## Hypothesis 2 (overlap with baseline) does not apply |
| 85 | + |
| 86 | +The dispatcher tests don't overlap with the baseline `ConnectCompletes…` |
| 87 | +test because the baseline test ALSO failed under coverage instrumentation |
| 88 | +in run 25464342500. Even if the baseline did pass and overlap, the new tests |
| 89 | +were designed to drive `handle_ping_frame`, `handle_goaway_frame`, |
| 90 | +`handle_window_update_frame`, `handle_rst_stream_frame`, and the |
| 91 | +`process_frame` default branch — none of which the baseline reaches because |
| 92 | +the baseline never receives a server-originated post-handshake frame. |
| 93 | + |
| 94 | +The actual root cause is upstream of the dispatch: the SETTINGS handshake |
| 95 | +itself is the failure point. |
| 96 | + |
| 97 | +## Why the handshake fails specifically under coverage instrumentation |
| 98 | + |
| 99 | +Three compounding factors: |
| 100 | + |
| 101 | +1. **gcov instrumentation overhead.** `-fprofile-arcs -ftest-coverage` |
| 102 | + inserts a counter increment at every basic block. For the SSL handshake |
| 103 | + path through OpenSSL plus the ASIO async I/O state machine plus |
| 104 | + `mock_h2_server_peer`'s framing logic, this cumulatively pushes the |
| 105 | + end-to-end SETTINGS exchange past the 3-second budget that |
| 106 | + `wait_for(predicate, std::chrono::seconds(3))` enforces in |
| 107 | + `tests/support/hermetic_transport_fixture.h`. |
| 108 | +2. **Coroutine-based async I/O does not yield CPU during gcov flush.** |
| 109 | + `mock_h2_server_peer` runs on a separate thread that performs a blocking |
| 110 | + `asio::read` for each frame header; under coverage, the per-block |
| 111 | + counter increment in the read path slows the ack-emission rate. |
| 112 | +3. **Polling cadence is 5ms.** `wait_for` polls every 5ms, but each poll |
| 113 | + itself triggers gcov increments in the predicate (`peer.settings_exchanged()` |
| 114 | + reads an `std::atomic<bool>` inside an instrumented function). Predicate |
| 115 | + evaluation cost scales with the rate. |
| 116 | + |
| 117 | +Wall-clock evidence: the failed dispatcher tests took 5.1–5.4 s each under |
| 118 | +coverage, which is more than 1.7× the 3-second budget. On Debug/Release |
| 119 | +matrix builds (no instrumentation), the same fixture's tests typically |
| 120 | +complete in 100–300 ms. |
| 121 | + |
| 122 | +## Recommended fix (proposed; out of scope for this diagnostic issue) |
| 123 | + |
| 124 | +A new sub-issue `test(http2): widen NETWORK_COVERAGE_BUILD guard around the |
| 125 | +full dispatcher TEST_F block` should land before any Round 5 work. Two |
| 126 | +options exist; both are non-invasive (test-only): |
| 127 | + |
| 128 | +### Option A — extend the existing macro guard |
| 129 | + |
| 130 | +Wrap lines 1361–1580 (the 7 dispatcher TEST_F) in |
| 131 | +`#ifndef NETWORK_COVERAGE_BUILD` so they are excluded from coverage builds |
| 132 | +the same way the two Round-2 flaky tests are. Pro: zero new infrastructure, |
| 133 | +keeps the tests in Debug/Release matrix builds where they already pass. |
| 134 | +Con: contributes 0pp to coverage, so #1106 acceptance criteria still need |
| 135 | +a different mechanism. |
| 136 | + |
| 137 | +### Option B — generous coverage-only timeout multiplier |
| 138 | + |
| 139 | +Add a `kCoverageWaitMultiplier` constant in `hermetic_transport_fixture.h` |
| 140 | +that multiplies the `wait_for` timeout by 5× when `NETWORK_COVERAGE_BUILD` |
| 141 | +is defined (so the 3-second budget becomes 15 s). Increase the default |
| 142 | +timeout for `wait_for(...)` in `make_connected_client` similarly. Pro: |
| 143 | +allows the dispatcher tests to actually complete under instrumentation |
| 144 | +and contribute to the line/branch numbers. Con: lengthens the coverage |
| 145 | +workflow runtime; the 7 tests may take 50+ s combined. |
| 146 | + |
| 147 | +### Option C — measure dispatcher coverage outside the hermetic fixture |
| 148 | + |
| 149 | +The dispatcher branches in `http2_client.cpp` (`handle_ping_frame`, |
| 150 | +`handle_goaway_frame`, …) can also be exercised by a unit test that |
| 151 | +constructs an `http2_client` directly, calls a friend-injected |
| 152 | +`process_frame(frame_bytes)` method, and observes the resulting state. |
| 153 | +This bypasses the SETTINGS handshake entirely and avoids the timing trap |
| 154 | +at the cost of needing a friend-class hook. Pro: deterministic, fast even |
| 155 | +under coverage. Con: requires a private-access seam in production code or |
| 156 | +a friend declaration in a coverage-only header. |
| 157 | + |
| 158 | +**Recommendation**: Option B is the smallest change. Option C is the |
| 159 | +right long-term answer for testing the dispatcher branches in isolation. |
| 160 | +Option A unblocks the workflow noise but does not help #1106 close. |
| 161 | + |
| 162 | +## Acceptance criteria status (this issue) |
| 163 | + |
| 164 | +- [x] Confirm whether the 7 Round-3 dispatcher TEST_F actually executed |
| 165 | + under coverage instrumentation in run [25464342500]. |
| 166 | + **Answer**: All 7 are scheduled by ctest, all 7 abort with FAILED at |
| 167 | + the SETTINGS-handshake `wait_for` gate (5.1–5.4 s each). |
| 168 | +- [x] If they did NOT execute, identify the CMake/macro mechanism that |
| 169 | + excluded them. **Answer**: They DO execute. The macro is correctly |
| 170 | + scoped to the two original flaky tests only. |
| 171 | +- [x] If they DID execute but produced 0pp, document why. |
| 172 | + **Answer**: The fixture-wide handshake timing trap drops them at the |
| 173 | + first `EXPECT_TRUE(wait_for(... settings_exchanged() ..., 3 s))` |
| 174 | + assertion before any dispatcher branch runs. |
| 175 | +- [x] Verify the lcov filter pipeline does not silently drop a second |
| 176 | + production-source SF for `http2_client.cpp`. **Answer**: SF-record |
| 177 | + cross-check (already in #1106 Round-3 post-merge comment) confirms |
| 178 | + only the test file is filtered; the production source SF is intact. |
| 179 | +- [x] Output: a one-page diagnostic note. **Answer**: This file. |
| 180 | + |
| 181 | +## Next action |
| 182 | + |
| 183 | +Open Round 5 sub-issue scoped to **Option B (coverage-only timeout |
| 184 | +multiplier)** as the smallest change that lets #1106 actually progress. |
| 185 | +Round 5 acceptance criteria should require a post-fix coverage delta of |
| 186 | +at least +20pp line and +10pp branch on `http2_client.cpp`, not merely |
| 187 | +"tests pass under coverage" — to break the 0pp pattern definitively. |
| 188 | + |
| 189 | +## Related |
| 190 | + |
| 191 | +- Part of [#953](https://github.com/kcenon/network_system/issues/953) |
| 192 | +- Part of [#1106](https://github.com/kcenon/network_system/issues/1106) |
| 193 | +- Substrate provider: [#1074](https://github.com/kcenon/network_system/issues/1074) |
| 194 | +- PR #1108 (Round 2, merged b4692fed, 0pp delta) |
| 195 | +- PR #1109 (Round 3, merged d5378644, 0pp delta) |
0 commit comments