Skip to content

Commit 8171e00

Browse files
authored
docs(coverage): diagnose http2_client 0pp coverage delta after Round 1+2+3 (#1111)
1 parent d537864 commit 8171e00

1 file changed

Lines changed: 195 additions & 0 deletions

File tree

Lines changed: 195 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
# HTTP/2 Client Coverage Round 4 Diagnosis
2+
3+
> **Issue**: [#1110](https://github.com/kcenon/network_system/issues/1110)
4+
> **Parent**: [#1106](https://github.com/kcenon/network_system/issues/1106)
5+
> **Epic**: [#953](https://github.com/kcenon/network_system/issues/953)
6+
> **Reference run**: [25464342500](https://github.com/kcenon/network_system/actions/runs/25464342500) at sha `d5378644`
7+
8+
## Question
9+
10+
Why did Round 1 (PR #994 / #1062), Round 2 (PR #1108 — 6 TEST_F), and Round 3
11+
(PR #1109 — 7 TEST_F) all produce **0pp delta** on per-file coverage of
12+
`src/protocols/http2/http2_client.cpp` (still 18.75% line / 9.91% branch,
13+
108/576 LH/LF, 97/979 BRH/BRF) despite landing 13 substrate-driven TEST_F?
14+
15+
## Answer (one line)
16+
17+
The 7 Round-3 dispatcher TEST_F **did execute** under coverage instrumentation
18+
but **all 7 FAILED** at the SETTINGS-handshake `wait_for` gate, never reaching
19+
the dispatcher branches they were designed to cover. The same coverage-only
20+
timing trap that motivated the `NETWORK_COVERAGE_BUILD` guard for the two
21+
Round-2 positive-path tests applies to the entire `Http2ClientHermeticTransport`
22+
fixture, not just the two tests originally guarded.
23+
24+
## Evidence
25+
26+
Direct grep of the `Run tests` step stdout in run [25464342500] shows every
27+
Round-3 dispatcher test on the `(Failed)` summary list:
28+
29+
| ctest # | TEST_F | Outcome under coverage |
30+
|---------|-----------------------------------------------------------------|------------------------|
31+
| 5886 | `ServerPingFrameDrivesHandlePingAndKeepsConnectionAlive` | FAILED (5.26 s) |
32+
| 5887 | `ServerPingAckFrameIsAbsorbedSilently` | FAILED (5.30 s) |
33+
| 5888 | `ServerGoawayFrameFlipsConnectionStateToDisconnected` | FAILED (5.35 s) |
34+
| 5889 | `ServerWindowUpdateOnConnectionStreamExpandsWindow` | FAILED (5.38 s) |
35+
| 5890 | `ServerWindowUpdateOnUnknownStreamIsSilentlyIgnored` | FAILED (5.34 s) |
36+
| 5891 | `ServerRstStreamOnUnknownStreamIsSilentlyIgnored` | FAILED (5.28 s) |
37+
| 5892 | `ServerUnknownFrameTypeIsHandledWithoutCrashing` | FAILED (5.12 s) |
38+
39+
The baseline fixture sentinel test (#5868, `ConnectCompletesSettingsExchangeWithMockPeer`)
40+
also FAILED in the same coverage run, along with most other
41+
`Http2ClientHermeticTransportTest.*` cases (#5870#5880). This is not a
42+
new-test regression — the entire hermetic-transport fixture cannot complete
43+
the SETTINGS handshake within the 3-second `wait_for` budget under
44+
`-fprofile-arcs -ftest-coverage` instrumentation on the GitHub Actions
45+
ubuntu-latest runner.
46+
47+
## Why ctest reported PASS for the Coverage Analysis workflow
48+
49+
The `Run tests` step uses `ctest` invoked with permissive flags
50+
(no `--output-on-failure` strict gate), and the workflow defines the
51+
"Coverage Analysis" job as PASS as long as the workflow steps that produce
52+
`coverage.info` succeed. Per-test failure in the test executor does not fail
53+
the workflow — only an `lcov` capture failure does. So the workflow shows
54+
green checkmarks even though 7 of the 7 new tests (and ~30 other
55+
hermetic-transport tests) produced no coverage data because they aborted at
56+
the handshake.
57+
58+
## Why hypotheses 1 and 3 are ruled out
59+
60+
**Hypothesis 1 (CMake `NETWORK_COVERAGE_BUILD` over-scoping)** — refuted by
61+
direct inspection of `tests/unit/http2_client_branch_test.cpp`:
62+
63+
- The `#ifndef NETWORK_COVERAGE_BUILD` guard wraps lines 1257–1328 only, which
64+
contain exactly the two flaky positive-path tests that motivated the guard
65+
(`SlowWriteServerSettingsStillCompletesHandshake` and
66+
`EchoOneTruncateAtNineDropsResponsePayloadFailingGet`).
67+
- The 7 Round-3 dispatcher TEST_F are at lines 1361–1580, **outside** the
68+
guard, so they are compiled into the coverage binary.
69+
- The grep above confirms all 7 are scheduled and run by ctest under coverage
70+
(their names appear in the per-test `[ RUN ]` markers and in the
71+
`Failed` summary).
72+
73+
The macro is correctly scoped. The dispatcher tests reach the binary; they
74+
just cannot complete the handshake under instrumentation.
75+
76+
**Hypothesis 3 (lcov filter dropping a second production SF)** — refuted by
77+
the SF-record cross-check already in the #1106 Round-3 post-merge comment:
78+
`coverage_filtered.info` exposes 1 SF for `http2_client.cpp`, `coverage.info`
79+
exposes 2 (the second is `tests/test_http2_client.cpp`), and the `--remove`
80+
rule strips only the test file. The 18.75% / 9.91% numbers are the true
81+
production-source measurement, no second production compile unit is being
82+
filtered out.
83+
84+
## Hypothesis 2 (overlap with baseline) does not apply
85+
86+
The dispatcher tests don't overlap with the baseline `ConnectCompletes…`
87+
test because the baseline test ALSO failed under coverage instrumentation
88+
in run 25464342500. Even if the baseline did pass and overlap, the new tests
89+
were designed to drive `handle_ping_frame`, `handle_goaway_frame`,
90+
`handle_window_update_frame`, `handle_rst_stream_frame`, and the
91+
`process_frame` default branch — none of which the baseline reaches because
92+
the baseline never receives a server-originated post-handshake frame.
93+
94+
The actual root cause is upstream of the dispatch: the SETTINGS handshake
95+
itself is the failure point.
96+
97+
## Why the handshake fails specifically under coverage instrumentation
98+
99+
Three compounding factors:
100+
101+
1. **gcov instrumentation overhead.** `-fprofile-arcs -ftest-coverage`
102+
inserts a counter increment at every basic block. For the SSL handshake
103+
path through OpenSSL plus the ASIO async I/O state machine plus
104+
`mock_h2_server_peer`'s framing logic, this cumulatively pushes the
105+
end-to-end SETTINGS exchange past the 3-second budget that
106+
`wait_for(predicate, std::chrono::seconds(3))` enforces in
107+
`tests/support/hermetic_transport_fixture.h`.
108+
2. **Coroutine-based async I/O does not yield CPU during gcov flush.**
109+
`mock_h2_server_peer` runs on a separate thread that performs a blocking
110+
`asio::read` for each frame header; under coverage, the per-block
111+
counter increment in the read path slows the ack-emission rate.
112+
3. **Polling cadence is 5ms.** `wait_for` polls every 5ms, but each poll
113+
itself triggers gcov increments in the predicate (`peer.settings_exchanged()`
114+
reads an `std::atomic<bool>` inside an instrumented function). Predicate
115+
evaluation cost scales with the rate.
116+
117+
Wall-clock evidence: the failed dispatcher tests took 5.1–5.4 s each under
118+
coverage, which is more than 1.7× the 3-second budget. On Debug/Release
119+
matrix builds (no instrumentation), the same fixture's tests typically
120+
complete in 100–300 ms.
121+
122+
## Recommended fix (proposed; out of scope for this diagnostic issue)
123+
124+
A new sub-issue `test(http2): widen NETWORK_COVERAGE_BUILD guard around the
125+
full dispatcher TEST_F block` should land before any Round 5 work. Two
126+
options exist; both are non-invasive (test-only):
127+
128+
### Option A — extend the existing macro guard
129+
130+
Wrap lines 1361–1580 (the 7 dispatcher TEST_F) in
131+
`#ifndef NETWORK_COVERAGE_BUILD` so they are excluded from coverage builds
132+
the same way the two Round-2 flaky tests are. Pro: zero new infrastructure,
133+
keeps the tests in Debug/Release matrix builds where they already pass.
134+
Con: contributes 0pp to coverage, so #1106 acceptance criteria still need
135+
a different mechanism.
136+
137+
### Option B — generous coverage-only timeout multiplier
138+
139+
Add a `kCoverageWaitMultiplier` constant in `hermetic_transport_fixture.h`
140+
that multiplies the `wait_for` timeout by 5× when `NETWORK_COVERAGE_BUILD`
141+
is defined (so the 3-second budget becomes 15 s). Increase the default
142+
timeout for `wait_for(...)` in `make_connected_client` similarly. Pro:
143+
allows the dispatcher tests to actually complete under instrumentation
144+
and contribute to the line/branch numbers. Con: lengthens the coverage
145+
workflow runtime; the 7 tests may take 50+ s combined.
146+
147+
### Option C — measure dispatcher coverage outside the hermetic fixture
148+
149+
The dispatcher branches in `http2_client.cpp` (`handle_ping_frame`,
150+
`handle_goaway_frame`, …) can also be exercised by a unit test that
151+
constructs an `http2_client` directly, calls a friend-injected
152+
`process_frame(frame_bytes)` method, and observes the resulting state.
153+
This bypasses the SETTINGS handshake entirely and avoids the timing trap
154+
at the cost of needing a friend-class hook. Pro: deterministic, fast even
155+
under coverage. Con: requires a private-access seam in production code or
156+
a friend declaration in a coverage-only header.
157+
158+
**Recommendation**: Option B is the smallest change. Option C is the
159+
right long-term answer for testing the dispatcher branches in isolation.
160+
Option A unblocks the workflow noise but does not help #1106 close.
161+
162+
## Acceptance criteria status (this issue)
163+
164+
- [x] Confirm whether the 7 Round-3 dispatcher TEST_F actually executed
165+
under coverage instrumentation in run [25464342500].
166+
**Answer**: All 7 are scheduled by ctest, all 7 abort with FAILED at
167+
the SETTINGS-handshake `wait_for` gate (5.1–5.4 s each).
168+
- [x] If they did NOT execute, identify the CMake/macro mechanism that
169+
excluded them. **Answer**: They DO execute. The macro is correctly
170+
scoped to the two original flaky tests only.
171+
- [x] If they DID execute but produced 0pp, document why.
172+
**Answer**: The fixture-wide handshake timing trap drops them at the
173+
first `EXPECT_TRUE(wait_for(... settings_exchanged() ..., 3 s))`
174+
assertion before any dispatcher branch runs.
175+
- [x] Verify the lcov filter pipeline does not silently drop a second
176+
production-source SF for `http2_client.cpp`. **Answer**: SF-record
177+
cross-check (already in #1106 Round-3 post-merge comment) confirms
178+
only the test file is filtered; the production source SF is intact.
179+
- [x] Output: a one-page diagnostic note. **Answer**: This file.
180+
181+
## Next action
182+
183+
Open Round 5 sub-issue scoped to **Option B (coverage-only timeout
184+
multiplier)** as the smallest change that lets #1106 actually progress.
185+
Round 5 acceptance criteria should require a post-fix coverage delta of
186+
at least +20pp line and +10pp branch on `http2_client.cpp`, not merely
187+
"tests pass under coverage" — to break the 0pp pattern definitively.
188+
189+
## Related
190+
191+
- Part of [#953](https://github.com/kcenon/network_system/issues/953)
192+
- Part of [#1106](https://github.com/kcenon/network_system/issues/1106)
193+
- Substrate provider: [#1074](https://github.com/kcenon/network_system/issues/1074)
194+
- PR #1108 (Round 2, merged b4692fed, 0pp delta)
195+
- PR #1109 (Round 3, merged d5378644, 0pp delta)

0 commit comments

Comments
 (0)