fix(capture): missing response bodies, serial false positives — v0.6.1 (#38)
* fix(sanitization): serial number false positives, JS variable detection, pipe testability
Two bugs found via TG3442DE captures: (1) SN prefix in JSON keys like
SNRLevel matched the serial pattern due to missing word boundary, and
(2) JS variable assignments like `var js_SerialNumber = 'value'` were
not detected. Fixes both with a spec-first approach: document the 100%
confidence boundary for scanner passes, then TDD the implementation.
Additionally extracts _sanitize_pipe_value() from the 70-line nested
closure into a module-level function for direct unit testing, removes
dead code (unreachable version-string guard for private IPs), and adds
37 fixture-driven test cases covering 9 previously untested scanner
passes. html.py coverage: 73% → 95%.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(capture): patch missing response bodies from eager capture cache
Playwright's HAR recorder fetches bodies lazily at context.close() via
CDP Network.getResponseBody. If a navigation evicts the response from
Chrome's buffer before the flush, the body is lost. This adds eager
body capture via page.on("response") and patches missing bodies into
the HAR before sanitization.
Also removes the no-op context.route("**/*") handler that enabled the
CDP Fetch domain, which can interfere with Network domain body capture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore(release): bump version to 0.6.1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ken Schulz <kwschulz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CHANGELOG.md (13 additions & 1 deletion)
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.6.1] - 2026-04-08
+
+### Fixed
+
+- **Missing response body on initial page load** — Playwright's HAR recorder fetches bodies lazily at `context.close()` via CDP `Network.getResponseBody`. If a navigation (e.g., form POST) evicts the response from Chrome's buffer before the flush, the body is lost — headers and sizes are correct but `content.text` is absent. Added eager body capture via `page.on("response")` for text content types and a post-capture `_patch_missing_bodies()` step that fills missing bodies from the cache before sanitization.
+- **Serial number false positives** — Reduced false positive serial number detections in sanitization. Improved JS variable name detection and pipe-delimited pattern testability.
+
+### Removed
+
+- **No-op route handler** — Removed `context.route("**/*", lambda route: route.continue_())` which enabled the CDP Fetch domain unnecessarily, potentially interfering with HAR body capture.
docs/ARCHITECTURE.md (2 additions & 0 deletions)
@@ -175,6 +175,8 @@ The pattern system is how har-capture stays domain-agnostic while supporting dom

 This means a consumer like cable_modem_monitor ships its own pattern file and gets domain-tuned sanitization without har-capture carrying any modem-specific code. Consumers can layer multiple `--patterns` arguments for incremental customization.

+**Confidence boundary:** Domain `pii.patterns` entries run as Pass 0 auto-redaction — they must have 100% confidence (zero false positives). Domain `heuristics.detectors` entries flag values for interactive review and can tolerate lower confidence. When a domain-specific pattern cannot guarantee zero false positives, it belongs in `heuristics.detectors`, not `pii.patterns`.
+
 See [Pattern Spec](specs/PATTERN_SPEC.md) for file schemas, merge semantics, and the loader/cache architecture.
docs/specs/CAPTURE_SPEC.md (35 additions & 3 deletions)
@@ -159,13 +159,14 @@ Controls the `wait_until` argument to Playwright's `page.goto()`. Accepts any Pl

 ### Internal Decomposition

-`capture_device_har()` is the public API — its signature is unchanged. Internally, it delegates to four extracted functions that can each be tested independently:
+`capture_device_har()` is the public API — its signature is unchanged. Internally, it delegates to five extracted functions that can each be tested independently:
@@ -397,7 +399,33 @@ This is more robust than Playwright's `networkidle` (500ms) — it waits for `_D

 #### `--no-wait-for-data` Behavior

-When disabled, no JS injection, no quiescence polling, and no `framenavigated` listener. A context-level route (`context.route("**/*", ...)`) still handles cache control. `page.goto()` with `wait_until="networkidle"` is the only wait mechanism.
+When disabled, no JS injection, no quiescence polling, and no `framenavigated` listener. `page.goto()` with `wait_until="networkidle"` is the only wait mechanism. The eager response body capture listener (`page.on("response")`) is always active regardless of this flag.
+
+### Eager Response Body Capture
+
+#### Problem
+
+Playwright's `record_har_content="embed"` captures response bodies lazily — it calls CDP `Network.getResponseBody` when `context.close()` flushes the HAR to disk. If a navigation event causes Chrome to evict the response data from its network buffer before the flush, the body is lost. Headers, sizes, and timing are correct (captured synchronously from Network domain events), but `content.text` is absent and `content.size` is `-1`.
+
+This typically affects the initial page load (e.g., a login form) when the user or JavaScript submits a form quickly, triggering a navigation that supersedes the first response.
+
+#### Solution: Eager Body Capture via Response Listener
+
+Before navigation, a `page.on("response")` listener is registered that eagerly calls `response.body()` for text-based content types (`text/*`, `application/json`, `application/xml`). Bodies are stored in `BrowserSessionResult.captured_bodies` keyed by `"<method>|<url>|<status>"`.
+
+After Playwright writes the HAR and before metadata injection, `_patch_missing_bodies()` scans HAR entries for responses that have `bodySize > 0` or `_transferSize > 0` but no `content.text`. For each missing body, it looks up the key in the captured bodies cache and patches the body into the HAR entry.
+
+Text bodies are stored as plain UTF-8 strings. Non-UTF-8 bodies fall back to base64 encoding with `content.encoding = "base64"`.
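The listener described above might look roughly like the following. This is a minimal sketch, not har-capture's implementation: the helper names `is_text_type`, `body_key`, and `make_body_capture` are hypothetical, the cache is a plain dict standing in for `BrowserSessionResult.captured_bodies`, and the shape of each cache value (`text` plus optional `encoding`) is an assumption. The response object is only expected to expose the attributes Playwright's Python `Response` provides (`url`, `status`, `headers`, `request.method`, `body()`).

```python
import base64
from typing import Any, Callable, Dict

TEXT_TYPES = ("application/json", "application/xml")


def is_text_type(content_type: str) -> bool:
    """True for text/*, application/json, and application/xml."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime in TEXT_TYPES


def body_key(method: str, url: str, status: int) -> str:
    """Build the "<method>|<url>|<status>" cache key."""
    return f"{method}|{url}|{status}"


def make_body_capture(cache: Dict[str, Dict[str, str]]) -> Callable[[Any], None]:
    """Return a handler that stores text bodies in `cache` (hypothetical helper)."""

    def on_response(response: Any) -> None:
        if not is_text_type(response.headers.get("content-type", "")):
            return  # binary content is left to the HAR recorder
        try:
            raw = response.body()  # fetch now, before a navigation can evict it
        except Exception:
            return  # body unavailable; leave the cache untouched
        key = body_key(response.request.method, response.url, response.status)
        try:
            cache[key] = {"text": raw.decode("utf-8")}
        except UnicodeDecodeError:
            # Non-UTF-8 payload: fall back to base64, as the spec text describes.
            cache[key] = {
                "text": base64.b64encode(raw).decode("ascii"),
                "encoding": "base64",
            }

    return on_response
```

With the sync API, such a handler would be registered before navigation, for example `page.on("response", make_body_capture(cache))`. The async API would need an `async def` handler that awaits `response.body()`.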
+- For each entry missing `content.text` with `bodySize > 0` or `_transferSize > 0`: looks up `"<method>|<url>|<status>"` in `captured_bodies` and patches the body
+- Writes the patched HAR back to `temp_path`
+- Returns the number of entries patched
+- Handles corrupt HAR files gracefully (returns 0)
+
+Testable with: zero mocks (real temp file).
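The patching behavior listed above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function name `patch_missing_bodies`, its signature, and the cache value shape (`{"text": ..., "encoding": ...}`) are hypothetical. Only the HAR field names (`bodySize`, `_transferSize`, `content.text`, `content.encoding`) and the key format come from the spec text.

```python
import json
from pathlib import Path
from typing import Dict


def patch_missing_bodies(temp_path: Path, captured: Dict[str, Dict[str, str]]) -> int:
    """Fill absent content.text from the cache; return the number of entries patched."""
    try:
        har = json.loads(temp_path.read_text(encoding="utf-8"))
        entries = har["log"]["entries"]
    except (OSError, ValueError, KeyError, TypeError):
        return 0  # unreadable or corrupt HAR: patch nothing

    patched = 0
    for entry in entries:
        request = entry.get("request", {})
        response = entry.get("response", {})
        content = response.setdefault("content", {})
        has_body = (response.get("bodySize") or 0) > 0 or (
            response.get("_transferSize") or 0
        ) > 0
        if "text" in content or not has_body:
            continue  # body already present, or nothing was transferred
        key = f"{request.get('method')}|{request.get('url')}|{response.get('status')}"
        cached = captured.get(key)
        if cached is None:
            continue  # no eager capture for this response
        content["text"] = cached["text"]
        if "encoding" in cached:
            content["encoding"] = cached["encoding"]
        patched += 1

    if patched:  # this sketch rewrites the file only when something changed
        temp_path.write_text(json.dumps(har), encoding="utf-8")
    return patched
```

Because the function only touches a file path and a dict, it can be exercised against a real temporary file without mocking the browser.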

 ### Timeout vs Interactive Mode

@@ -459,9 +487,13 @@ Error messages are sanitized via `_sanitize_error_message()` to remove any embed

 ## Post-Capture Processing

+### Body Patching
+
+Immediately after the browser closes and writes the raw HAR, `_patch_missing_bodies()` scans for entries with missing response bodies and patches them from the eagerly captured body cache. This runs before metadata injection and sanitization so that downstream processing sees complete responses.
-Additional PII detection patterns. Same schema as `pii.json` `patterns` entries.
+Additional PII patterns for deterministic auto-redaction (Pass 0 of the HTML scanner). Same schema as `pii.json` `patterns` entries.
+
+**Confidence requirement:** Every pattern runs as auto-redact — the matched value is replaced without user review. Patterns MUST achieve 100% confidence. If a pattern cannot meet this bar, use `heuristics.detectors` instead.
docs/specs/SANITIZATION_SPEC.md (28 additions & 24 deletions)
@@ -209,30 +209,33 @@ MIME-type based routing:

 The engine runs sequential passes over HTML/JavaScript content (numbered 0–16 in the code, with sub-passes like 0b, 2b, 7a/7b, 8b). Each pass uses regex substitution with callback functions that invoke the hasher.
**Pass 2c precision rule:** Matches variable names containing the compound `serial` + `number`/`num`/`no` (with optional separator), and names ending with `serial`. Does NOT match `serial` followed by unrelated suffixes (`Protocol`, `Port`, `Baud`, `ization`). Bare `serial` is excluded — too ambiguous for auto-redact.
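One way to read the precision rule is as a classifier over variable names. The sketch below is illustrative only: `is_serial_name` and the tokenizer are hypothetical and do not reproduce the project's regex. It splits a name into camelCase and separator-delimited words, then accepts the `serial` + `number`/`num`/`no` compound or a trailing `serial`, while rejecting bare `serial` and unrelated suffixes.

```python
import re

# Split identifiers into words: acronym runs, Capitalized/lowercase words, digit runs.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

SUFFIXES = {"number", "num", "no"}
JOINED = {"serial" + suffix for suffix in SUFFIXES}  # e.g. "serialnumber"


def is_serial_name(name: str) -> bool:
    """Hypothetical check mirroring the Pass 2c rule for variable names."""
    tokens = [t.lower() for t in TOKEN.findall(name)]
    if not tokens:
        return False
    # Compound written as a single lowercase word, e.g. "serialnumber".
    if any(t in JOINED for t in tokens):
        return True
    # Compound split by case or separator, e.g. "SerialNumber", "serial_no".
    if any(a == "serial" and b in SUFFIXES for a, b in zip(tokens, tokens[1:])):
        return True
    # Name ending with "serial", but never the bare word on its own.
    last = tokens[-1]
    return last.endswith("serial") and (len(tokens) > 1 or last != "serial")
```

Under this sketch, names such as `js_SerialNumber` and `deviceSerial` are accepted, while `serialProtocol`, `serialization`, `SNRLevel`, and bare `serial` are rejected.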

 ### Web Storage Scanner (Pass 0b)

@@ -551,3 +554,4 @@ Detects common redaction markers to warn users before double-sanitizing.
 1. **Cookie metadata is preserved** — Cookie attributes (HttpOnly, Secure, SameSite, Path, Domain, Expires) are detected and not redacted. Only cookie values are redacted.
 1. **Credit card detection requires Luhn** — A 16-digit number is only redacted as a credit card if it passes Luhn checksum validation.
 1. **Global find-replace in Pass 2** — User-selected redactions are applied via string replacement on the serialized JSON, ensuring all occurrences (headers, body, URLs) are caught.
+1. **Scanner passes require 100% confidence** — Every regex in the HTML scanner pipeline (passes 0–16) auto-redacts without user review. A pattern that produces false positives is a bug. Patterns that cannot achieve 100% confidence belong in the heuristic engine (flagged for user review), not the scanner pipeline.
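For the Luhn requirement in the list above, a minimal sketch of the standard checksum is shown here. The function name `luhn_valid` and the choice to ignore spaces and dashes are assumptions for illustration. The spec only states that a 16-digit number must pass Luhn validation before being treated as a credit card.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if `candidate` holds exactly 16 digits that pass the Luhn check."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 16:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

Gating on the checksum keeps arbitrary 16-digit values such as counters or timestamps from being auto-redacted, which fits the 100% confidence requirement for scanner passes.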