
Commit 7944e9e

kwschulz and claude authored
fix(capture): missing response bodies, serial false positives — v0.6.1 (#38)
* fix(sanitization): serial number false positives, JS variable detection, pipe testability

Two bugs found via TG3442DE captures: (1) SN prefix in JSON keys like SNRLevel matched the serial pattern due to missing word boundary, and (2) JS variable assignments like `var js_SerialNumber = 'value'` were not detected. Fixes both with spec-first approach: document the 100% confidence boundary for scanner passes, then TDD the implementation.

Additionally extracts _sanitize_pipe_value() from the 70-line nested closure into a module-level function for direct unit testing, removes dead code (unreachable version-string guard for private IPs), and adds 37 fixture-driven test cases covering 9 previously untested scanner passes. html.py coverage: 73% → 95%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(capture): patch missing response bodies from eager capture cache

Playwright's HAR recorder fetches bodies lazily at context.close() via CDP Network.getResponseBody. If a navigation evicts the response from Chrome's buffer before the flush, the body is lost. This adds eager body capture via page.on("response") and patches missing bodies into the HAR before sanitization.

Also removes the no-op context.route("**/*") handler that enabled the CDP Fetch domain, which can interfere with Network domain body capture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(release): bump version to 0.6.1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ken Schulz <kwschulz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 132c86c commit 7944e9e

15 files changed

Lines changed: 789 additions & 127 deletions


CHANGELOG.md

Lines changed: 13 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.6.1] - 2026-04-08
11+
12+
### Fixed
13+
14+
- **Missing response body on initial page load** — Playwright's HAR recorder fetches bodies lazily at `context.close()` via CDP `Network.getResponseBody`. If a navigation (e.g., form POST) evicts the response from Chrome's buffer before the flush, the body is lost — headers and sizes are correct but `content.text` is absent. Added eager body capture via `page.on("response")` for text content types and a post-capture `_patch_missing_bodies()` step that fills missing bodies from the cache before sanitization.
15+
- **Serial number false positives** — Reduced false positive serial number detections in sanitization. Improved JS variable name detection and pipe-delimited pattern testability.
16+
17+
### Removed
18+
19+
- **No-op route handler** — Removed `context.route("**/*", lambda route: route.continue_())` which enabled the CDP Fetch domain unnecessarily, potentially interfering with HAR body capture.
20+
1021
## [0.6.0] - 2026-04-06
1122

1223
### Changed
@@ -464,4 +475,5 @@ har-capture sanitize input.har --patterns custom-allowlist.json
464475
[0.5.0]: https://github.com/solentlabs/har-capture/compare/v0.4.5...v0.5.0
465476
[0.5.1]: https://github.com/solentlabs/har-capture/compare/v0.5.0...v0.5.1
466477
[0.6.0]: https://github.com/solentlabs/har-capture/compare/v0.5.1...v0.6.0
467-
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.6.0...HEAD
478+
[0.6.1]: https://github.com/solentlabs/har-capture/compare/v0.6.0...v0.6.1
479+
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.6.1...HEAD

docs/ARCHITECTURE.md

Lines changed: 2 additions & 0 deletions
@@ -175,6 +175,8 @@ The pattern system is how har-capture stays domain-agnostic while supporting dom
175175

176176
This means a consumer like cable_modem_monitor ships its own pattern file and gets domain-tuned sanitization without har-capture carrying any modem-specific code. Consumers can layer multiple `--patterns` arguments for incremental customization.
177177

178+
**Confidence boundary:** Domain `pii.patterns` entries run as Pass 0 auto-redaction — they must have 100% confidence (zero false positives). Domain `heuristics.detectors` entries flag values for interactive review and can tolerate lower confidence. When a domain-specific pattern cannot guarantee zero false positives, it belongs in `heuristics.detectors`, not `pii.patterns`.
179+
178180
See [Pattern Spec](specs/PATTERN_SPEC.md) for file schemas, merge semantics, and the loader/cache architecture.
179181

180182
## Functional Specs

docs/specs/CAPTURE_SPEC.md

Lines changed: 35 additions & 3 deletions
@@ -159,13 +159,14 @@ Controls the `wait_until` argument to Playwright's `page.goto()`. Accepts any Pl
159159

160160
### Internal Decomposition
161161

162-
`capture_device_har()` is the public API — its signature is unchanged. Internally, it delegates to four extracted functions that can each be tested independently:
162+
`capture_device_har()` is the public API — its signature is unchanged. Internally, it delegates to five extracted functions that can each be tested independently:
163163

164164
```
165165
capture_device_har()
166166
├── Pre-flight checks (check_playwright, check_browser_installed)
167167
├── _resolve_capture_paths() → CapturePathInfo
168168
├── _run_browser_session() → BrowserSessionResult
169+
├── _patch_missing_bodies() → patches temp HAR in-place
169170
├── _inject_har_metadata() → modifies temp HAR in-place
170171
└── _run_post_capture_pipeline() → CaptureResult
171172
```
@@ -256,6 +257,7 @@ class BrowserSessionResult:
256257
browser_cookies: list[Any] # Cookies after page load
257258
web_storage_local: list[dict[str, Any]] # localStorage entries per origin
258259
web_storage_session: dict[str, str] # sessionStorage key/value pairs
260+
captured_bodies: dict[str, bytes] # Eagerly captured response bodies
259261
success: bool
260262
error: str | None
261263
```
@@ -397,7 +399,33 @@ This is more robust than Playwright's `networkidle` (500ms) — it waits for `_D
397399

398400
#### `--no-wait-for-data` Behavior
399401

400-
When disabled, no JS injection, no quiescence polling, and no `framenavigated` listener. A context-level route (`context.route("**/*", ...)`) still handles cache control. `page.goto()` with `wait_until="networkidle"` is the only wait mechanism.
402+
When disabled, no JS injection, no quiescence polling, and no `framenavigated` listener. `page.goto()` with `wait_until="networkidle"` is the only wait mechanism. The eager response body capture listener (`page.on("response")`) is always active regardless of this flag.
403+
404+
### Eager Response Body Capture
405+
406+
#### Problem
407+
408+
Playwright's `record_har_content="embed"` captures response bodies lazily — it calls CDP `Network.getResponseBody` when `context.close()` flushes the HAR to disk. If a navigation event causes Chrome to evict the response data from its network buffer before the flush, the body is lost. Headers, sizes, and timing are correct (captured synchronously from Network domain events), but `content.text` is absent and `content.size` is `-1`.
409+
410+
This typically affects the initial page load (e.g., a login form) when the user or JavaScript submits a form quickly, triggering a navigation that supersedes the first response.
411+
412+
#### Solution: Eager Body Capture via Response Listener
413+
414+
Before navigation, a `page.on("response")` listener is registered that eagerly calls `response.body()` for text-based content types (`text/*`, `application/json`, `application/xml`). Bodies are stored in `BrowserSessionResult.captured_bodies` keyed by `"<method>|<url>|<status>"`.
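The listener described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper names `is_text_content_type`, `body_key`, and `make_response_listener` are hypothetical, and the response object is assumed to expose `headers`, `url`, `status`, `request.method`, and `body()` the way Playwright's sync `Response` does.

```python
from typing import Any, Callable


def is_text_content_type(content_type: str) -> bool:
    """Return True for text/*, application/json, and application/xml."""
    media = content_type.split(";", 1)[0].strip().lower()
    return media.startswith("text/") or media in ("application/json", "application/xml")


def body_key(method: str, url: str, status: int) -> str:
    """Build the "<method>|<url>|<status>" cache key named in the spec."""
    return f"{method}|{url}|{status}"


def make_response_listener(cache: dict[str, bytes]) -> Callable[[Any], None]:
    """Create a listener that stores text response bodies in `cache` (hypothetical helper)."""

    def on_response(response: Any) -> None:
        if not is_text_content_type(response.headers.get("content-type", "")):
            return
        try:
            body = response.body()
        except Exception:
            return  # body not retrievable; leave the cache untouched
        cache[body_key(response.request.method, response.url, response.status)] = body

    return on_response


# Assumed wiring, registered before page.goto():
#   captured: dict[str, bytes] = {}
#   page.on("response", make_response_listener(captured))
```

Keeping the content-type filter and key construction as plain functions lets them be unit-tested without a browser.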
415+
416+
After Playwright writes the HAR and before metadata injection, `_patch_missing_bodies()` scans HAR entries for responses that have `bodySize > 0` or `_transferSize > 0` but no `content.text`. For each missing body, it looks up the key in the captured bodies cache and patches the body into the HAR entry.
417+
418+
Text bodies are stored as plain UTF-8 strings. Non-UTF-8 bodies fall back to base64 encoding with `content.encoding = "base64"`.
419+
420+
#### `_patch_missing_bodies(temp_path, captured_bodies) -> int`
421+
422+
- Reads the raw HAR from `temp_path`
423+
- For each entry missing `content.text` with `bodySize > 0` or `_transferSize > 0`: looks up `"<method>|<url>|<status>"` in `captured_bodies` and patches the body
424+
- Writes the patched HAR back to `temp_path`
425+
- Returns the number of entries patched
426+
- Handles corrupt HAR files gracefully (returns 0)
427+
428+
Testable with: zero mocks (real temp file).
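The bullet list above could translate into something like the sketch below. It is an illustration under stated assumptions, not the code from this commit: the function is named `patch_missing_bodies_sketch` to make that clear, `bodySize` and `_transferSize` are assumed to live on each entry's `response` object, and writing the file only when something was patched is a choice made here.

```python
import base64
import json
from pathlib import Path


def patch_missing_bodies_sketch(temp_path: Path, captured_bodies: dict[str, bytes]) -> int:
    """Fill missing response bodies in a HAR file from an eager-capture cache."""
    try:
        har = json.loads(Path(temp_path).read_text(encoding="utf-8"))
        entries = har["log"]["entries"]
    except (OSError, ValueError, KeyError, TypeError):
        return 0  # corrupt or unreadable HAR: patch nothing
    if not isinstance(entries, list):
        return 0

    patched = 0
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        request = entry.get("request") or {}
        response = entry.get("response") or {}
        content = response.get("content") or {}
        if content.get("text"):
            continue  # body already present
        # Assumption: both size fields sit on the response object.
        if response.get("bodySize", 0) <= 0 and response.get("_transferSize", 0) <= 0:
            continue
        key = f"{request.get('method')}|{request.get('url')}|{response.get('status')}"
        body = captured_bodies.get(key)
        if body is None:
            continue
        try:
            content["text"] = body.decode("utf-8")
            content.pop("encoding", None)
        except UnicodeDecodeError:
            # Non-UTF-8 bodies fall back to base64, as the spec describes.
            content["text"] = base64.b64encode(body).decode("ascii")
            content["encoding"] = "base64"
        response["content"] = content
        patched += 1

    if patched:
        Path(temp_path).write_text(json.dumps(har), encoding="utf-8")
    return patched
```

Because the function only touches a file path and a dictionary, a test can drive it with a real temporary file and no mocks, which matches the testability note above.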
401429

402430
### Timeout vs Interactive Mode
403431

@@ -459,9 +487,13 @@ Error messages are sanitized via `_sanitize_error_message()` to remove any embed
459487

460488
## Post-Capture Processing
461489

490+
### Body Patching
491+
492+
Immediately after the browser closes and writes the raw HAR, `_patch_missing_bodies()` scans for entries with missing response bodies and patches them from the eagerly captured body cache. This runs before metadata injection and sanitization so that downstream processing sees complete responses.
493+
462494
### Metadata Injection
463495

464-
After the browser closes and writes the raw HAR:
496+
After body patching:
465497

466498
```python
467499
har_data["log"]["_probes"] = probes_data

docs/specs/PATTERN_SPEC.md

Lines changed: 9 additions & 1 deletion
@@ -306,7 +306,15 @@ Examples for `network-device`: `qam256`, `atdma`, `bpi+`, `honor mdd`, `dhcpclie
306306

307307
### Section: `pii.patterns`
308308

309-
Additional PII detection patterns. Same schema as `pii.json` `patterns` entries.
309+
Additional PII patterns for deterministic auto-redaction (Pass 0 of the HTML scanner). Same schema as `pii.json` `patterns` entries.
310+
311+
**Confidence requirement:** Every pattern runs as auto-redact — the matched value is replaced without user review. Patterns MUST achieve 100% confidence. If a pattern cannot meet this bar, use `heuristics.detectors` instead.
312+
313+
| Criterion | `pii.patterns` | `heuristics.detectors` |
314+
| -------------- | --------------------------- | --------------------------- |
315+
| Confidence | 100% — zero false positives | Lower confidence acceptable |
316+
| Action | Auto-redact (irreversible) | Flag for user review |
317+
| Pipeline stage | Pass 0 (scanner) | Heuristic engine |
310318

311319
### Built-in Domain: `network_device.json`
312320

docs/specs/SANITIZATION_SPEC.md

Lines changed: 28 additions & 24 deletions
@@ -209,30 +209,33 @@ MIME-type based routing:
209209

210210
The engine runs sequential passes over HTML/JavaScript content (numbered 0–16 in the code, with sub-passes like 0b, 2b, 7a/7b, 8b). Each pass uses regex substitution with callback functions that invoke the hasher.
211211

212-
| Pass | Scanner | Pattern | Redaction |
213-
| ---- | ----------------------------- | ----------------------------------------------- | -------------------------------------- |
214-
| 0 | Custom patterns | Domain-specific PII regex | Per-pattern prefix |
215-
| 0b | Web storage | `localStorage.setItem('KEY', 'VALUE')` | Auto-redact if key is sensitive |
216-
| 1 | MAC addresses | `([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}` | `hasher.hash_mac()` |
217-
| 2 | Serial numbers (inline) | `SN\|S/N\|Serial Number` + value | `hasher.hash_value(val, "SERIAL")` |
218-
| 2b | Serial numbers (table) | `<td>Label</td><td>VALUE</td>` | `hasher.hash_value(val, "SERIAL")` |
219-
| 3 | Account/subscriber IDs | `Account\|Subscriber\|Customer\|Device` + value | `hasher.hash_value(val, "ACCOUNT")` |
220-
| 4 | Private IPs | RFC 1918 ranges (preserves gateway IPs) | `hasher.hash_ip(ip, is_private=True)` |
221-
| 5 | Public IPs | Non-private, non-reserved | `hasher.hash_ip(ip, is_private=False)` |
222-
| 6 | IPv6 addresses | Full + compressed, validated via `ipaddress` | `hasher.hash_ipv6()` |
223-
| 7 | Passwords/passphrases | `password=value`, `passphrase=value` | `hasher.hash_value(val, "PASS")` |
224-
| 7a | SSID text labels | SSID labels in HTML text nodes | `hasher.hash_value(val, "WIFI")` |
225-
| 7b | JS password objects | JavaScript object password fields | `hasher.hash_value(val, "PASS")` |
226-
| 8 | Password inputs | `<input type="password" value="...">` | `hasher.hash_value(val, "PASS")` |
227-
| 8b | SSID inputs | SSID-related input fields | `hasher.hash_value(val, "WIFI")` |
228-
| 9 | Session tokens | 20+ char alphanumeric with label prefix | `hasher.hash_value(val, "TOKEN")` |
229-
| 10 | CSRF tokens | CSRF tokens in meta tags | `hasher.hash_value(val, "CSRF")` |
230-
| 11 | Email addresses | `user+tag@sub.domain.co.uk` | `hasher.hash_email()` |
231-
| 12 | Config paths | `.cfg` file references | `hasher.hash_value(val, "CONFIG")` |
232-
| 13 | Vendor JS vars | Motorola `var CurrentPw_24g = '...'` | `hasher.hash_value(val, "PASS")` |
233-
| 14 | Pipe-delimited (tagValueList) | `var name = "val1\|val2\|val3"` | Per-value heuristic analysis |
234-
| 15 | Pipe-delimited (other) | Other pipe-delimited variables | Per-value heuristic analysis |
235-
| 16 | SSID fields in JS | `ssid_24g: 'value'`, `guest_ssid: 'value'` | `hasher.hash_value(val, "WIFI")` |
212+
| Pass | Scanner | Pattern | Redaction |
213+
| ---- | ----------------------------- | --------------------------------------------------- | -------------------------------------- |
214+
| 0 | Custom patterns | Domain-specific PII regex | Per-pattern prefix |
215+
| 0b | Web storage | `localStorage.setItem('KEY', 'VALUE')` | Auto-redact if key is sensitive |
216+
| 1 | MAC addresses | `([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}` | `hasher.hash_mac()` |
217+
| 2 | Serial numbers (inline) | `\bSN\b\|S/N\|Serial Number` + value | `hasher.hash_value(val, "SERIAL")` |
218+
| 2b | Serial numbers (table) | `<td>Label\b</td><td>VALUE</td>` | `hasher.hash_value(val, "SERIAL")` |
219+
| 2c | JS serial variables | Names with serial+Number/Num/No or ending in serial | `hasher.hash_value(val, "SERIAL")` |
220+
| 3 | Account/subscriber IDs | `Account\|Subscriber\|Customer\|Device` + value | `hasher.hash_value(val, "ACCOUNT")` |
221+
| 4 | Private IPs | RFC 1918 ranges (preserves gateway IPs) | `hasher.hash_ip(ip, is_private=True)` |
222+
| 5 | Public IPs | Non-private, non-reserved | `hasher.hash_ip(ip, is_private=False)` |
223+
| 6 | IPv6 addresses | Full + compressed, validated via `ipaddress` | `hasher.hash_ipv6()` |
224+
| 7 | Passwords/passphrases | `password=value`, `passphrase=value` | `hasher.hash_value(val, "PASS")` |
225+
| 7a | SSID text labels | SSID labels in HTML text nodes | `hasher.hash_value(val, "WIFI")` |
226+
| 7b | JS password objects | JavaScript object password fields | `hasher.hash_value(val, "PASS")` |
227+
| 8 | Password inputs | `<input type="password" value="...">` | `hasher.hash_value(val, "PASS")` |
228+
| 8b | SSID inputs | SSID-related input fields | `hasher.hash_value(val, "WIFI")` |
229+
| 9 | Session tokens | 20+ char alphanumeric with label prefix | `hasher.hash_value(val, "TOKEN")` |
230+
| 10 | CSRF tokens | CSRF tokens in meta tags | `hasher.hash_value(val, "CSRF")` |
231+
| 11 | Email addresses | `user+tag@sub.domain.co.uk` | `hasher.hash_email()` |
232+
| 12 | Config paths | `.cfg` file references | `hasher.hash_value(val, "CONFIG")` |
233+
| 13 | Vendor JS vars | Motorola `var CurrentPw_24g = '...'` | `hasher.hash_value(val, "PASS")` |
234+
| 14 | Pipe-delimited (tagValueList) | `var name = "val1\|val2\|val3"` | Per-value heuristic analysis |
235+
| 15 | Pipe-delimited (other) | Other pipe-delimited variables | Per-value heuristic analysis |
236+
| 16 | SSID fields in JS | `ssid_24g: 'value'`, `guest_ssid: 'value'` | `hasher.hash_value(val, "WIFI")` |
237+
238+
**Pass 2c precision rule:** Matches variable names containing the compound `serial` + `number`/`num`/`no` (with optional separator), and names ending with `serial`. Does NOT match `serial` followed by unrelated suffixes (`Protocol`, `Port`, `Baud`, `ization`). Bare `serial` is excluded — too ambiguous for auto-redact.
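One way to read the precision rule is as a name classifier applied to JS variable assignments. The sketch below is an assumption-laden illustration rather than the regex shipped in `html.py`: `SERIAL_NAME`, `JS_ASSIGN`, `redact_serial_vars`, and the `SERIAL_REDACTED` placeholder are all hypothetical, and the real pass hashes the value via `hasher.hash_value(val, "SERIAL")` instead of using a fixed placeholder.

```python
import re

# Hypothetical classifier for variable names (matched with fullmatch).
SERIAL_NAME = re.compile(
    r"""
      \w*serial_?(?:number|num|no)(?![a-z])\w*   # serial + number/num/no, optional "_"
    | \w+serial                                   # ends with "serial" and has a prefix
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Simplified matcher for `var NAME = 'value'` or `var NAME = "value"`.
JS_ASSIGN = re.compile(r"""\bvar\s+(\w+)\s*=\s*(['"])(.*?)\2""")


def redact_serial_vars(js: str, placeholder: str = "SERIAL_REDACTED") -> str:
    """Replace the value of serial-named JS variables, leaving other assignments alone."""

    def _sub(match: re.Match) -> str:
        if not SERIAL_NAME.fullmatch(match.group(1)):
            return match.group(0)
        # Keep everything up to the value, then swap in the placeholder.
        prefix = match.group(0)[: match.start(3) - match.start(0)]
        return f"{prefix}{placeholder}{match.group(2)}"

    return JS_ASSIGN.sub(_sub, js)
```

Under this sketch, `js_SerialNumber` and `deviceSerial` are redacted, while `serialPort`, `serialization`, and bare `serial` are left untouched.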
236239

237240
### Web Storage Scanner (Pass 0b)
238241

@@ -551,3 +554,4 @@ Detects common redaction markers to warn users before double-sanitizing.
551554
1. **Cookie metadata is preserved** — Cookie attributes (HttpOnly, Secure, SameSite, Path, Domain, Expires) are detected and not redacted. Only cookie values are redacted.
552555
1. **Credit card detection requires Luhn** — A 16-digit number is only redacted as a credit card if it passes Luhn checksum validation.
553556
1. **Global find-replace in Pass 2** — User-selected redactions are applied via string replacement on the serialized JSON, ensuring all occurrences (headers, body, URLs) are caught.
557+
1. **Scanner passes require 100% confidence** — Every regex in the HTML scanner pipeline (passes 0–16) auto-redacts without user review. A pattern that produces false positives is a bug. Patterns that cannot achieve 100% confidence belong in the heuristic engine (flagged for user review), not the scanner pipeline.
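The `SNRLevel` false positive named in the commit message shows why a scanner regex without a word boundary falls short of this bar. The snippet below is a small demonstration using only the label alternatives from the Pass 2 row; `LOOSE`, `STRICT`, and `label_hits` are hypothetical names, and the value-capturing part of the real pattern is omitted.

```python
import re

# Label alternatives before and after the fix (value capture omitted).
LOOSE = re.compile(r"SN|S/N|Serial Number")
STRICT = re.compile(r"\bSN\b|S/N|Serial Number")


def label_hits(pattern: re.Pattern, text: str) -> list[str]:
    """Return every label the pattern finds in text."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = '{"SNRLevel": "38.2", "SN": "ABC123"}'

loose_hits = label_hits(LOOSE, sample)    # also fires inside "SNRLevel"
strict_hits = label_hits(STRICT, sample)  # only the standalone "SN" key
```

With the boundary in place, the signal-to-noise key no longer looks like a serial label, so its value is never passed to the hasher.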

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
44

55
[project]
66
name = "har-capture"
7-
version = "0.6.0"
7+
version = "0.6.1"
88
description = "HAR capture and PII sanitization library for network traffic analysis"
99
readme = "README.md"
1010
license = "MIT"

src/har_capture/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
2424

2525
from __future__ import annotations
2626

27-
__version__ = "0.6.0"
27+
__version__ = "0.6.1"
2828

2929
# Re-export public API for convenience
3030
from har_capture.sanitization import (
