Skip to content

Commit e5d95c1

Browse files
kwschulzclaude
andauthored
refactor(capture): decompose capture_device_har, harden pre-flight, v0.6.0 (#36)
* refactor(capture): decompose capture_device_har into testable units, harden pre-flight Decompose the 460-line capture_device_har() god function into four independently testable units: - _resolve_capture_paths(): pure filesystem/parsing (zero mocks to test) - _run_browser_session(): Playwright lifecycle → BrowserSessionResult - _inject_har_metadata(): probe/cookie/storage injection (zero mocks) - _run_post_capture_pipeline(): sanitize/compress/cleanup (zero mocks) Eliminate nonlocal data shuttling via BrowserSessionResult dataclass. capture_device_har() is now a ~60-line orchestrator that delegates. Also includes: - Session contamination check (check_session_contamination) to detect live sessions before capture - Connectivity module: extract _urlopen_with_ssl, require explicit scheme (http:// or https://), reject bare hostnames - Workflow: add session check phase, probe phase improvements - CLI: explicit scheme requirement, --minimal mode refinements - Sanitization: web storage scanning, additional PII patterns - Validation: additional secret detection patterns - Architecture decisions document (ADR-1 through ADR-5) - 20+ new unit tests for extracted functions (zero Playwright mocking) - browser.py coverage 76% → 85% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(release): bump version to 0.6.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(tests): add http:// scheme to integration test URLs Integration tests passed bare IPs (e.g., 127.0.0.1:port) to capture_device_har() and check_device_connectivity(), which now require explicit http:// or https:// schemes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ken Schulz <kwschulz@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fc54218 commit e5d95c1

31 files changed

Lines changed: 3114 additions & 590 deletions

CHANGELOG.md

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.6.0] - 2026-04-06
11+
12+
### Changed
13+
14+
- **Decomposed `capture_device_har()`** — The 460-line god function is now a ~60-line orchestrator that delegates to four independently testable units: `_resolve_capture_paths()` (pure filesystem), `_run_browser_session()` (Playwright lifecycle), `_inject_har_metadata()` (HAR enrichment), `_run_post_capture_pipeline()` (sanitize/compress/cleanup). Eliminates `nonlocal` data shuttling via new `BrowserSessionResult` dataclass. Public API signature unchanged.
15+
- **Explicit scheme required**`har-capture` now requires `http://` or `https://` in the target URL (e.g., `har-capture http://192.168.1.1`). Bare hostnames/IPs are rejected with a helpful error message. This eliminates ambiguity and prevents duplicate connectivity probes.
16+
- **Connectivity module hardened** — Extracted `_urlopen_with_ssl()` shared helper, eliminating 3 copies of the urllib + SSL context pattern across `check_device_connectivity`, `check_basic_auth`, and `check_session_contamination`.
17+
18+
### Added
19+
20+
- **Session contamination check** — New `check_session_contamination()` detects live sessions before capture by inspecting the unauthenticated response for login-page indicators. Prevents captures that skip the auth flow because the device already has a session. Added as Phase 3 in the workflow.
21+
- **Architecture Decisions document** — ADR-1 through ADR-5 covering minimal pre-flight, interactive mode, probe opt-in, explicit scheme, and session contamination guard.
22+
- **`CapturePathInfo` dataclass** — Encapsulates resolved output path, sanitized path, temp file path, hostname, and target URL.
23+
- **`BrowserSessionResult` dataclass** — Encapsulates all browser-captured state (cookies, localStorage, sessionStorage, pre-capture cookie audit).
24+
- **20+ new unit tests** — Tests for extracted functions use zero `@patch` decorators and real temp files. browser.py coverage 76% → 85%.
25+
26+
### Fixed
27+
28+
- **Orphaned temp file on connectivity failure**`_resolve_capture_paths()` creates the temp file before the connectivity check. If connectivity fails, the temp file is now cleaned up before the early return.
29+
1030
## [0.5.1] - 2026-03-30
1131

1232
### Fixed
@@ -443,4 +463,5 @@ har-capture sanitize input.har --patterns custom-allowlist.json
443463
[0.4.5]: https://github.com/solentlabs/har-capture/compare/v0.4.4...v0.4.5
444464
[0.5.0]: https://github.com/solentlabs/har-capture/compare/v0.4.5...v0.5.0
445465
[0.5.1]: https://github.com/solentlabs/har-capture/compare/v0.5.0...v0.5.1
446-
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.5.1...HEAD
466+
[0.6.0]: https://github.com/solentlabs/har-capture/compare/v0.5.1...v0.6.0
467+
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.6.0...HEAD

docs/ARCHITECTURE.md

Lines changed: 62 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -42,44 +42,93 @@ Domain-specific knowledge — what values are safe, what patterns indicate sensi
4242

4343
## Capture Pipeline
4444

45-
### Phase 1: Browser Check
45+
Capture is user-driven: the user launches a browser, interacts with the target site naturally (login, navigate pages), and closes the browser when done. har-capture records everything and sanitizes the result.
4646

47-
Verifies that Playwright is installed and the requested browser engine (chromium, firefox, or webkit) is available. If the browser executable is missing, the CLI prompts the user to download it (~150 MB one-time install). If system dependencies are missing (libasound, libnss3, libnspr4), the error is detected by pattern matching and the user is guided to install them.
47+
### Default Workflow
4848

49-
### Phase 2: Connectivity Check
49+
The default path minimizes pre-flight HTTP requests so the tool works with session-constrained devices (see [ADR-2](ARCHITECTURE_DECISIONS.md#adr-2-minimal-pre-flight-in-interactive-mode)).
5050

51-
Determines whether the target is reachable and which HTTP scheme to use. Tries `http` then `https` (or user-specified scheme only). A 401/403 counts as "reachable." Self-signed certificates are accepted.
51+
```mermaid
52+
graph TD
53+
start([har-capture URL]) --> browser_check{Browser installed?}
54+
browser_check -->|No| install[Prompt install ~150MB]
55+
install --> conn
56+
browser_check -->|Yes| conn
57+
58+
conn[Connectivity Check<br>1 GET → validate reachability]
59+
conn --> session{Session Check<br>1 GET → detect live session}
60+
session -->|Contaminated| abort([ABORT: clear cookies])
61+
session -->|Clean| has_creds{Credentials provided?}
62+
63+
has_creds -->|No| launch[Launch Browser<br>Clean context: empty storage_state]
64+
has_creds -->|Yes| probe[Auth Probe<br>1 GET → capture 401 headers]
65+
probe --> launch
66+
67+
launch --> goto{page.goto networkidle<br>15s timeout}
68+
goto -->|Resolves| user[User interacts<br>Login, navigate, close browser]
69+
goto -->|Timeout| fallback[Auto-fallback to domcontentloaded<br>Disable wait-for-data]
70+
fallback --> user
71+
user --> process
72+
73+
subgraph process[Post-Capture Processing]
74+
direction TB
75+
meta[Inject metadata + pre_capture_cookies] --> sanitize[Pass 1: Auto-sanitize PII]
76+
sanitize --> review[Pass 2: Interactive review]
77+
review --> filter[Filter bloat + deduplicate]
78+
filter --> compress[Gzip compress]
79+
compress --> cleanup[Delete temp files]
80+
end
81+
82+
process --> done([.sanitized.har.gz])
83+
```
5284

53-
### Phase 3: Pre-Capture Probes
85+
**Pre-flight HTTP requests:** 2 (no credentials) or 3 (with `--username`/`--password`).
5486

55-
Three diagnostic probes (auth challenge, HEAD support, ICMP ping) gather metadata embedded in the final HAR as `_probes`.
87+
### Workflow with `--minimal`
88+
89+
For devices that allow only one concurrent session (e.g., Compal CH7465MT), `--minimal` skips probes and auth detection, defers the connectivity check into `capture_device_har()`, and uses a lenient page load strategy.
90+
91+
```mermaid
92+
graph TD
93+
start(["har-capture URL --minimal"]) --> browser_check{Browser installed?}
94+
browser_check -->|No| install[Prompt install]
95+
install --> conn
96+
browser_check -->|Yes| conn
97+
98+
conn["Connectivity Check<br>1 GET inside capture_device_har()"]
99+
conn --> launch["Launch Browser<br>domcontentloaded strategy<br>wait-for-data disabled"]
100+
launch --> user[User interacts<br>Login, navigate, close browser]
101+
user --> process[Post-Capture Processing]
102+
process --> done([.sanitized.har.gz])
103+
```
56104

57-
### Phase 4: Auth Detection
105+
**Pre-flight HTTP requests:** 1 (connectivity check runs inside `capture_device_har()` rather than as a separate CLI phase).
58106

59-
Detects HTTP Basic Auth (401 + `WWW-Authenticate: Basic`) vs in-browser auth (form/HNAP). Basic Auth credentials are passed to Playwright's `http_credentials` context option; in-browser auth requires interactive mode.
107+
### Why Probes Are Not Default
60108

61-
See [Capture Spec](specs/CAPTURE_SPEC.md) for phase internals (connectivity detection, auth detection, probe details).
109+
Pre-capture probes (auth challenge, HEAD support, ICMP) capture metadata that Playwright would otherwise suppress when `http_credentials` is set. In interactive mode without credentials, the browser handles auth dialogs natively and the full HTTP exchange (including 401 responses) is recorded in the HAR. Probes auto-run only when the user provides `--username`/`--password`, which triggers Playwright's `http_credentials` and suppresses the 401. See [ADR-3](ARCHITECTURE_DECISIONS.md#adr-3-probes-are-opt-in-diagnostics).
62110

63-
### Phase 5: Browser Capture
111+
### Browser Capture Detail
64112

65113
The core Playwright session. Key design decisions:
66114

115+
- **Clean context**: `storage_state={"cookies": [], "origins": []}` forces an empty cookie jar — no inherited session cookies or credentials
67116
- **Temp file**: Raw HAR (containing PII) is written to `/tmp` via `mkstemp()`, never to the user's working directory
68117
- **Embedded content**: Response bodies are base64-encoded within the HAR
69118
- **Service worker blocking**: Prevents cached responses from interfering
70119
- **HTTPS tolerance**: Self-signed/expired device certificates accepted
71120

72-
**Wait-for-data**: An init script monkey-patches `XMLHttpRequest.send` and `window.fetch` to track in-flight requests via `window.__harCapturePendingRequests`. After each navigation, the system polls this counter until 2 seconds of network silence (vs Playwright's 500ms `networkidle`). A `framenavigated` event listener ensures async data completes before page transitions.
121+
**Wait-for-data**: An init script monkey-patches `XMLHttpRequest.send` and `window.fetch` to track in-flight requests via `window.__harCapturePendingRequests`. After each navigation, the system polls this counter until 2 seconds of network silence (vs Playwright's 500ms `networkidle`). A `framenavigated` event listener ensures async data completes before page transitions. Disabled in `--minimal` mode for devices with persistent connections.
73122

74123
**State capture**: After navigation, cookies (`context.cookies()`), localStorage (`context.storage_state()`), and sessionStorage (JS evaluation) are captured and injected into the HAR as `_har_capture` metadata.
75124

76125
**Error recovery**: Missing browser executables and system dependencies are detected by pattern matching, fixed automatically (reinstall), and retried once.
77126

78127
See [Capture Spec](specs/CAPTURE_SPEC.md) for full details (context config, wait-for-data mechanism, timing constants, timeout vs interactive mode).
79128

80-
### Phase 6: Post-Capture Processing
129+
### Post-Capture Processing
81130

82-
After the browser closes: metadata injection (probes, cookies, storage, tool version) → sanitization (Pass 1) → bloat filtering + deduplication → gzip compression → temp file cleanup.
131+
After the browser closes: metadata injection (probes, cookies, storage, tool version, `_solentlabs.pre_capture_cookies` audit) → sanitization (Pass 1) → interactive review (Pass 2) → bloat filtering + deduplication → gzip compression → temp file cleanup.
83132

84133
The raw temp file is **always** deleted, ensuring PII doesn't persist on disk. See [Capture Spec](specs/CAPTURE_SPEC.md#post-capture-processing) for the full processing pipeline and file cleanup rules.
85134

docs/ARCHITECTURE_DECISIONS.md

Lines changed: 200 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,200 @@
1+
# Architecture Decisions
2+
3+
Design rationale and "why" behind architectural choices. Each decision
4+
records the context, the choice made, and the reasoning — so future
5+
contributors understand the tradeoffs rather than just the outcome.
6+
7+
## ADR-1: Capture is User-Driven, Not Automated
8+
9+
**Context:** har-capture's primary purpose is to help a user sanitize and
10+
package observed browser traffic so a downstream system (like
11+
cable_modem_monitor) can reverse-engineer device APIs. The user navigates
12+
the device's web interface, logs in, visits pages — the tool records
13+
everything.
14+
15+
**Decision:** The default capture mode is interactive. The user drives the
16+
browser. har-capture records, sanitizes, and packages.
17+
18+
**Consequence:** The tool should not attempt to automate device interaction
19+
(login, navigation) in the default path. Automated/headless mode exists for
20+
CI and advanced users but is not the primary use case.
21+
22+
## ADR-2: Minimal Pre-Flight in Interactive Mode
23+
24+
**Context:** The capture workflow originally ran 5 HTTP requests before
25+
opening Playwright: connectivity check, auth challenge probe, HEAD probe,
26+
ICMP ping, and auth detection. For devices that allow only one concurrent
27+
session (e.g., Compal CH7465MT), these requests exhaust the session slot
28+
before the browser opens.
29+
30+
**Decision:** Interactive mode should make the fewest possible pre-flight
31+
HTTP requests. The browser handles auth dialogs, redirects, and errors
32+
naturally — the user is present to respond.
33+
34+
- **Connectivity check (1 GET):** Retained. Validates the device is reachable
35+
on the user-provided scheme before launching Playwright. The URL must
36+
include an explicit `http://` or `https://` scheme. Without the check,
37+
Playwright would hang silently on unreachable devices.
38+
- **Auth detection:** Not needed in interactive mode. When a device responds
39+
with 401, Playwright shows a native Basic Auth dialog. The user enters
40+
credentials. Both the 401 and the authenticated retry are captured in the
41+
HAR — the full auth exchange is recorded.
42+
- **Probes (auth challenge, HEAD, ICMP):** Not needed in interactive mode.
43+
Probe data (401 headers, Set-Cookie) is captured naturally in the HAR when
44+
the browser navigates. Probes were added because Playwright's
45+
`http_credentials` suppresses the 401 — but interactive mode does not use
46+
`http_credentials`.
47+
48+
**`--minimal` flag:** For the edge case where even the single connectivity
49+
check is problematic, `--minimal` reduces pre-flight further:
50+
skips probes, skips auth detection, uses `domcontentloaded` page load
51+
strategy, disables wait-for-data.
52+
53+
**Headless/automated mode** (`--headless --timeout N`) is the exception: no
54+
human is present, so auth detection and probes are necessary. The user
55+
requesting headless mode implicitly accepts the connection overhead.
56+
57+
**Consequence:** The default interactive capture goes from 5 pre-flight
58+
HTTP requests to 1. Most devices work without any flags.
59+
60+
## ADR-3: Probes Are Opt-In Diagnostics
61+
62+
**Context:** Pre-capture probes capture the device's 401 response,
63+
`WWW-Authenticate` headers, and `Set-Cookie` data. cable_modem_monitor's
64+
intake pipeline uses this to reverse-engineer auth patterns. But the probe
65+
data is also present in the HAR itself (the browser's first request to a
66+
401 endpoint is recorded).
67+
68+
**Decision:** Probes are opt-in via `--diagnostics`. The HAR already
69+
contains the auth exchange from the browser's natural interaction. Probes
70+
add a pre-Playwright snapshot that's useful when `http_credentials`
71+
suppresses the 401, which only happens in automated mode.
72+
73+
**Consequence:** har-capture stays domain-agnostic. Downstream consumers
74+
(CMM intake) request `--diagnostics` when they need probe metadata. Default
75+
users don't pay the connection cost.
76+
77+
## ADR-4: Auto-Fallback for Persistent-Connection Devices
78+
79+
**Context:** `page.goto(url, wait_until="networkidle")` requires 500ms of
80+
zero network activity. Some devices (Compal CH7465MT) keep persistent
81+
polling/heartbeat connections, so `networkidle` never resolves. The
82+
`wait-for-data` mechanism (2s of zero pending XHR/fetch) has the same
83+
problem.
84+
85+
**Decision:** Auto-detect and fall back. The initial `page.goto` uses
86+
`networkidle` with a 15-second timeout. If it times out (the definitive
87+
signal that the device has persistent connections), the system:
88+
89+
1. Falls back to `domcontentloaded` (the page is already loaded — the wait
90+
condition failed, not the navigation)
91+
1. Disables quiescence checks for the rest of the session
92+
1. Logs the fallback so the user knows what happened
93+
94+
This is the same pattern as protocol negotiation — try the better option,
95+
catch the definitive failure, fall back. No heuristics, no guessing.
96+
97+
Normal devices resolve `networkidle` in under 5 seconds. The 15-second
98+
timeout gives headroom for slow devices while catching persistent-connection
99+
devices without excessive wait.
100+
101+
**Consequence:** The user never needs to know about page load strategies.
102+
The tool auto-adapts. `--minimal` remains as an escape hatch for the rare
103+
case where even the auto-fallback's 15-second wait is unacceptable, or
104+
where the connectivity check's single GET exhausts the device's session.
105+
106+
## ADR-5: Domain-Agnostic Core, Domain Knowledge via Data
107+
108+
**Context:** har-capture serves multiple consumers (cable modem monitor,
109+
printer admin panels, IoT hubs, SaaS dashboards). Device-specific knowledge
110+
(safe values, heuristic detectors, HTML scanner config) varies across
111+
domains.
112+
113+
**Decision:** The sanitization engine has no knowledge of any particular
114+
device. Domain knowledge is loaded from JSON pattern files at runtime via
115+
`--patterns`. Core pattern files (`pii.json`, `sensitive.json`,
116+
`allowlist.json`) contain only universal PII rules.
117+
118+
**Consequence:** Adding support for a new product category requires a JSON
119+
file, not code changes. Consumers ship their own pattern files.
120+
121+
## ADR-6: Two-Pass Sanitization Model
122+
123+
**Context:** Automated PII detection has false positives. Aggressive
124+
auto-redaction can destroy debugging utility. Conservative detection misses
125+
real PII.
126+
127+
**Decision:** Pass 1 auto-redacts high-confidence PII (MACs, IPs, emails,
128+
passwords, tokens). Pass 2 presents ambiguous values for interactive review
129+
— the user sees the value, its context, why it was flagged, and decides
130+
whether to redact.
131+
132+
**Consequence:** The tool is safe by default (Pass 1 catches universal PII)
133+
while giving the user control over edge cases. Non-interactive mode
134+
(CI/headless) writes flagged values to a JSON report instead.
135+
136+
## ADR-7: XML POST Bodies Are Sanitized via Two Layers
137+
138+
**Context:** Devices with XML APIs (e.g., Compal CH7465MT) send POST
139+
bodies with `text/xml` or `application/xml` MIME types containing session
140+
tokens, encrypted credentials, and device data.
141+
142+
**Decision:** XML POST body sanitization uses two layers:
143+
144+
1. `_sanitize_xml_fields()` — Parses XML, checks element tag names and
145+
attribute names against sensitive field patterns, redacts matching values.
146+
This mirrors how the JSON and form-urlencoded handlers check field names.
147+
1. `sanitize_html()` — The existing 17-pass scanner runs on the XML text
148+
to catch pattern-based PII (MACs, IPs, emails) that field-name checking
149+
misses.
150+
151+
**Consequence:** Both field-name-based and pattern-based PII are caught.
152+
The HTML scanner already handles XML content (used for `text/xml`
153+
responses), so no new engine is needed. Malformed XML falls through
154+
gracefully.
155+
156+
## ADR-8: Duplicate Connectivity Check Eliminated
157+
158+
**Context:** `capture_device_har()` internally called
159+
`check_device_connectivity()` to determine the URL scheme. But the CLI
160+
workflow already called this in Phase 2. The result was two identical GET
161+
requests before the browser opened.
162+
163+
**Decision:** The CLI now passes the pre-computed `target_url` to
164+
`capture_device_har()`. When provided, the internal connectivity check is
165+
skipped. When called directly (library API without CLI), the check still
166+
runs.
167+
168+
**Consequence:** One fewer pre-flight HTTP request for all capture modes.
169+
Library API backward-compatible (new parameter has a `None` default).
170+
171+
## ADR-9: Session Contamination Guard
172+
173+
**Context:** 6 of 36 catalog HARs in the MCP intake pipeline failed
174+
validation because the browser had an existing session when capture
175+
started. Failure signatures: first request carries `Secure`,
176+
`XSRF_TOKEN`, `PHPSESSID` session cookies, or `Authorization` headers.
177+
The login flow is missing, making the HAR useless for auth analysis.
178+
179+
**Decision:** Three layered defenses, in priority order:
180+
181+
1. **Force clean browser context.** `storage_state={"cookies": [], "origins": []}` is set on every Playwright context. This prevents
182+
all cookie/credential inheritance regardless of how the browser was
183+
launched. This single change prevents all 6 failure signatures.
184+
185+
1. **Pre-flight session check.** Before launching Playwright, an
186+
unauthenticated GET checks whether the device serves data content
187+
(no login page). If so, the device has a live session from another
188+
source (another tab, previous connection from the same IP), and the
189+
workflow aborts with a clear message. Skipped in `--minimal` mode.
190+
191+
1. **Pre-capture cookie audit.** `context.cookies()` is called
192+
immediately after context creation and before any navigation. The
193+
result is emitted as `_solentlabs.pre_capture_cookies` in the HAR.
194+
With the clean storage state, this should always be empty — a
195+
non-empty list is a diagnostic signal for downstream tools.
196+
197+
**Consequence:** The default workflow adds one pre-flight GET (session
198+
check). `--minimal` skips it for session-constrained devices. The
199+
pre-capture cookie audit has zero network cost — it reads local context
200+
state. All three defenses are additive and composable.

0 commit comments

Comments
 (0)