solentlabs
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 22 additions & 1 deletion
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
Lines changed: 62 additions & 13 deletions
diff --git a/docs/ARCHITECTURE_DECISIONS.md b/docs/ARCHITECTURE_DECISIONS.md
Lines changed: 200 additions & 0 deletions
@@ -7,6 +7,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.6.0] - 2026-04-06
+
+### Changed
+
+- **Decomposed `capture_device_har()`** — The 460-line god function is now a ~60-line orchestrator that delegates to four independently testable units: `_resolve_capture_paths()` (pure filesystem), `_run_browser_session()` (Playwright lifecycle), `_inject_har_metadata()` (HAR enrichment), `_run_post_capture_pipeline()` (sanitize/compress/cleanup). Eliminates `nonlocal` data shuttling via new `BrowserSessionResult` dataclass. Public API signature unchanged.
+- **Explicit scheme required** — `har-capture` now requires `http://` or `https://` in the target URL (e.g., `har-capture http://192.168.1.1`). Bare hostnames/IPs are rejected with a helpful error message. This eliminates ambiguity and prevents duplicate connectivity probes.
+- **Connectivity module hardened** — Extracted `_urlopen_with_ssl()` shared helper, eliminating 3 copies of the urllib + SSL context pattern across `check_device_connectivity`, `check_basic_auth`, and `check_session_contamination`.
+
+### Added
+
+- **Session contamination check** — New `check_session_contamination()` detects live sessions before capture by inspecting the unauthenticated response for login-page indicators. Prevents captures that skip the auth flow because the device already has a session. Added as Phase 3 in the workflow.
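+- **Architecture Decisions document** — ADR-1 through ADR-9 covering user-driven capture, minimal pre-flight, opt-in probes, auto-fallback for persistent-connection devices, domain-agnostic core, two-pass sanitization, XML POST body sanitization, duplicate connectivity check removal, and session contamination guard.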
+- **`CapturePathInfo` dataclass** — Encapsulates resolved output path, sanitized path, temp file path, hostname, and target URL.
+- **`BrowserSessionResult` dataclass** — Encapsulates all browser-captured state (cookies, localStorage, sessionStorage, pre-capture cookie audit).
+- **20+ new unit tests** — Tests for extracted functions use zero `@patch` decorators and real temp files. `browser.py` coverage rose from 76% to 85%.
+
+### Fixed
+
+- **Orphaned temp file on connectivity failure** — `_resolve_capture_paths()` creates the temp file before the connectivity check. If connectivity fails, the temp file is now cleaned up before the early return.
+
 ## [0.5.1] - 2026-03-30
 
 ### Fixed
@@ -443,4 +463,5 @@ har-capture sanitize input.har --patterns custom-allowlist.json
 [0.4.5]: https://github.com/solentlabs/har-capture/compare/v0.4.4...v0.4.5
 [0.5.0]: https://github.com/solentlabs/har-capture/compare/v0.4.5...v0.5.0
 [0.5.1]: https://github.com/solentlabs/har-capture/compare/v0.5.0...v0.5.1
-[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.5.1...HEAD
+[0.6.0]: https://github.com/solentlabs/har-capture/compare/v0.5.1...v0.6.0
+[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.6.0...HEAD
@@ -42,44 +42,93 @@ Domain-specific knowledge — what values are safe, what patterns indicate sensi
 
 ## Capture Pipeline
 
-### Phase 1: Browser Check
+Capture is user-driven: the user launches a browser, interacts with the target site naturally (login, navigate pages), and closes the browser when done. har-capture records everything and sanitizes the result.
 
-Verifies that Playwright is installed and the requested browser engine (chromium, firefox, or webkit) is available. If the browser executable is missing, the CLI prompts the user to download it (~150 MB one-time install). If system dependencies are missing (libasound, libnss3, libnspr4), the error is detected by pattern matching and the user is guided to install them.
+### Default Workflow
 
-### Phase 2: Connectivity Check
+The default path minimizes pre-flight HTTP requests so the tool works with session-constrained devices (see [ADR-2](ARCHITECTURE_DECISIONS.md#adr-2-minimal-pre-flight-in-interactive-mode)).
 
-Determines whether the target is reachable and which HTTP scheme to use. Tries `http` then `https` (or user-specified scheme only). A 401/403 counts as "reachable." Self-signed certificates are accepted.
+```mermaid
+graph TD
+    start([har-capture URL]) --> browser_check{Browser installed?}
+    browser_check -->|No| install[Prompt install ~150MB]
+    install --> conn
+    browser_check -->|Yes| conn
+
+    conn[Connectivity Check<br>1 GET → validate reachability]
+    conn --> session{Session Check<br>1 GET → detect live session}
+    session -->|Contaminated| abort([ABORT: clear cookies])
+    session -->|Clean| has_creds{Credentials provided?}
+
+    has_creds -->|No| launch[Launch Browser<br>Clean context: empty storage_state]
+    has_creds -->|Yes| probe[Auth Probe<br>1 GET → capture 401 headers]
+    probe --> launch
+
+    launch --> goto{page.goto networkidle<br>15s timeout}
+    goto -->|Resolves| user[User interacts<br>Login, navigate, close browser]
+    goto -->|Timeout| fallback[Auto-fallback to domcontentloaded<br>Disable wait-for-data]
+    fallback --> user
+    user --> process
+
+    subgraph process[Post-Capture Processing]
+        direction TB
+        meta[Inject metadata + pre_capture_cookies] --> sanitize[Pass 1: Auto-sanitize PII]
+        sanitize --> review[Pass 2: Interactive review]
+        review --> filter[Filter bloat + deduplicate]
+        filter --> compress[Gzip compress]
+        compress --> cleanup[Delete temp files]
+    end
+
+    process --> done([.sanitized.har.gz])
+```
 
-### Phase 3: Pre-Capture Probes
+**Pre-flight HTTP requests:** 2 (no credentials) or 3 (with `--username`/`--password`).
 
-Three diagnostic probes (auth challenge, HEAD support, ICMP ping) gather metadata embedded in the final HAR as `_probes`.
+### Workflow with `--minimal`
+
+For devices that allow only one concurrent session (e.g., Compal CH7465MT), `--minimal` skips probes and auth detection, defers the connectivity check into `capture_device_har()`, and uses a lenient page load strategy.
+
+```mermaid
+graph TD
+    start(["har-capture URL --minimal"]) --> browser_check{Browser installed?}
+    browser_check -->|No| install[Prompt install]
+    install --> conn
+    browser_check -->|Yes| conn
+
+    conn["Connectivity Check<br>1 GET inside capture_device_har()"]
+    conn --> launch["Launch Browser<br>domcontentloaded strategy<br>wait-for-data disabled"]
+    launch --> user[User interacts<br>Login, navigate, close browser]
+    user --> process[Post-Capture Processing]
+    process --> done([.sanitized.har.gz])
+```
 
-### Phase 4: Auth Detection
+**Pre-flight HTTP requests:** 1 (connectivity check runs inside `capture_device_har()` rather than as a separate CLI phase).
 
-Detects HTTP Basic Auth (401 + `WWW-Authenticate: Basic`) vs in-browser auth (form/HNAP). Basic Auth credentials are passed to Playwright's `http_credentials` context option; in-browser auth requires interactive mode.
+### Why Probes Are Not Default
 
-See [Capture Spec](specs/CAPTURE_SPEC.md) for phase internals (connectivity detection, auth detection, probe details).
+Pre-capture probes (auth challenge, HEAD support, ICMP) capture metadata that Playwright would otherwise suppress when `http_credentials` is set. In interactive mode without credentials, the browser handles auth dialogs natively and the full HTTP exchange (including 401 responses) is recorded in the HAR. Probes auto-run only when the user provides `--username`/`--password`, which triggers Playwright's `http_credentials` and suppresses the 401. See [ADR-3](ARCHITECTURE_DECISIONS.md#adr-3-probes-are-opt-in-diagnostics).
 
-### Phase 5: Browser Capture
+### Browser Capture Detail
 
 The core Playwright session. Key design decisions:
 
+- **Clean context**: `storage_state={"cookies": [], "origins": []}` forces an empty cookie jar — no inherited session cookies or credentials
 - **Temp file**: Raw HAR (containing PII) is written to `/tmp` via `mkstemp()`, never to the user's working directory
 - **Embedded content**: Response bodies are base64-encoded within the HAR
 - **Service worker blocking**: Prevents cached responses from interfering
 - **HTTPS tolerance**: Self-signed/expired device certificates accepted
 
-**Wait-for-data**: An init script monkey-patches `XMLHttpRequest.send` and `window.fetch` to track in-flight requests via `window.__harCapturePendingRequests`. After each navigation, the system polls this counter until 2 seconds of network silence (vs Playwright's 500ms `networkidle`). A `framenavigated` event listener ensures async data completes before page transitions.
+**Wait-for-data**: An init script monkey-patches `XMLHttpRequest.send` and `window.fetch` to track in-flight requests via `window.__harCapturePendingRequests`. After each navigation, the system polls this counter until 2 seconds of network silence (vs Playwright's 500ms `networkidle`). A `framenavigated` event listener ensures async data completes before page transitions. Disabled in `--minimal` mode for devices with persistent connections.
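The quiescence rule described above (poll the pending-request counter until it has stayed at zero for 2 seconds) can be sketched in Python. This is an illustration under stated assumptions, not the project's implementation: `wait_for_quiet`, `read_pending`, the `max_wait` cap, and the injectable `clock`/`sleep` callables are hypothetical names. In a real session `read_pending` would evaluate `window.__harCapturePendingRequests` in the page.

```python
import time
from typing import Callable, Optional


def wait_for_quiet(
    read_pending: Callable[[], int],
    quiet_seconds: float = 2.0,
    max_wait: float = 30.0,  # hypothetical upper bound, not from the source
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once no requests were pending for `quiet_seconds`.

    Returns False if `max_wait` elapses first, for example on a device
    that polls continuously and never goes quiet.
    """
    start = clock()
    quiet_since: Optional[float] = None  # when the counter last reached zero
    while True:
        now = clock()
        if read_pending() == 0:
            if quiet_since is None:
                quiet_since = now
            if now - quiet_since >= quiet_seconds:
                return True
        else:
            quiet_since = None  # any activity resets the silence window
        if now - start >= max_wait:
            return False
        sleep(poll_interval)
```

Injecting the clock and sleep functions keeps the loop testable without real delays; any in-flight request restarts the 2-second window rather than merely pausing it.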
 
 **State capture**: After navigation, cookies (`context.cookies()`), localStorage (`context.storage_state()`), and sessionStorage (JS evaluation) are captured and injected into the HAR as `_har_capture` metadata.
 
 **Error recovery**: Missing browser executables and system dependencies are detected by pattern matching, fixed automatically (reinstall), and retried once.
 
 See [Capture Spec](specs/CAPTURE_SPEC.md) for full details (context config, wait-for-data mechanism, timing constants, timeout vs interactive mode).
 
-### Phase 6: Post-Capture Processing
+### Post-Capture Processing
 
-After the browser closes: metadata injection (probes, cookies, storage, tool version) → sanitization (Pass 1) → bloat filtering + deduplication → gzip compression → temp file cleanup.
+After the browser closes: metadata injection (probes, cookies, storage, tool version, `_solentlabs.pre_capture_cookies` audit) → sanitization (Pass 1) → interactive review (Pass 2) → bloat filtering + deduplication → gzip compression → temp file cleanup.
 
 The raw temp file is **always** deleted, ensuring PII doesn't persist on disk. See [Capture Spec](specs/CAPTURE_SPEC.md#post-capture-processing) for the full processing pipeline and file cleanup rules.
 
 
@@ -0,0 +1,200 @@
+# Architecture Decisions
+
+Design rationale and "why" behind architectural choices. Each decision
+records the context, the choice made, and the reasoning — so future
+contributors understand the tradeoffs rather than just the outcome.
+
+## ADR-1: Capture is User-Driven, Not Automated
+
+**Context:** har-capture's primary purpose is to help a user sanitize and
+package observed browser traffic so a downstream system (like
+cable_modem_monitor) can reverse-engineer device APIs. The user navigates
+the device's web interface, logs in, visits pages — the tool records
+everything.
+
+**Decision:** The default capture mode is interactive. The user drives the
+browser. har-capture records, sanitizes, and packages.
+
+**Consequence:** The tool should not attempt to automate device interaction
+(login, navigation) in the default path. Automated/headless mode exists for
+CI and advanced users but is not the primary use case.
+
+## ADR-2: Minimal Pre-Flight in Interactive Mode
+
+**Context:** The capture workflow originally made 5 network requests
+before opening Playwright: connectivity check, auth challenge probe, HEAD
+probe, ICMP ping, and auth detection (four HTTP requests plus one ICMP
+ping). For devices that allow only one concurrent session (e.g., Compal
+CH7465MT), these requests exhaust the session slot before the browser
+opens.
+
+**Decision:** Interactive mode should make the fewest possible pre-flight
+HTTP requests. The browser handles auth dialogs, redirects, and errors
+naturally — the user is present to respond.
+
+- **Connectivity check (1 GET):** Retained. Validates the device is reachable
+  on the user-provided scheme before launching Playwright. The URL must
+  include an explicit `http://` or `https://` scheme. Without the check,
+  Playwright would hang silently on unreachable devices.
+- **Auth detection:** Not needed in interactive mode. When a device responds
+  with 401, Playwright shows a native Basic Auth dialog. The user enters
+  credentials. Both the 401 and the authenticated retry are captured in the
+  HAR — the full auth exchange is recorded.
+- **Probes (auth challenge, HEAD, ICMP):** Not needed in interactive mode.
+  Probe data (401 headers, Set-Cookie) is captured naturally in the HAR when
+  the browser navigates. Probes were added because Playwright's
+  `http_credentials` suppresses the 401 — but interactive mode does not use
+  `http_credentials`.
+
+**`--minimal` flag:** For the edge case where even the single connectivity
+check is problematic, `--minimal` reduces pre-flight further:
+skips probes, skips auth detection, uses `domcontentloaded` page load
+strategy, disables wait-for-data.
+
+**Headless/automated mode** (`--headless --timeout N`) is the exception: no
+human is present, so auth detection and probes are necessary. The user
+requesting headless mode implicitly accepts the connection overhead.
+
+**Consequence:** The default interactive capture goes from 5 pre-flight
+requests to 1, or 2 once the session check from ADR-9 is counted. Most
+devices work without any flags.
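The explicit-scheme requirement in ADR-2 could be enforced with a check along these lines. This is a minimal sketch using only the standard library; `require_explicit_scheme` and the wording of the error message are assumptions, not the project's actual code.

```python
from urllib.parse import urlsplit


def require_explicit_scheme(target: str) -> str:
    """Return `target` unchanged if it is an http:// or https:// URL.

    Raises ValueError for bare hostnames/IPs and other schemes, so the
    caller never has to guess a scheme or probe both.
    """
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(
            f"{target!r} must start with http:// or https:// "
            "(e.g. http://192.168.1.1)"
        )
    return target
```

Rejecting the input up front means the single connectivity GET always goes to one known URL, which is what keeps the pre-flight count predictable.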
+
+## ADR-3: Probes Are Opt-In Diagnostics
+
+**Context:** Pre-capture probes capture the device's 401 response,
+`WWW-Authenticate` headers, and `Set-Cookie` data. cable_modem_monitor's
+intake pipeline uses this to reverse-engineer auth patterns. But the probe
+data is also present in the HAR itself (the browser's first request to a
+401 endpoint is recorded).
+
+**Decision:** Probes are opt-in via `--diagnostics`. The HAR already
+contains the auth exchange from the browser's natural interaction. Probes
+add a pre-Playwright snapshot that's useful when `http_credentials`
+suppresses the 401, which only happens in automated mode.
+
+**Consequence:** har-capture stays domain-agnostic. Downstream consumers
+(CMM intake) request `--diagnostics` when they need probe metadata. Default
+users don't pay the connection cost.
+
+## ADR-4: Auto-Fallback for Persistent-Connection Devices
+
+**Context:** `page.goto(url, wait_until="networkidle")` requires 500ms of
+zero network activity. Some devices (Compal CH7465MT) keep persistent
+polling/heartbeat connections, so `networkidle` never resolves. The
+`wait-for-data` mechanism (2s of zero pending XHR/fetch) has the same
+problem.
+
+**Decision:** Auto-detect and fall back. The initial `page.goto` uses
+`networkidle` with a 15-second timeout. If it times out (the definitive
+signal that the device has persistent connections), the system:
+
+1. Falls back to `domcontentloaded` (the page is already loaded — the wait
+   condition failed, not the navigation)
+1. Disables quiescence checks for the rest of the session
+1. Logs the fallback so the user knows what happened
+
+This is the same pattern as protocol negotiation — try the better option,
+catch the definitive failure, fall back. No heuristics, no guessing.
+
+Normal devices resolve `networkidle` in under 5 seconds. The 15-second
+timeout gives headroom for slow devices while catching persistent-connection
+devices without excessive wait.
+
+**Consequence:** The user never needs to know about page load strategies.
+The tool auto-adapts. `--minimal` remains as an escape hatch for the rare
+case where even the auto-fallback's 15-second wait is unacceptable, or
+where the connectivity check's single GET exhausts the device's session.
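The try-then-fall-back pattern in ADR-4 might look roughly like the following. This is a hedged sketch: `goto_with_fallback` is a hypothetical helper, and the exception type is passed in so the example stays runnable without Playwright installed. With Playwright's sync API, the caller would pass `playwright.sync_api.TimeoutError` and a real `Page`.

```python
from typing import Any, Tuple, Type


def goto_with_fallback(
    page: Any,
    url: str,
    timeout_error: Type[Exception],
    timeout_ms: int = 15_000,
) -> Tuple[str, bool]:
    """Navigate with `networkidle`, falling back if it never resolves.

    Returns (strategy_used, quiescence_enabled).
    """
    try:
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        return "networkidle", True
    except timeout_error:
        # The page itself has loaded; only the idle condition failed.
        # Wait for the weaker condition and stop checking for quiescence.
        page.wait_for_load_state("domcontentloaded")
        print("networkidle timed out; falling back to domcontentloaded")
        return "domcontentloaded", False
```

The returned flag lets the rest of the session skip wait-for-data checks after a fallback, matching step 2 of the decision.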
+
+## ADR-5: Domain-Agnostic Core, Domain Knowledge via Data
+
+**Context:** har-capture serves multiple consumers (cable modem monitor,
+printer admin panels, IoT hubs, SaaS dashboards). Device-specific knowledge
+(safe values, heuristic detectors, HTML scanner config) varies across
+domains.
+
+**Decision:** The sanitization engine has no knowledge of any particular
+device. Domain knowledge is loaded from JSON pattern files at runtime via
+`--patterns`. Core pattern files (`pii.json`, `sensitive.json`,
+`allowlist.json`) contain only universal PII rules.
+
+**Consequence:** Adding support for a new product category requires a JSON
+file, not code changes. Consumers ship their own pattern files.
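To illustrate how domain knowledge can arrive as data rather than code, here is a small sketch of loading and merging pattern files. The file schema is an assumption made for this example (a flat JSON object mapping rule names to regular expressions); the project's real pattern file format is not shown in this document, and `load_patterns` is a hypothetical name.

```python
import json
import re
from pathlib import Path
from typing import Dict, Iterable, Pattern


def load_patterns(paths: Iterable[Path]) -> Dict[str, Pattern[str]]:
    """Merge {"rule_name": "regex"} objects from several JSON files.

    Later files override earlier ones, so a consumer-supplied file can
    extend or replace core rules without any code change.
    """
    merged: Dict[str, Pattern[str]] = {}
    for path in paths:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        for name, regex in data.items():
            merged[name] = re.compile(regex)
    return merged
```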
+
+## ADR-6: Two-Pass Sanitization Model
+
+**Context:** Automated PII detection has false positives. Aggressive
+auto-redaction can destroy debugging utility. Conservative detection misses
+real PII.
+
+**Decision:** Pass 1 auto-redacts high-confidence PII (MACs, IPs, emails,
+passwords, tokens). Pass 2 presents ambiguous values for interactive review
+— the user sees the value, its context, why it was flagged, and decides
+whether to redact.
+
+**Consequence:** The tool is safe by default (Pass 1 catches universal PII)
+while giving the user control over edge cases. Non-interactive mode
+(CI/headless) writes flagged values to a JSON report instead.
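The two-pass split in ADR-6 can be sketched as one function that redacts high-confidence matches and returns everything else that needs a human decision. The patterns and the `long_hex` review rule below are illustrative assumptions, not the project's rule set.

```python
import re
from typing import List, Tuple

# Pass 1: high-confidence patterns (illustrative, not exhaustive).
AUTO_REDACT = {
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}
# Pass 2: ambiguous values a human should judge. Hypothetical rule: a long
# hex string may be a session token or a harmless firmware hash.
REVIEW = {"long_hex": re.compile(r"\b[0-9a-fA-F]{16,}\b")}


def sanitize(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (text after Pass 1, [(rule, value), ...] flagged for Pass 2)."""
    for name, pattern in AUTO_REDACT.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    flagged = [
        (name, match.group(0))
        for name, pattern in REVIEW.items()
        for match in pattern.finditer(text)
    ]
    return text, flagged
```

In interactive mode the flagged list would drive the review prompts; in non-interactive mode the same list could be serialized to the JSON report the decision mentions.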
+
+## ADR-7: XML POST Bodies Are Sanitized via Two Layers
+
+**Context:** Devices with XML APIs (e.g., Compal CH7465MT) send POST
+bodies with `text/xml` or `application/xml` MIME types containing session
+tokens, encrypted credentials, and device data.
+
+**Decision:** XML POST body sanitization uses two layers:
+
+1. `_sanitize_xml_fields()` — Parses XML, checks element tag names and
+   attribute names against sensitive field patterns, redacts matching values.
+   This mirrors how the JSON and form-urlencoded handlers check field names.
+1. `sanitize_html()` — The existing 17-pass scanner runs on the XML text
+   to catch pattern-based PII (MACs, IPs, emails) that field-name checking
+   misses.
+
+**Consequence:** Both field-name-based and pattern-based PII are caught.
+The HTML scanner already handles XML content (used for `text/xml`
+responses), so no new engine is needed. Malformed XML falls through
+gracefully.
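The first layer in ADR-7 (field-name checking on parsed XML) might be sketched as below. This is not the project's `_sanitize_xml_fields()`; the function name, the `SENSITIVE_NAME` pattern, and the placeholder text are assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sensitive-name pattern; the real rules come from pattern files.
SENSITIVE_NAME = re.compile(r"(password|passwd|token|session|secret)", re.IGNORECASE)


def sanitize_xml_fields(body: str, placeholder: str = "[REDACTED]") -> str:
    """Redact element text and attribute values whose names look sensitive.

    Malformed XML is returned unchanged so that a later text-based scan
    can still process it.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return body
    for elem in root.iter():
        if SENSITIVE_NAME.search(elem.tag) and elem.text and elem.text.strip():
            elem.text = placeholder
        for attr in list(elem.attrib):
            if SENSITIVE_NAME.search(attr):
                elem.set(attr, placeholder)
    return ET.tostring(root, encoding="unicode")
```

Returning the input untouched on a parse error is what lets the second layer (the pattern-based text scanner) act as the safety net for malformed bodies.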
+
+## ADR-8: Duplicate Connectivity Check Eliminated
+
+**Context:** `capture_device_har()` internally called
+`check_device_connectivity()` to determine the URL scheme. But the CLI
+workflow already called this in Phase 2. The result was two identical GET
+requests before the browser opened.
+
+**Decision:** The CLI now passes the pre-computed `target_url` to
+`capture_device_har()`. When provided, the internal connectivity check is
+skipped. When called directly (library API without CLI), the check still
+runs.
+
+**Consequence:** One fewer pre-flight HTTP request for all capture modes.
+Library API backward-compatible (new parameter has a `None` default).
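The backward-compatible parameter described in ADR-8 follows a common pattern: an optional argument with a `None` default that short-circuits the expensive step. The sketch below uses hypothetical names (`resolve_target_url`, `check_connectivity`) and a stand-in callable instead of a real network probe.

```python
from typing import Callable, Optional


def resolve_target_url(
    host_or_url: str,
    check_connectivity: Callable[[str], str],
    target_url: Optional[str] = None,
) -> str:
    """Return the URL the browser should open.

    `check_connectivity` stands in for the real probe and is only called
    when the caller has not already resolved `target_url`.
    """
    if target_url is None:
        # Library path: nobody checked yet, so spend the one GET here.
        target_url = check_connectivity(host_or_url)
    # CLI path: reuse the pre-checked URL, no extra request.
    return target_url
```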
+
+## ADR-9: Session Contamination Guard
+
+**Context:** 6 of 36 catalog HARs in the MCP intake pipeline failed
+validation because the browser had an existing session when capture
+started. Failure signatures: first request carries `Secure`,
+`XSRF_TOKEN`, `PHPSESSID` session cookies, or `Authorization` headers.
+The login flow is missing, making the HAR useless for auth analysis.
+
+**Decision:** Three layered defenses, in priority order:
+
+1. **Force clean browser context.**
+   `storage_state={"cookies": [], "origins": []}` is set on every
+   Playwright context. This prevents all cookie/credential inheritance
+   regardless of how the browser was launched. This single change
+   prevents the failure signatures seen in all 6 affected HARs.
+
+1. **Pre-flight session check.** Before launching Playwright, an
+   unauthenticated GET checks whether the device serves data content
+   (no login page). If so, the device has a live session from another
+   source (another tab, previous connection from the same IP), and the
+   workflow aborts with a clear message. Skipped in `--minimal` mode.
+
+1. **Pre-capture cookie audit.** `context.cookies()` is called
+   immediately after context creation and before any navigation. The
+   result is emitted as `_solentlabs.pre_capture_cookies` in the HAR.
+   With the clean storage state, this should always be empty — a
+   non-empty list is a diagnostic signal for downstream tools.
+
+**Consequence:** The default workflow adds one pre-flight GET (session
+check). `--minimal` skips it for session-constrained devices. The
+pre-capture cookie audit has zero network cost — it reads local context
+state. All three defenses are additive and composable.
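The pre-flight session check in ADR-9 decides between "login page" and "data content" from an unauthenticated response. One way to sketch that decision is shown below; the indicator patterns and the function name `is_session_contaminated` are assumptions for illustration, and real devices vary widely in how their login pages look.

```python
import re

# Hypothetical login-page indicators; not the project's actual list.
LOGIN_INDICATORS = (
    re.compile(r"<input[^>]+type=[\"']?password", re.IGNORECASE),
    re.compile(r"\blog\s?in\b|\bsign\s?in\b", re.IGNORECASE),
)


def is_session_contaminated(status: int, body: str) -> bool:
    """Guess whether an unauthenticated GET reached post-login content.

    A 401/403, or a page with login indicators, means the device is still
    asking for credentials, so the session is considered clean.
    """
    if status in (401, 403):
        return False
    return not any(pattern.search(body) for pattern in LOGIN_INDICATORS)
```

Under this heuristic a device with no login page at all would also be reported as contaminated; how the project treats that case is not stated in this document.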