| 1 | +# Architecture Decisions |
| 2 | + |
| 3 | +Design rationale and "why" behind architectural choices. Each decision |
| 4 | +records the context, the choice made, and the reasoning — so future |
| 5 | +contributors understand the tradeoffs rather than just the outcome. |
| 6 | + |
| 7 | +## ADR-1: Capture is User-Driven, Not Automated |
| 8 | + |
| 9 | +**Context:** har-capture's primary purpose is to help a user sanitize and |
| 10 | +package observed browser traffic so a downstream system (like |
| 11 | +cable_modem_monitor) can reverse-engineer device APIs. The user navigates |
| 12 | +the device's web interface, logs in, visits pages — the tool records |
| 13 | +everything. |
| 14 | + |
| 15 | +**Decision:** The default capture mode is interactive. The user drives the |
| 16 | +browser. har-capture records, sanitizes, and packages. |
| 17 | + |
| 18 | +**Consequence:** The tool should not attempt to automate device interaction |
| 19 | +(login, navigation) in the default path. Automated/headless mode exists for |
| 20 | +CI and advanced users but is not the primary use case. |
| 21 | + |
| 22 | +## ADR-2: Minimal Pre-Flight in Interactive Mode |
| 23 | + |
| 24 | +**Context:** The capture workflow originally ran 5 network requests before |
| 25 | +opening Playwright: connectivity check, auth challenge probe, HEAD probe, |
| 26 | +ICMP ping, and auth detection. For devices that allow only one concurrent |
| 27 | +session (e.g., Compal CH7465MT), the HTTP requests among them can exhaust |
| 28 | +the session slot before the browser opens. |
| 29 | + |
| 30 | +**Decision:** Interactive mode should make the fewest possible pre-flight |
| 31 | +HTTP requests. The browser handles auth dialogs, redirects, and errors |
| 32 | +naturally — the user is present to respond. |
| 33 | + |
| 34 | +- **Connectivity check (1 GET):** Retained. Validates the device is reachable |
| 35 | + on the user-provided scheme before launching Playwright. The URL must |
| 36 | + include an explicit `http://` or `https://` scheme. Without the check, |
| 37 | + Playwright would hang silently on unreachable devices. |
| 38 | +- **Auth detection:** Not needed in interactive mode. When a device responds |
| 39 | + with 401, Playwright shows a native Basic Auth dialog. The user enters |
| 40 | + credentials. Both the 401 and the authenticated retry are captured in the |
| 41 | + HAR — the full auth exchange is recorded. |
| 42 | +- **Probes (auth challenge, HEAD, ICMP):** Not needed in interactive mode. |
| 43 | + Probe data (401 headers, Set-Cookie) is captured naturally in the HAR when |
| 44 | + the browser navigates. Probes were added because Playwright's |
| 45 | + `http_credentials` suppresses the 401 — but interactive mode does not use |
| 46 | + `http_credentials`. |
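
The explicit-scheme requirement on the connectivity check can be illustrated with a small validator. This is a minimal sketch, not har-capture's actual code: the function name `require_explicit_scheme` and the error message are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Schemes the connectivity check is described as accepting.
ALLOWED_SCHEMES = {"http", "https"}


def require_explicit_scheme(url: str) -> str:
    """Return `url` unchanged if it has an explicit http(s) scheme and a host.

    Hypothetical helper: raises ValueError instead of guessing a scheme.
    """
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError(f"URL must start with http:// or https://, got {url!r}")
    return url
```

Rejecting a bare address such as `192.168.100.1` up front avoids a second request to work out the scheme, which matters on devices with a single session slot.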
| 47 | + |
| 48 | +**`--minimal` flag:** For the edge case where even the single connectivity |
| 49 | +check is problematic, `--minimal` reduces pre-flight further: it skips |
| 50 | +probes and auth detection, uses the `domcontentloaded` page load strategy, |
| 51 | +and disables wait-for-data. |
| 52 | + |
| 53 | +**Headless/automated mode** (`--headless --timeout N`) is the exception: no |
| 54 | +human is present, so auth detection and probes are necessary. The user |
| 55 | +requesting headless mode implicitly accepts the connection overhead. |
| 56 | + |
| 57 | +**Consequence:** The default interactive capture goes from 5 pre-flight |
| 58 | +requests to 1. Most devices work without any flags. |
| 59 | + |
| 60 | +## ADR-3: Probes Are Opt-In Diagnostics |
| 61 | + |
| 62 | +**Context:** Pre-capture probes capture the device's 401 response, |
| 63 | +`WWW-Authenticate` headers, and `Set-Cookie` data. cable_modem_monitor's |
| 64 | +intake pipeline uses this to reverse-engineer auth patterns. But the probe |
| 65 | +data is also present in the HAR itself (the browser's first request to a |
| 66 | +401 endpoint is recorded). |
| 67 | + |
| 68 | +**Decision:** Probes are opt-in via `--diagnostics`. The HAR already |
| 69 | +contains the auth exchange from the browser's natural interaction. Probes |
| 70 | +add a pre-Playwright snapshot that's useful when `http_credentials` |
| 71 | +suppresses the 401, which only happens in automated mode. |
| 72 | + |
| 73 | +**Consequence:** har-capture stays domain-agnostic. Downstream consumers |
| 74 | +(CMM intake) request `--diagnostics` when they need probe metadata. Default |
| 75 | +users don't pay the connection cost. |
| 76 | + |
| 77 | +## ADR-4: Auto-Fallback for Persistent-Connection Devices |
| 78 | + |
| 79 | +**Context:** `page.goto(url, wait_until="networkidle")` requires 500ms of |
| 80 | +zero network activity. Some devices (Compal CH7465MT) keep persistent |
| 81 | +polling/heartbeat connections, so `networkidle` never resolves. The |
| 82 | +`wait-for-data` mechanism (2s of zero pending XHR/fetch) has the same |
| 83 | +problem. |
| 84 | + |
| 85 | +**Decision:** Auto-detect and fall back. The initial `page.goto` uses |
| 86 | +`networkidle` with a 15-second timeout. If it times out (the definitive |
| 87 | +signal that the device has persistent connections), the system: |
| 88 | + |
| 89 | +1. Falls back to `domcontentloaded` (the page is already loaded — the wait |
| 90 | + condition failed, not the navigation) |
| 91 | +1. Disables quiescence checks for the rest of the session |
| 92 | +1. Logs the fallback so the user knows what happened |
| 93 | + |
| 94 | +This is the same pattern as protocol negotiation — try the better option, |
| 95 | +catch the definitive failure, fall back. No heuristics, no guessing. |
| 96 | + |
| 97 | +Normal devices resolve `networkidle` in under 5 seconds. The 15-second |
| 98 | +timeout gives headroom for slow devices while catching persistent-connection |
| 99 | +devices without excessive wait. |
| 100 | + |
| 101 | +**Consequence:** The user never needs to know about page load strategies. |
| 102 | +The tool auto-adapts. `--minimal` remains as an escape hatch for the rare |
| 103 | +case where even the auto-fallback's 15-second wait is unacceptable, or |
| 104 | +where the connectivity check's single GET exhausts the device's session. |
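
The try-then-fall-back sequence described in this ADR could look roughly like the sketch below. It assumes `page` behaves like a Playwright `Page` (it only uses `goto` and `wait_for_load_state`). The names `goto_with_fallback` and `LoadResult` are hypothetical, and in real use `timeout_error` would be Playwright's own `TimeoutError` class rather than the builtin used here as a default.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("har_capture.sketch")

NETWORKIDLE_TIMEOUT_MS = 15_000  # the 15-second budget from this ADR


@dataclass
class LoadResult:
    strategy: str             # wait condition that was finally used
    quiescence_enabled: bool  # False once persistent connections are detected


def goto_with_fallback(page, url, timeout_error=TimeoutError):
    """Try `networkidle` first; on timeout, settle for `domcontentloaded`."""
    try:
        page.goto(url, wait_until="networkidle", timeout=NETWORKIDLE_TIMEOUT_MS)
        return LoadResult("networkidle", True)
    except timeout_error:
        # Per the ADR, the wait condition failed, not the navigation,
        # so the weaker load state is expected to be reached already.
        page.wait_for_load_state("domcontentloaded")
        log.warning(
            "networkidle timed out after %d ms; using domcontentloaded "
            "and disabling quiescence checks for this session",
            NETWORKIDLE_TIMEOUT_MS,
        )
        return LoadResult("domcontentloaded", False)
```

Returning the quiescence flag alongside the strategy lets the caller skip wait-for-data for the rest of the session without re-detecting the condition.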
| 105 | + |
| 106 | +## ADR-5: Domain-Agnostic Core, Domain Knowledge via Data |
| 107 | + |
| 108 | +**Context:** har-capture serves multiple consumers (cable modem monitor, |
| 109 | +printer admin panels, IoT hubs, SaaS dashboards). Device-specific knowledge |
| 110 | +(safe values, heuristic detectors, HTML scanner config) varies across |
| 111 | +domains. |
| 112 | + |
| 113 | +**Decision:** The sanitization engine has no knowledge of any particular |
| 114 | +device. Domain knowledge is loaded from JSON pattern files at runtime via |
| 115 | +`--patterns`. Core pattern files (`pii.json`, `sensitive.json`, |
| 116 | +`allowlist.json`) contain only universal PII rules. |
| 117 | + |
| 118 | +**Consequence:** Adding support for a new product category requires a JSON |
| 119 | +file, not code changes. Consumers ship their own pattern files. |
| 120 | + |
| 121 | +## ADR-6: Two-Pass Sanitization Model |
| 122 | + |
| 123 | +**Context:** Automated PII detection has false positives. Aggressive |
| 124 | +auto-redaction can destroy debugging utility. Conservative detection misses |
| 125 | +real PII. |
| 126 | + |
| 127 | +**Decision:** Pass 1 auto-redacts high-confidence PII (MACs, IPs, emails, |
| 128 | +passwords, tokens). Pass 2 presents ambiguous values for interactive review |
| 129 | +— the user sees the value, its context, why it was flagged, and decides |
| 130 | +whether to redact. |
| 131 | + |
| 132 | +**Consequence:** The tool is safe by default (Pass 1 catches universal PII) |
| 133 | +while giving the user control over edge cases. Non-interactive mode |
| 134 | +(CI/headless) writes flagged values to a JSON report instead. |
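
A compact way to picture the two passes is sketched below. The regular expressions, labels, and the function name `two_pass_sanitize` are illustrative assumptions, not the project's actual rules; only the split between auto-redaction and flag-for-review comes from this ADR.

```python
import re

# Pass 1: patterns treated as high-confidence PII (illustrative only).
HIGH_CONFIDENCE = {
    "MAC": re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Pass 2: values that might be PII and need a human decision.
AMBIGUOUS = {
    "long hex string": re.compile(r"\b[0-9a-f]{16,}\b"),
}


def two_pass_sanitize(text):
    """Return (auto-redacted text, list of flagged values for review)."""
    redacted = text
    for label, pattern in HIGH_CONFIDENCE.items():
        redacted = pattern.sub(f"[{label}]", redacted)

    flagged = []
    for reason, pattern in AMBIGUOUS.items():
        for match in pattern.finditer(redacted):
            start = max(0, match.start() - 20)
            flagged.append({
                "value": match.group(0),
                "reason": reason,
                "context": redacted[start:match.end() + 20],
            })
    return redacted, flagged
```

In an interactive run, each flagged entry would be shown to the user for a decision. In a non-interactive run, the same list could be serialized with `json.dumps` as the report.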
| 135 | + |
| 136 | +## ADR-7: XML POST Bodies Are Sanitized via Two Layers |
| 137 | + |
| 138 | +**Context:** Devices with XML APIs (e.g., Compal CH7465MT) send POST |
| 139 | +bodies with `text/xml` or `application/xml` MIME types containing session |
| 140 | +tokens, encrypted credentials, and device data. |
| 141 | + |
| 142 | +**Decision:** XML POST body sanitization uses two layers: |
| 143 | + |
| 144 | +1. `_sanitize_xml_fields()` — Parses XML, checks element tag names and |
| 145 | + attribute names against sensitive field patterns, redacts matching values. |
| 146 | + This mirrors how the JSON and form-urlencoded handlers check field names. |
| 147 | +1. `sanitize_html()` — The existing 17-pass scanner runs on the XML text |
| 148 | + to catch pattern-based PII (MACs, IPs, emails) that field-name checking |
| 149 | + misses. |
| 150 | + |
| 151 | +**Consequence:** Both field-name-based and pattern-based PII are caught. |
| 152 | +The HTML scanner already handles XML content (used for `text/xml` |
| 153 | +responses), so no new engine is needed. Malformed XML falls through |
| 154 | +gracefully. |
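
The two layers can be sketched with the standard library alone. This is not the implementation of `_sanitize_xml_fields()` or `sanitize_html()`: the function `redact_xml_sketch`, the sensitive-name pattern, and the single MAC regex standing in for the full scanner are all assumptions. Running the pattern layer on malformed XML is likewise one possible reading of "falls through gracefully".

```python
import re
import xml.etree.ElementTree as ET

# Layer 1 input: field names considered sensitive (illustrative pattern).
SENSITIVE_NAME = re.compile(r"pass(word)?|token|session|secret", re.IGNORECASE)
# Layer 2 stand-in: one pattern-based rule instead of the full scanner.
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact_xml_sketch(xml_text):
    """Redact by element/attribute name, then by value pattern."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed XML: skip the field-name layer, keep the pattern layer.
        return MAC.sub("[MAC]", xml_text)

    for element in root.iter():
        if SENSITIVE_NAME.search(element.tag) and element.text:
            element.text = "[REDACTED]"
        for name in element.attrib:
            if SENSITIVE_NAME.search(name):
                element.attrib[name] = "[REDACTED]"

    serialized = ET.tostring(root, encoding="unicode")
    return MAC.sub("[MAC]", serialized)
```

Field-name matching catches opaque values such as encrypted credentials that no value pattern would recognize, while the pattern layer catches PII stored under innocuous element names.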
| 155 | + |
| 156 | +## ADR-8: Duplicate Connectivity Check Eliminated |
| 157 | + |
| 158 | +**Context:** `capture_device_har()` internally called |
| 159 | +`check_device_connectivity()` to determine the URL scheme. But the CLI |
| 160 | +workflow already called this in Phase 2. The result was two identical GET |
| 161 | +requests before the browser opened. |
| 162 | + |
| 163 | +**Decision:** The CLI now passes the pre-computed `target_url` to |
| 164 | +`capture_device_har()`. When provided, the internal connectivity check is |
| 165 | +skipped. When called directly (library API without CLI), the check still |
| 166 | +runs. |
| 167 | + |
| 168 | +**Consequence:** One fewer pre-flight HTTP request for all capture modes. |
| 169 | +The library API stays backward-compatible (new parameter defaults to `None`). |
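
The optional-parameter pattern behind this decision is shown below in isolation. The real signatures of `capture_device_har()` and `check_device_connectivity()` are not given here, so `resolve_capture_url` and its arguments are hypothetical stand-ins.

```python
from typing import Callable, Optional


def resolve_capture_url(
    device: str,
    connectivity_check: Callable[[str], str],
    target_url: Optional[str] = None,
) -> str:
    """Return the URL the browser should open.

    A caller that already ran the connectivity check (the CLI) passes
    `target_url`; a direct library caller omits it and the check runs here.
    """
    if target_url is not None:
        return target_url
    return connectivity_check(device)
```

Because the new argument defaults to `None`, existing callers keep their behavior and only the CLI path saves a request.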
| 170 | + |
| 171 | +## ADR-9: Session Contamination Guard |
| 172 | + |
| 173 | +**Context:** 6 of 36 catalog HARs in the MCP intake pipeline failed |
| 174 | +validation because the browser had an existing session when capture |
| 175 | +started. Failure signatures: first request carries `Secure`, |
| 176 | +`XSRF_TOKEN`, `PHPSESSID` session cookies, or `Authorization` headers. |
| 177 | +The login flow is missing, making the HAR useless for auth analysis. |
| 178 | + |
| 179 | +**Decision:** Three layered defenses, in priority order: |
| 180 | + |
| 181 | +1. **Force clean browser context.** `storage_state={"cookies": [], "origins": []}` is set on every Playwright context. This prevents |
| 182 | + all cookie/credential inheritance regardless of how the browser was |
| 183 | + launched. This single change prevents all 6 failure signatures. |
| 184 | + |
| 185 | +1. **Pre-flight session check.** Before launching Playwright, an |
| 186 | + unauthenticated GET checks whether the device serves data content |
| 187 | + rather than a login page. If so, the device has a live session from |
| 188 | + another source (another tab, previous connection from the same IP), and |
| 189 | + the workflow aborts with a clear message. Skipped in `--minimal` mode. |
| 190 | + |
| 191 | +1. **Pre-capture cookie audit.** `context.cookies()` is called |
| 192 | + immediately after context creation and before any navigation. The |
| 193 | + result is emitted as `_solentlabs.pre_capture_cookies` in the HAR. |
| 194 | + With the clean storage state, this should always be empty — a |
| 195 | + non-empty list is a diagnostic signal for downstream tools. |
| 196 | + |
| 197 | +**Consequence:** The default workflow adds one pre-flight GET (session |
| 198 | +check). `--minimal` skips it for session-constrained devices. The |
| 199 | +pre-capture cookie audit has zero network cost — it reads local context |
| 200 | +state. All three defenses are additive and composable. |
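
A downstream consumer could check a HAR for the failure signatures listed in the context with something like the sketch below. It relies only on the standard HAR layout (`log.entries[].request.cookies` and `.headers` as name/value lists). The function name `contamination_signals` and the output format are assumptions for illustration.

```python
# Cookie names taken from the failure signatures listed in this ADR.
SESSION_COOKIE_NAMES = {"Secure", "XSRF_TOKEN", "PHPSESSID"}


def contamination_signals(har):
    """List reasons the first request looks like it reused a session."""
    entries = har.get("log", {}).get("entries", [])
    if not entries:
        return []

    request = entries[0].get("request", {})
    signals = []
    for cookie in request.get("cookies", []):
        if cookie.get("name") in SESSION_COOKIE_NAMES:
            signals.append(f"cookie:{cookie['name']}")
    for header in request.get("headers", []):
        if header.get("name", "").lower() == "authorization":
            signals.append("header:Authorization")
    return signals
```

An empty result is what a capture taken with a clean browser context should produce. Any entry means the login flow is likely missing from the HAR.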