Commit a54668e

kwschulz and claude authored
feat(cli): default command, always-on interactive, v0.5.0 (#34)
* feat(patterns): domain-driven pattern extensibility, wait-for-data SPA capture

Refactor the sanitization engine to be fully domain-agnostic. Heuristic detectors (SSID, device name) and safe-value patterns are now data-driven via domain JSON files loaded at runtime, replacing hardcoded Python functions. Add network-device as the first built-in domain.

Add wait-for-data mechanism for SPA-style device UIs that load data via async XHR/fetch after page render. Monkey-patches XMLHttpRequest.send and window.fetch to track in-flight requests, polls for 2s network quiescence (vs Playwright's 500ms networkidle), and uses a framenavigated listener to wait between page transitions.

Extract inline test data to JSON fixtures in tests/fixtures/ for all test modules with large data tables (heuristics, har, html, secrets, redaction, browser). Test files now load from fixtures via parametrize.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(cli): default command, always-on interactive review, v0.5.0

- `har-capture <URL>` works without typing `get` (falls back via _DefaultGetGroup when first arg isn't a known subcommand)
- Interactive review always enabled; `--no-interactive` removed from both `get` and `sanitize` commands. Non-TTY falls back to report.
- API defaults changed: `interactive=True` in capture_device_har() and run_capture_phase()
- Version bump to 0.5.0
- Documentation suite: architecture doc, 4 specs, use cases, CLI reference verified against implementation (12 HIGH findings fixed)
- Spec accuracy fixes: function signatures, phase ordering, scanner pass numbering, removed unimplemented features (_extends, html domain section), clarified API-only params (timeout, headless)
- RAW_DATA/ added to .gitignore

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add release pipeline checklist to CLAUDE.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ken Schulz <kwschulz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
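The default-command fallback described in the commit message can be sketched without any CLI framework. The commit says the project does this through `_DefaultGetGroup`; the function name `resolve_argv` and the exact subcommand set below are assumptions for illustration (the names come from the `cli/` comment in CLAUDE.md), not the project's code.

```python
from __future__ import annotations

# Assumption: subcommand names as listed for cli/ in CLAUDE.md.
KNOWN_SUBCOMMANDS = frozenset({"get", "sanitize", "validate", "patterns"})


def resolve_argv(args: list[str]) -> list[str]:
    """Prepend the default `get` command when the first argument is not a subcommand.

    Hypothetical helper: it only models the routing decision, not Typer/Click wiring.
    """
    if not args:
        return args  # nothing to route; the CLI would print its help
    first = args[0]
    if first.startswith("-") or first in KNOWN_SUBCOMMANDS:
        return args  # options and known subcommands pass through unchanged
    return ["get", *args]  # e.g. a bare URL or IP address becomes `get <URL>`
```

With this routing, `resolve_argv(["192.168.1.1"])` yields `["get", "192.168.1.1"]`, while `resolve_argv(["sanitize", "capture.har"])` is returned untouched.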
1 parent 06e1dba commit a54668e

42 files changed

Lines changed: 6246 additions & 1371 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ Thumbs.db
 *.har.gz
 captures/
 .secrets.baseline
+RAW_DATA/
 
 # Ruff
 .ruff_cache/

CHANGELOG.md

Lines changed: 21 additions & 1 deletion
@@ -7,6 +7,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.5.0] - 2026-03-29
+
+### Added
+
+- **Default command** `har-capture <URL>` now works without typing `get` (e.g., `har-capture 192.168.1.1`). The `get` subcommand still works as an explicit alias.
+- **Domain-driven pattern extensibility** — heuristic detectors (`CompiledDetector`), safe value patterns, and pipe-delimited variable matching are now data-driven via domain pattern files loaded with `--patterns`. See [Pattern Spec](docs/specs/PATTERN_SPEC.md).
+- **Wait-for-data SPA capture** — JavaScript init script monkey-patches `XMLHttpRequest.send` and `window.fetch` to track in-flight requests. Polls for 2 seconds of network quiescence (vs Playwright's 500ms `networkidle`). `framenavigated` listener ensures async data completes before page transitions.
+- **Test fixture extraction** — large test data moved from inline to `tests/fixtures/*.json`
+
+### Changed
+
+- **BREAKING**: Interactive review is now always enabled and cannot be disabled. The `--no-interactive` flag has been removed from both `get` and `sanitize` commands. In non-TTY environments (CI/CD), flagged values are written to a `.review.json` report file instead.
+- **BREAKING**: `capture_device_har()` and `run_capture_phase()` now default to `interactive=True` (was `False`). API consumers can still pass `interactive=False` explicitly.
+- Documentation suite rewritten — architecture doc, 4 specs, use cases, CLI reference all verified against implementation (76 findings resolved)
+
+### Fixed
+
+- 12 HIGH-severity documentation accuracy issues: wrong function signatures, wrong phase ordering, fabricated CLI flags (`--timeout`, `--headless`), fabricated pre-commit hook, wrong scanner pass numbering, unimplemented features documented as real (`_extends`, `html` domain section)
+
 ## [0.4.5] - 2026-03-09
 
 ### Fixed
@@ -416,4 +435,5 @@ har-capture sanitize input.har --patterns custom-allowlist.json
 [0.4.3]: https://github.com/solentlabs/har-capture/compare/v0.4.2...v0.4.3
 [0.4.4]: https://github.com/solentlabs/har-capture/compare/v0.4.3...v0.4.4
 [0.4.5]: https://github.com/solentlabs/har-capture/compare/v0.4.4...v0.4.5
-[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.4.5...HEAD
+[0.5.0]: https://github.com/solentlabs/har-capture/compare/v0.4.5...v0.5.0
+[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.5.0...HEAD
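The "2 seconds of network quiescence" rule from the wait-for-data changelog entry can be modeled as a small polling loop. This is only a sketch of the timing logic in Python: the project tracks in-flight requests in a browser-side JavaScript init script, and the function name, the `in_flight` callback, the 30-second timeout, and the 0.1-second poll interval are all assumptions made here for illustration.

```python
from __future__ import annotations

import time
from typing import Callable


def wait_for_quiescence(
    in_flight: Callable[[], int],
    quiet_seconds: float = 2.0,  # quiet window taken from the changelog entry
    timeout: float = 30.0,  # assumption: overall cap, not stated in the source
    poll_interval: float = 0.1,  # assumption: polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `in_flight()` has stayed at 0 for `quiet_seconds`.

    Returns False if the timeout elapses first. `clock` and `sleep` are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    quiet_since: float | None = None
    while clock() < deadline:
        now = clock()
        if in_flight() > 0:
            quiet_since = None  # any activity resets the quiet window
        elif quiet_since is None:
            quiet_since = now  # first idle observation starts the window
        elif now - quiet_since >= quiet_seconds:
            return True  # idle for the full window
        sleep(poll_interval)
    return False
```

In a real capture, `in_flight` would presumably read the counter maintained by the patched `XMLHttpRequest.send` and `window.fetch`; injecting the clock keeps the timing rule testable on its own.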

CLAUDE.md

Lines changed: 207 additions & 0 deletions
@@ -0,0 +1,207 @@
+# Claude Rules
+
+> **This file**: Core principles, behavioral constraints, and development rules.
+> The principles section is the foundation — internalize it before any work.
+
+## Core Principles
+
+These principles govern every change to this project. They are not
+guidelines — they are hard constraints. When in doubt, the principle
+wins over convenience.
+
+### Architecture
+
+1. **Separation of Concerns is non-negotiable.** Each module does one
+thing. `patterns/` loads and merges patterns. `sanitization/` removes
+PII. `capture/` records traffic. `validation/` checks results.
+`cli/` wires commands. No module reaches across boundaries.
+
+1. **DRY is non-negotiable.** If the same logic appears in 2+ places,
+extract a shared helper. Duplicated pattern loading, redaction
+checks, or detection logic are architecture bugs, not tech debt.
+
+1. **The core library has no CLI dependency.** `cli/` is a thin
+wrapper over the library. If a CLI command requires non-trivial
+logic, it belongs in the library, not the CLI module. API consumers
+(`sanitize_har_file()`, `validate_har()`) never import from `cli/`.
+
+1. **New features are additive only.** New domain pattern, new
+heuristic detector, new PII pattern, new scanner pass — none of
+these change existing code. Add a JSON file, register it, done.
+If adding a feature requires modifying unrelated modules, the
+architecture is wrong.
+
+1. **The core is domain-agnostic.** The sanitization engine has no
+knowledge of any particular device or application. Domain-specific
+knowledge (safe values, detectors, HTML scanner config) lives in
+domain pattern files loaded at runtime via `--patterns`. Core
+pattern files (`pii.json`, `sensitive.json`, `allowlist.json`)
+contain only universal PII rules.
+
+1. **Extensibility via data, not code.** Adding support for a new
+product category requires a JSON file, not code changes. Heuristic
+detectors, HTML scanners, PII patterns, and safe values are all
+configured through domain pattern files. If a domain pattern
+section requires code knowledge, the abstraction is wrong.
+
+### Specs and Documentation
+
+7. **Specs are the authority.** Code follows specs. No silent
+deviations. If the code needs to diverge, discuss the gap first,
+update the spec, then update the code.
+
+1. **Design decisions land in specs, not in conversation.** Every
+architectural decision made during a session must be committed to
+the relevant spec or architecture doc before the session ends.
+Conversation history is ephemeral — specs are durable.
+
+1. **Docs and code move together.** Every change reconciles the
+affected specs (ARCHITECTURE, CAPTURE_SPEC, SANITIZATION_SPEC,
+PATTERN_SPEC, VALIDATION_SPEC). A code change without a
+corresponding spec update is incomplete.
+
+### Code Quality
+
+10. **No shortcuts, no deferred structure.** If a better design is
+obvious, use it now. Don't optimise for speed of first draft.
+When a module grows past its natural boundary, restructure the
+whole module — don't bolt on the new thing and leave the rest.
+
+01. **Quality gates are not negotiable.** If mypy, ruff, or pytest
+fails, fix the code. Don't exclude files, skip checks, or weaken
+thresholds. Never bypass pre-commit hooks — fix failures, don't
+skip them. If hooks break, fix the hook setup first.
+
+01. **Test overrides are a code smell.** If reaching coverage requires
+heavy mocking, monkeypatching, or test overrides, the code
+structure is wrong. Restructure the code (extract dependency, make
+injectable), don't paper over it with test complexity.
+
+### Testing
+
+13. **Table-driven tests by default.** Identify the pattern BEFORE
+writing tests. If 3+ tests share the same setup→call→assert
+structure, start with `@pytest.mark.parametrize`.
+
+01. **Test data lives in JSON fixtures.** No inline data blobs in
+test files. Large test data (dicts, pattern lists, test case
+tables) goes in `tests/fixtures/*.json`. Test files load fixtures
+and convert to tuples for parametrize. Schema tests use fixture
+files; behavioural tests stay inline.
+
+01. **Coverage threshold is 75%.** Defined in `pyproject.toml`. Patch
+target 80% informational (`codecov.yml`). Don't game coverage —
+if a module is hard to test, restructure it.
+
+### Process
+
+16. **Only the developer merges PRs and takes irreversible actions.**
+Never merge a PR, force push, delete branches, or create releases
+without explicit approval. "Ready to merge?" is not "merge it."
+
+01. **No external actions without discussion.** Never create GitHub
+issues, PRs, pushes, label changes, or any external-facing action
+without explicit discussion first.
+
+01. **Conventional commits.** Commitizen pre-commit hook requires the
+format: `type(scope): message` (e.g., `feat(patterns):`,
+`fix(sanitization):`, `docs:`, `chore(release):`).
+
+## Architecture and Specifications
+
+| Document | Scope |
+| --------------------------------- | --------------------------------------------------------- |
+| `docs/ARCHITECTURE.md` | Design constraints, system shape, component relationships |
+| `docs/specs/CAPTURE_SPEC.md` | Playwright session, wait-for-data, workflow phases |
+| `docs/specs/SANITIZATION_SPEC.md` | HAR/HTML/heuristic engines, two-pass model, hasher |
+| `docs/specs/PATTERN_SPEC.md` | Pattern file schemas, domain files, merge order, loader |
+| `docs/specs/VALIDATION_SPEC.md` | PII leak detection, check functions, pre-commit hook |
+| `docs/USE_CASES.md` | User-facing use case catalog |
+
+## Project Layout
+
+```
+src/har_capture/
+├── patterns/       # Pattern loading, merging, redaction checking, hashing
+│   └── domains/    # Built-in domain pattern files (network_device.json)
+├── sanitization/   # HAR engine, HTML engine, heuristic engine
+├── capture/        # Playwright recording, wait-for-data (optional dep)
+├── validation/     # PII leak detection for pre-commit and CLI
+└── cli/            # Typer commands (get, sanitize, validate, patterns)
+tests/
+├── fixtures/       # JSON test data (one per test module)
+├── test_capture/
+├── test_sanitization/
+├── test_patterns/
+├── test_validation/
+└── test_cli/
+```
+
+## Development
+
+```bash
+# Run tests (excludes integration tests requiring Playwright)
+.venv/bin/python3 -m pytest tests/ -v --tb=short -m "not integration"
+
+# Release (after merge to main)
+git checkout main && git pull && python scripts/release.py X.Y.Z
+```
+
+## Release Flow
+
+All work ships in **one PR**: code + tests + changelog + version bump.
+No separate release PR. No tagging from feature branches.
+
+### PR Checklist (before merge)
+
+- [ ] Version bumped in **both** `pyproject.toml` and `src/har_capture/__init__.py`
+- [ ] `CHANGELOG.md` has a `## [X.Y.Z] - YYYY-MM-DD` section with changes
+- [ ] `CHANGELOG.md` has a `[X.Y.Z]` comparison link at the bottom
+- [ ] `CHANGELOG.md` `[unreleased]` link updated to compare from `vX.Y.Z`
+- [ ] Tests pass: `.venv/bin/python3 -m pytest tests/ -v --tb=short -m "not integration"`
+- [ ] Pre-commit hooks pass: `.venv/bin/python3 -m pre_commit run --all-files`
+- [ ] Commit message follows conventional format: `type(scope): message`
+
+### Pipeline: PR → PyPI
+
+```
+1. Push to feature branch
+   └─ No CI (only main + PRs trigger CI)
+
+2. Create PR targeting main
+   └─ ci.yml triggers: tests on Python 3.10-3.13, coverage + integration tests
+   └─ PR must pass before merge
+
+3. Merge PR to main (developer only)
+   └─ ci.yml triggers again on the merge commit
+   └─ This is the commit release.py will validate
+
+4. Run release script (developer only)
+   $ git checkout main && git pull
+   $ python scripts/release.py X.Y.Z # or --dry-run first
+   └─ Validates: on main, clean worktree, no existing tag
+   └─ Validates: CI passed on HEAD commit (via GitHub API)
+   └─ Validates: version consistent across pyproject.toml, __init__.py, CHANGELOG.md
+   └─ Runs tests + ruff + mypy locally
+   └─ Creates and pushes annotated tag vX.Y.Z
+
+5. Tag push triggers three GitHub Actions workflows:
+   ├─ tag-protection.yml: verifies tag → main, CI passed, version matches
+   ├─ release.yml: extracts CHANGELOG section → creates GitHub Release
+   └─ publish.yml: builds sdist+wheel → publishes to PyPI (trusted publishing)
+```
+
+### CHANGELOG Format
+
+Uses [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). Update in
+the same PR as the code change. Don't forget the comparison link at the
+bottom and the `[unreleased]` link update.
+
+### Recovery
+
+If tag push doesn't trigger workflows within ~60s, delete and re-push:
+
+```bash
+git tag -d vX.Y.Z && git push origin :refs/tags/vX.Y.Z
+git tag -a vX.Y.Z -m "Release X.Y.Z" && git push origin vX.Y.Z
+```
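The release flow in CLAUDE.md requires the version to agree across `pyproject.toml`, `__init__.py`, and `CHANGELOG.md`. A minimal sketch of such a check is below. It is not the project's `scripts/release.py`: the function names, the `__version__` variable name, and the regular expressions are assumptions, chosen to match the changelog heading format `## [X.Y.Z] - YYYY-MM-DD` shown in the checklist.

```python
from __future__ import annotations

import re

# Assumption: each file declares its version on its own line in these forms.
VERSION_PATTERNS = {
    "pyproject.toml": r'^version\s*=\s*"([^"]+)"',
    "__init__.py": r'^__version__\s*=\s*"([^"]+)"',
    # First released section; "## [Unreleased]" does not match the digits.
    "CHANGELOG.md": r"^## \[(\d+\.\d+\.\d+)\]",
}


def find_versions(pyproject: str, init_py: str, changelog: str) -> dict[str, str | None]:
    """Extract the version each file's text declares (None when absent)."""
    texts = {"pyproject.toml": pyproject, "__init__.py": init_py, "CHANGELOG.md": changelog}
    found = {}
    for name, pattern in VERSION_PATTERNS.items():
        match = re.search(pattern, texts[name], flags=re.MULTILINE)
        found[name] = match.group(1) if match else None
    return found


def versions_consistent(expected: str, versions: dict[str, str | None]) -> bool:
    """True only when every file declares exactly the expected version."""
    return all(version == expected for version in versions.values())
```

Taking file contents as strings rather than paths keeps the check free of I/O, which is in the spirit of the "make injectable" rule above.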

README.md

Lines changed: 4 additions & 7 deletions
@@ -18,7 +18,7 @@ Capture and sanitize [HAR (HTTP Archive)](https://w3c.github.io/web-performance/
 
 ```bash
 pip install har-capture[full]
-python -m har_capture get https://example.com
+python -m har_capture https://example.com
 ```
 
 </details>
@@ -28,7 +28,7 @@ python -m har_capture get https://example.com
 
 ```bash
 pip install har-capture[full]
-har-capture get https://example.com
+har-capture https://example.com
 ```
 
 </details>
@@ -106,15 +106,12 @@ ______________________________________________________________________
 ### Command Line
 
 ```bash
-# Capture and sanitize
-har-capture get https://example.com
+# Capture and sanitize (interactive review always enabled)
+har-capture https://example.com
 
 # Sanitize existing HAR
 har-capture sanitize capture.har
 
-# Interactive mode (review suspicious values)
-har-capture sanitize capture.har --interactive
-
 # Validate for PII leaks
 har-capture validate capture.har
 ```
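With interactive review now always on, the changelog states that non-TTY environments get a `.review.json` report instead of prompts. The sketch below shows one way that decision could look. The function name, the report file naming (`capture.har` to `capture.review.json`), and the JSON shape are assumptions; the source only says that flagged values are written to a `.review.json` report file.

```python
from __future__ import annotations

import json
import sys
from pathlib import Path
from typing import TextIO


def review_flagged_values(
    har_path: Path, flagged: list[str], stdin: TextIO = sys.stdin
) -> Path | None:
    """Choose between interactive review and a report file.

    Returns None when stdin is a TTY (the caller would prompt the user),
    otherwise writes a report next to the HAR file and returns its path.
    """
    if stdin.isatty():
        return None  # interactive terminal: prompting happens elsewhere
    # Assumption: report sits beside the HAR file with a .review.json suffix.
    report = har_path.with_name(har_path.stem + ".review.json")
    report.write_text(json.dumps({"flagged": flagged}, indent=2), encoding="utf-8")
    return report
```

Passing `stdin` as a parameter lets a CI-style run be simulated with any non-TTY stream, for example an `io.StringIO`.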
