Commit 43ba07b

feat: v0.3.2 - HeuristicMode enum and custom patterns dict support
BREAKING CHANGE: Replace flag_suspicious parameter with heuristics: HeuristicMode

This release adds fine-grained control over heuristic detection behavior with three modes (DISABLED, FLAG, REDACT) and enables custom patterns from dict.

Added:
- HeuristicMode enum (DISABLED/FLAG/REDACT) for controlling heuristic behavior
- hash_sensitive_value() method for category-aware hashing
- Custom patterns from dict support (no temp files needed)
- 22 new tests for HeuristicMode functionality

Changed:
- BREAKING: Replace flag_suspicious bool with heuristics: HeuristicMode
- Custom patterns now applied first (precedence over generic patterns)
- Account ID and heuristics skip already-redacted values

Fixed:
- Custom patterns were loaded but never applied
- Account ID pattern re-hashing custom-redacted values
- Heuristics re-detecting values with multi-underscore prefixes

Closes #17
1 parent 5921de8 commit 43ba07b
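
The breaking change in this commit swaps a boolean for a three-state enum. The following is a minimal sketch of that shape, not the library's code: the member values are assumptions (the definition in `har_capture.sanitization.report` is not part of this diff), and `mode_from_legacy_flag` is a hypothetical helper that mirrors the mapping used at the CLI and browser-capture call sites.

```python
from enum import Enum


class HeuristicMode(Enum):
    """Illustrative stand-in; the string values are assumptions."""

    DISABLED = "disabled"  # only known patterns are redacted
    FLAG = "flag"          # suspicious values are reported for manual review
    REDACT = "redact"      # suspicious values are redacted automatically


def mode_from_legacy_flag(flag_suspicious: bool) -> HeuristicMode:
    """Hypothetical shim: old True maps to FLAG, old False to DISABLED."""
    return HeuristicMode.FLAG if flag_suspicious else HeuristicMode.DISABLED


legacy_true = mode_from_legacy_flag(True)
legacy_false = mode_from_legacy_flag(False)
```

`REDACT` has no equivalent under the old boolean, so callers who want auto-redaction have to opt in explicitly.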

17 files changed

Lines changed: 695 additions & 135 deletions

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,29 +1,31 @@
1-
---
2-
name: Bug report
3-
about: Report a bug in har-capture
4-
title: ''
5-
labels: bug
6-
assignees: ''
7-
---
1+
______________________________________________________________________
2+
3+
## name: Bug report about: Report a bug in har-capture title: '' labels: bug assignees: ''
84

95
## Description
6+
107
A clear description of the bug.
118

129
## Steps to Reproduce
10+
11+
1.
12+
1.
1313
1.
14-
2.
15-
3.
1614

1715
## Expected Behavior
16+
1817
What you expected to happen.
1918

2019
## Actual Behavior
20+
2121
What actually happened.
2222

2323
## Environment
24+
2425
- har-capture version:
2526
- Python version:
2627
- OS:
2728

2829
## Additional Context
30+
2931
Any other context, error messages, or screenshots.
Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,19 @@
1-
---
2-
name: Feature request
3-
about: Suggest a new feature or enhancement
4-
title: ''
5-
labels: enhancement
6-
assignees: ''
7-
---
1+
______________________________________________________________________
2+
3+
## name: Feature request about: Suggest a new feature or enhancement title: '' labels: enhancement assignees: ''
84

95
## Problem Statement
6+
107
What problem does this feature solve?
118

129
## Proposed Solution
10+
1311
How would you like this to work?
1412

1513
## Alternatives Considered
14+
1615
Any alternative solutions or workarounds you've considered.
1716

1817
## Additional Context
18+
1919
Any other context, examples, or mockups.

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,23 @@
11
## Summary
2+
23
Brief description of changes.
34

45
## Changes
6+
57
-
68

79
## Testing
10+
811
- [ ] Tests pass locally (`pytest`)
912
- [ ] Linting passes (`ruff check .`)
1013
- [ ] Type checking passes (`mypy src/`)
1114

1215
## Related Issues
16+
1317
Related to #
1418

1519
## Checklist
20+
1621
- [ ] Code follows project style guidelines
1722
- [ ] Self-review completed
1823
- [ ] Documentation updated (if applicable)

CHANGELOG.md

Lines changed: 78 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,79 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.3.2] - 2026-02-06
11+
12+
### Added
13+
14+
- **HeuristicMode Enum** - Fine-grained control over heuristic behavior with three modes:
15+
- `DISABLED` (default) - Skip heuristics, only redact known patterns (safe, backward compatible)
16+
- `FLAG` - Flag suspicious values for manual review (interactive mode)
17+
- `REDACT` - Auto-redact suspicious values (automated workflows)
18+
- **Custom Patterns from Dict** - `custom_patterns` now accepts dict or file path
19+
- Enables passing patterns directly from modem.yaml or other configs
20+
- No need to write temporary files for API integration
21+
- **Category-Aware Hashing** - New `hash_sensitive_value()` method for heuristically-detected values
22+
- Generates prefixed hashes: `WIFI_xxxxx`, `CRED_xxxxx`, `DEVICE_xxxxx`
23+
- Preserves correlation while indicating detection category
24+
25+
### Changed
26+
27+
- **BREAKING**: Replaced `flag_suspicious` boolean with `heuristics: HeuristicMode` parameter
28+
- Old: `sanitize_har(data, flag_suspicious=True)`
29+
- New: `sanitize_har(data, heuristics=HeuristicMode.FLAG)`
30+
- Affects: `sanitize_html()`, `sanitize_har()`, `sanitize_har_file()`, CLI, browser capture
31+
- **Custom Pattern Precedence** - Custom patterns now applied first, preventing generic patterns from overriding them
32+
- **Already-Redacted Protection** - Account ID and heuristic patterns skip already-redacted values (e.g., `MODEM_SN_xxxxx`)
33+
34+
### Fixed
35+
36+
- **Custom Patterns Not Applied** - Custom patterns were loaded but never applied during sanitization
37+
- **Pattern Re-Redaction** - Account ID pattern no longer re-hashes custom-redacted values
38+
- **Multi-Underscore Prefixes** - Heuristics now correctly skip prefixes with underscores (e.g., `MODEM_SN_`)
39+
40+
### Migration Guide
41+
42+
**For API Users:**
43+
44+
```python
45+
# Before (v0.3.1)
46+
from har_capture.sanitization.har import sanitize_har_file
47+
sanitize_har_file(path, flag_suspicious=True)
48+
49+
# After (v0.3.2)
50+
from har_capture.sanitization.har import sanitize_har_file
51+
from har_capture.sanitization.report import HeuristicMode
52+
53+
# Interactive mode (manual review)
54+
sanitize_har_file(path, heuristics=HeuristicMode.FLAG)
55+
56+
# Automated mode (auto-redact, may over-redact)
57+
sanitize_har_file(path, heuristics=HeuristicMode.REDACT)
58+
59+
# Safe mode (default, only known patterns)
60+
sanitize_har_file(path, heuristics=HeuristicMode.DISABLED)
61+
# or simply: sanitize_har_file(path)
62+
```
63+
64+
**For cable_modem_monitor Integration:**
65+
66+
```python
67+
# Pass custom patterns as dict
68+
modem_patterns = {
69+
"patterns": {
70+
"modem_serial": {
71+
"regex": r"SN[0-9]{10}",
72+
"replacement_prefix": "MODEM_SN"
73+
}
74+
}
75+
}
76+
sanitize_har_file(
77+
path,
78+
heuristics=HeuristicMode.REDACT,
79+
custom_patterns=modem_patterns # Dict instead of file path
80+
)
81+
```
82+
1083
## [0.3.1] - 2026-02-04
1184

1285
### Fixed
@@ -26,7 +99,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2699
- **Improved Instructions** - Clearer checkbox prompts ("Enter when done", pre-selected items noted)
27100
- **Better UX** - Simplified keybindings (A/N for all/none work now), removed redundant Ctrl+C mention
28101

29-
## \[0.3.0\] - 2026-02-04
102+
## [0.3.0] - 2026-02-04
30103

31104
### Added
32105

@@ -210,4 +283,7 @@ har-capture sanitize input.har --patterns custom-allowlist.json
210283
[0.2.3]: https://github.com/solentlabs/har-capture/compare/v0.2.2...v0.2.3
211284
[0.2.4]: https://github.com/solentlabs/har-capture/compare/v0.2.3...v0.2.4
212285
[0.2.5]: https://github.com/solentlabs/har-capture/compare/v0.2.4...v0.2.5
213-
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.2.5...HEAD
286+
[0.3.0]: https://github.com/solentlabs/har-capture/compare/v0.2.5...v0.3.0
287+
[0.3.1]: https://github.com/solentlabs/har-capture/compare/v0.3.0...v0.3.1
288+
[0.3.2]: https://github.com/solentlabs/har-capture/compare/v0.3.1...v0.3.2
289+
[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.3.2...HEAD

README.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -141,6 +141,7 @@ pip install har-capture[full]
141141

142142
```python
143143
from har_capture.sanitization import sanitize_html, sanitize_har
144+
from har_capture.sanitization.report import HeuristicMode
144145

145146
# Sanitize HTML (correlation-preserving by default)
146147
clean_html = sanitize_html(raw_html)
@@ -151,9 +152,26 @@ clean_html = sanitize_html(raw_html, salt="my-secret-key")
151152
# Use static placeholders (legacy mode)
152153
clean_html = sanitize_html(raw_html, salt=None)
153154

155+
# Enable heuristic detection for WiFi credentials, SSIDs, device names
156+
# DISABLED (default): Only redact known patterns
157+
# FLAG: Flag suspicious values for manual review
158+
# REDACT: Auto-redact suspicious values (may over-redact)
159+
clean_html = sanitize_html(raw_html, heuristics=HeuristicMode.REDACT)
160+
154161
# Sanitize HAR file
155162
from har_capture.sanitization import sanitize_har_file
156163
sanitize_har_file("capture.har") # Creates capture.sanitized.har
164+
165+
# Pass custom patterns as dict (e.g., from modem.yaml)
166+
custom_patterns = {
167+
"patterns": {
168+
"modem_serial": {
169+
"regex": r"SN[0-9]{10}",
170+
"replacement_prefix": "MODEM_SN"
171+
}
172+
}
173+
}
174+
sanitize_har_file("capture.har", custom_patterns=custom_patterns)
157175
```
158176

159177
### CLI

SECURITY.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ If you discover a security vulnerability in har-capture, please report it privat
77
**[Report a vulnerability](https://github.com/solentlabs/har-capture/security/advisories/new)**
88

99
Please include:
10+
1011
- Description of the vulnerability
1112
- Steps to reproduce
1213
- Potential impact
@@ -21,6 +22,7 @@ Please include:
2122
## Scope
2223

2324
This policy covers:
25+
2426
- PII leakage in sanitization (patterns missing sensitive data)
2527
- Credential exposure in HAR files
2628
- Code injection via malicious HAR input

docs/RESEARCH.md

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -14,18 +14,18 @@
1414

1515
### Alternative Tools Evaluated
1616

17-
| Project | Language | Type | Notes |
18-
|---------|----------|------|-------|
19-
| [Google har-sanitizer](https://github.com/google/har-sanitizer) | Python/JS | Web UI + REST API | No CLI, needs tests |
20-
| [Cloudflare har-sanitizer](https://blog.cloudflare.com/introducing-har-sanitizer-secure-har-sharing/) | JS | Web UI | JWT-focused |
21-
| [Edgio/har-tools](https://github.com/Edgio/har-tools) | JS | Web UI | Drag-drop interface |
22-
| [AbregaInc/har-cleaner](https://github.com/AbregaInc/har-cleaner) | TypeScript | Library | Jira integration |
23-
| [jfromaniello/har-sanitizer](https://github.com/jfromaniello/har-sanitizer) | JS | Library | Basic sanitization |
17+
| Project | Language | Type | Notes |
18+
| ----------------------------------------------------------------------------------------------------- | ---------- | ----------------- | ------------------- |
19+
| [Google har-sanitizer](https://github.com/google/har-sanitizer) | Python/JS | Web UI + REST API | No CLI, needs tests |
20+
| [Cloudflare har-sanitizer](https://blog.cloudflare.com/introducing-har-sanitizer-secure-har-sharing/) | JS | Web UI | JWT-focused |
21+
| [Edgio/har-tools](https://github.com/Edgio/har-tools) | JS | Web UI | Drag-drop interface |
22+
| [AbregaInc/har-cleaner](https://github.com/AbregaInc/har-cleaner) | TypeScript | Library | Jira integration |
23+
| [jfromaniello/har-sanitizer](https://github.com/jfromaniello/har-sanitizer) | JS | Library | Basic sanitization |
2424

2525
### Related Projects
2626

27-
| Project | Purpose |
28-
|---------|---------|
27+
| Project | Purpose |
28+
| ---------------------------------------------------------------------------------------------------------------------- | ------------------------------- |
2929
| [GSMA TSG Diagnostic Interface](https://github.com/GSMATerminals/TSG-IoT-devices-Standard-Diagnostic-Interface-Public) | Modem logging for Cat-M1/NB-IoT |
30-
| [NYU IoT Inspector](https://github.com/nyu-mlab/iot-inspector-client) | Smart home traffic analysis |
31-
| [IoTShark](https://github.com/sahilmgandhi/IotShark) | IoT traffic monitoring |
30+
| [NYU IoT Inspector](https://github.com/nyu-mlab/iot-inspector-client) | Smart home traffic analysis |
31+
| [IoTShark](https://github.com/sahilmgandhi/IotShark) | IoT traffic monitoring |

src/har_capture/capture/browser.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@
2727
install_browser_deps,
2828
)
2929
from har_capture.patterns import get_bloat_extensions
30+
from har_capture.sanitization.report import HeuristicMode
3031

3132
_LOGGER = logging.getLogger(__name__)
3233

@@ -436,7 +437,7 @@ def _cleanup_temp() -> None:
436437
_, sanitization_report = sanitize_har_file(
437438
str(temp_path),
438439
str(sanitized_output),
439-
flag_suspicious=interactive,
440+
heuristics=HeuristicMode.FLAG if interactive else HeuristicMode.DISABLED,
440441
)
441442
result.sanitized_path = sanitized_output
442443
result.sanitization_report = sanitization_report

src/har_capture/cli/sanitize.py

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
import typer
1212

1313
from har_capture.patterns import PatternLoadError
14+
from har_capture.sanitization.report import HeuristicMode
1415

1516

1617
def sanitize(
@@ -163,16 +164,17 @@ def sanitize(
163164
else:
164165
typer.echo(" (Non-interactive mode: proceeding anyway)")
165166

166-
# Enable flag_suspicious when interactive mode is requested (even without TTY)
167-
flag_suspicious = run_heuristics
167+
# Determine heuristics mode
168+
# Interactive mode enables heuristics for flagging suspicious values
169+
heuristics = HeuristicMode.FLAG if run_heuristics else HeuristicMode.DISABLED
168170

169171
result_path, sanitization_report = sanitize_har_file(
170172
str(input_file),
171173
output_path,
172174
salt=effective_salt,
173175
custom_patterns=custom_patterns,
174176
max_size=max_size_bytes,
175-
flag_suspicious=flag_suspicious,
177+
heuristics=heuristics,
176178
)
177179

178180
# Interactive review mode (requires TTY)

src/har_capture/patterns/README.md

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -4,12 +4,12 @@ This directory contains JSON configuration files for PII detection, sanitization
44

55
## Files
66

7-
| File | Purpose |
8-
|------|---------|
9-
| `pii.json` | PII detection patterns (MAC, IP, email, etc.) |
10-
| `sensitive.json` | Sensitive headers and form field patterns |
7+
| File | Purpose |
8+
| ---------------- | ------------------------------------------------ |
9+
| `pii.json` | PII detection patterns (MAC, IP, email, etc.) |
10+
| `sensitive.json` | Sensitive headers and form field patterns |
1111
| `allowlist.json` | Patterns for recognizing already-redacted values |
12-
| `capture.json` | File extensions to filter during capture |
12+
| `capture.json` | File extensions to filter during capture |
1313

1414
## File Schemas
1515

@@ -32,6 +32,7 @@ Defines regex patterns for detecting PII in content.
3232
```
3333

3434
**Fields:**
35+
3536
- `regex`: Python regex pattern
3637
- `replacement_prefix`: Prefix for hashed replacement (e.g., `MAC` → `02:xx:xx:xx:xx:xx`)
3738
- `flags`: Optional list of regex flags (`IGNORECASE`, `MULTILINE`, `DOTALL`)
@@ -57,6 +58,7 @@ Defines sensitive HTTP headers and form fields to redact.
5758
```
5859

5960
**Fields:**
61+
6062
- `headers.full_redact`: Headers to completely redact
6163
- `headers.cookie_redact`: Headers where cookie values are redacted but names preserved
6264
- `fields.patterns`: Regex patterns matching sensitive form field names
@@ -84,6 +86,7 @@ Defines patterns for recognizing already-redacted values (to avoid double-flaggi
8486
```
8587

8688
**Fields:**
89+
8790
- `static_placeholders.values`: Exact values produced when `salt=None`
8891
- `format_preserving_patterns`: Regex patterns for RFC-reserved ranges
8992
- `hash_prefixes.values`: Prefixes for non-format-preserving hashes (`PREFIX_xxxxxxxx`)
@@ -104,6 +107,7 @@ Defines file extensions to filter during HAR capture.
104107
```
105108

106109
**Fields:**
110+
107111
- Categories can be selectively included via CLI flags (`--include-fonts`, etc.)
108112

109113
## Custom Patterns
@@ -177,8 +181,8 @@ sanitize_har(har_data, custom_patterns="your_project/patterns/modem.json")
177181
To add patterns to the core library:
178182

179183
1. Patterns should be **universally applicable** (not domain-specific)
180-
2. Include a clear `description` for each pattern
181-
3. Test patterns don't cause false positives on common data
182-
4. Submit a PR with examples of what the pattern matches
184+
1. Include a clear `description` for each pattern
185+
1. Test patterns don't cause false positives on common data
186+
1. Submit a PR with examples of what the pattern matches
183187

184188
For vendor or domain-specific patterns, maintain them in your own project and pass via `custom_patterns`.
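
The precedence and skip rules this commit describes (custom patterns first, already-redacted values left alone) can be sketched as below. This is an illustration under stated assumptions, not har-capture's implementation: `hashed`, `apply_group`, `sanitize_text`, the `REDACTED` token shape, the 8-character SHA-256 digest, and the `account_id` regex are all hypothetical. The pattern dicts here are the inner mapping only, without the outer `"patterns"` key.

```python
import hashlib
import re

# Assumed shape of an already-redacted token: PREFIX[_PREFIX...]_<8 hex chars>.
REDACTED = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*_[0-9a-f]{8}")


def hashed(prefix: str, value: str, salt: str) -> str:
    """Return PREFIX_<digest>; equal inputs give equal outputs (correlation)."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"{prefix}_{digest}"


def apply_group(text: str, patterns: dict, salt: str) -> str:
    """Apply each pattern token by token, skipping already-redacted tokens."""
    for spec in patterns.values():
        regex = re.compile(spec["regex"])
        prefix = spec["replacement_prefix"]
        pieces = []
        for token in re.split(r"(\s+)", text):
            if REDACTED.fullmatch(token):
                pieces.append(token)  # leave earlier redactions untouched
            else:
                pieces.append(
                    regex.sub(lambda m: hashed(prefix, m.group(0), salt), token)
                )
        text = "".join(pieces)
    return text


def sanitize_text(text: str, custom: dict, generic: dict, salt: str = "demo-salt") -> str:
    """Custom patterns run first so generic ones cannot override them."""
    return apply_group(apply_group(text, custom, salt), generic, salt)


custom = {"modem_serial": {"regex": r"SN[0-9]{10}", "replacement_prefix": "MODEM_SN"}}
generic = {"account_id": {"regex": r"[0-9]{8,}", "replacement_prefix": "ACCT"}}

once = sanitize_text("serial SN0123456789 account 12345678", custom, generic)
twice = sanitize_text(once, custom, generic)  # second pass changes nothing
```

Without the custom-first ordering, the generic digit pattern would consume the digits inside the serial number before the `MODEM_SN` pattern ever saw it; without the skip check, a digest that happened to be all digits would be hashed a second time.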
