Commit 4caace8

kwschulz and claude authored
fix(sanitization): custom_patterns propagation through sanitize_entry (v0.7.1) (#40)
* fix(sanitization): custom_patterns now propagates through sanitize_entry

  The 0.7.0 ContextVar-scoped override was entered only by sanitize_post_data and sanitize_html. Three detection sites in _sanitize_request / _sanitize_response that run before either of those silently ignored custom_patterns when callers used sanitize_entry / sanitize_har / sanitize_har_file:

  - header-value matching (sanitize_header_value)
  - structured queryString params
  - URL query params (_sanitize_url_query_params)

  Fix: enter both ContextVar scopes at sanitize_entry, so every detection site within an entry sees the same extension set. Adds a parallel _HeaderSets frozen dataclass + _HEADER_SETS_CTX ContextVar + _header_sets_scope context manager + _resolve_header_sets resolver backed by a bounded LRU cache, mirroring the _FieldPatternSet infrastructure shipped in 0.7.0. sanitize_header_value now reads from the ContextVar instead of module-global sets, so custom headers.full_redact / headers.cookie_redact entries take effect end-to-end via the top-level entry points.

  Tests: +12. Header-sets internals (compile / resolve / cache / scope-restore on exception) + propagation through sanitize_entry and sanitize_har for each of the three gaps + scope isolation between consecutive entries.

  1845 tests pass, 85.65% coverage. Held locally for the next release cycle (0.7.1). Not pushed.

  Closes the gap the reviewer flagged after 0.7.0 shipped: once we fixed the sanitize_html instance of "module-global predicate bypassed by custom_patterns," we should have swept the same class. sanitize_header_value + queryString + URL query params were structurally identical and got deferred as "nice-to-have" when they should have been included.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump version to 0.7.1

  Cuts [Unreleased] to [0.7.1] - 2026-04-24 and adds comparison links, following the release flow documented in CLAUDE.md.

  Single release PR (feature fix + version bump). Release notes summary: custom_patterns now propagates through sanitize_entry to all detection sites (headers, cookies, queryString, URL-query params, bodies, inline scripts), closing a security-adjacent gap in 0.7.0 where top-level entry-point consumers were silently getting unredacted data.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ken Schulz <kwschulz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 74f3c77 commit 4caace8

6 files changed

Lines changed: 336 additions & 22 deletions

CHANGELOG.md

Lines changed: 8 additions & 1 deletion
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.7.1] - 2026-04-24
+
+### Fixed
+
+- **`custom_patterns` now propagates through `sanitize_entry` to all detection sites** — In 0.7.0 the `ContextVar`-scoped override was entered only by `sanitize_post_data` and `sanitize_html`, so three detection sites in `_sanitize_request` / `_sanitize_response` that run before either of those — header-value matching (`sanitize_header_value`), structured `queryString` params, and URL query params (`_sanitize_url_query_params`) — silently ignored `custom_patterns` when callers used the top-level entry points (`sanitize_entry`, `sanitize_har`, `sanitize_har_file`). **Security-adjacent**: consumers passing `custom_patterns={"headers": {"full_redact": ["x-modem-auth"]}}` to `sanitize_har_file` were getting unredacted auth headers in their "sanitized" HAR. Fixed by entering both scopes at `sanitize_entry`, so every detection site within an entry sees the same extension set. Adds a parallel `_HeaderSets` dataclass + `_HEADER_SETS_CTX` ContextVar + `_header_sets_scope` / `_resolve_header_sets` resolver + cache so `sanitize_header_value` picks up custom `headers.full_redact` / `headers.cookie_redact` entries the same way field detection picks up custom `fields.auto_redact_patterns`. Module-global state still never mutated.
+
 ## [0.7.0] - 2026-04-24

 ### Added
@@ -489,4 +495,5 @@ har-capture sanitize input.har --patterns custom-allowlist.json
 [0.6.0]: https://github.com/solentlabs/har-capture/compare/v0.5.1...v0.6.0
 [0.6.1]: https://github.com/solentlabs/har-capture/compare/v0.6.0...v0.6.1
 [0.7.0]: https://github.com/solentlabs/har-capture/compare/v0.6.1...v0.7.0
-[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.7.0...HEAD
+[0.7.1]: https://github.com/solentlabs/har-capture/compare/v0.7.0...v0.7.1
+[unreleased]: https://github.com/solentlabs/har-capture/compare/v0.7.1...HEAD

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "har-capture"
-version = "0.7.0"
+version = "0.7.1"
 description = "HAR capture and PII sanitization library for network traffic analysis"
 readme = "README.md"
 license = "MIT"

src/har_capture/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@

 from __future__ import annotations

-__version__ = "0.7.0"
+__version__ = "0.7.1"

 # Re-export public API for convenience
 from har_capture.sanitization import (

src/har_capture/sanitization/har.py

Lines changed: 108 additions & 14 deletions
@@ -285,6 +285,85 @@ def _resolve_field_patterns(
     return resolved


+# --- Header-set per-call override --------------------------------------------
+#
+# Parallel to _FieldPatternSet but for HTTP header names, which are matched by
+# lowercase exact-match against two sets (full_redact, cookie_redact) rather
+# than compiled regex. Same ContextVar / resolver / cache shape so the two
+# subsystems evolve together.
+
+
+@dataclass(frozen=True)
+class _HeaderSets:
+    """Resolved header-name sets used for one sanitization call.
+
+    Both sets are frozen and case-normalized to lowercase so callers can do
+    ``name.lower() in sets.full_redact`` without repeated normalization.
+    """
+
+    full_redact: frozenset[str]
+    cookie_redact: frozenset[str]
+
+
+def _compile_header_sets(sensitive_data: dict[str, Any]) -> _HeaderSets:
+    """Build (full_redact, cookie_redact) frozensets from a loaded sensitive-patterns dict."""
+    headers = sensitive_data.get("headers", {})
+    full = frozenset(h.lower() for h in headers.get("full_redact", []))
+    cookie = frozenset(h.lower() for h in headers.get("cookie_redact", []))
+    return _HeaderSets(full, cookie)
+
+
+_DEFAULT_HEADER_SETS = _HeaderSets(frozenset(_FULL_REDACT_HEADERS), frozenset(_COOKIE_REDACT_HEADERS))
+
+_HEADER_SETS_CTX: contextvars.ContextVar[_HeaderSets] = contextvars.ContextVar(
+    "har_capture_header_sets", default=_DEFAULT_HEADER_SETS
+)
+
+_CUSTOM_HEADER_SETS_CACHE: OrderedDict[str, _HeaderSets] = OrderedDict()
+
+
+def _resolve_header_sets(
+    custom_patterns: str | dict[str, Any] | None,
+) -> _HeaderSets:
+    """Resolve the (full_redact, cookie_redact) header sets for this call.
+
+    ``custom_patterns=None`` returns the shared default set (zero-cost).
+    Otherwise the custom patterns are merged with built-ins via the loader
+    and the compiled result is cached per canonical key.
+    """
+    if custom_patterns is None:
+        return _DEFAULT_HEADER_SETS
+
+    key = _custom_patterns_cache_key(custom_patterns)
+    if key is not None:
+        cached = _CUSTOM_HEADER_SETS_CACHE.get(key)
+        if cached is not None:
+            _CUSTOM_HEADER_SETS_CACHE.move_to_end(key)
+            return cached
+
+    resolved = _compile_header_sets(load_sensitive_patterns(custom_patterns))
+
+    if key is not None:
+        _CUSTOM_HEADER_SETS_CACHE[key] = resolved
+        while len(_CUSTOM_HEADER_SETS_CACHE) > _CUSTOM_FIELD_RE_CACHE_MAX:
+            _CUSTOM_HEADER_SETS_CACHE.popitem(last=False)
+    return resolved
+
+
+@contextmanager
+def _header_sets_scope(custom_patterns: str | dict[str, Any] | None) -> Iterator[None]:
+    """Apply ``custom_patterns`` as the active header-set for this scope.
+
+    Parallel to ``_field_patterns_scope``; public entry points that want
+    header-name extensions to take effect for the call must enter this scope.
+    """
+    token = _HEADER_SETS_CTX.set(_resolve_header_sets(custom_patterns))
+    try:
+        yield
+    finally:
+        _HEADER_SETS_CTX.reset(token)
+
+
 # Redaction placeholder - single source of truth
 REDACTED = "[REDACTED]"

@@ -369,6 +448,11 @@ def sanitize_header_value(
 ) -> str:
     """Sanitize a header value if it's sensitive.

+    Reads the active header-name set from ``_header_sets_scope`` when one is
+    entered (top-level entry points like ``sanitize_entry`` set it from
+    ``custom_patterns``); otherwise uses the module-wide defaults from
+    ``sensitive.json``.
+
     Args:
         name: Header name
         value: Header value
@@ -385,11 +469,12 @@ def sanitize_header_value(
         'text/html'
     """
     name_lower = name.lower()
+    sets = _HEADER_SETS_CTX.get()

-    if name_lower in _FULL_REDACT_HEADERS:
+    if name_lower in sets.full_redact:
         return _redact_value(value, hasher, "AUTH", collector)

-    if name_lower in _COOKIE_REDACT_HEADERS:
+    if name_lower in sets.cookie_redact:
         # Detect cookie attribute metadata (e.g., "HttpOnly: true, Secure: true")
         # that was incorrectly serialized as the header value
         if is_cookie_attribute_metadata(value):
@@ -503,7 +588,10 @@ def sanitize_post_data(

     result = copy.deepcopy(post_data)

-    with _field_patterns_scope(custom_patterns):
+    with (
+        _field_patterns_scope(custom_patterns),
+        _header_sets_scope(custom_patterns),
+    ):
         # Sanitize params array
         if "params" in result and isinstance(result["params"], list):
             for param in result["params"]:
@@ -1101,13 +1189,15 @@ def sanitize_entry(
         salt: Salt for hashed redaction (ignored if collector provided)
         custom_patterns: Optional additive custom patterns (file path or dict
             matching the ``load_sensitive_patterns`` schema). Extends the
-            built-in ``pii.patterns``, allowlist, and sensitive-field sets
-            (``fields.auto_redact_patterns`` / ``fields.flag_patterns``) for
-            this call only. Propagates end-to-end: the scope is entered by
-            ``sanitize_post_data`` and ``sanitize_html`` downstream, so
-            field-name extensions reach form params, JSON/XML bodies, and
-            inline-script scanners. Module-global state is never mutated.
-            See ``sanitize_post_data`` for the full contract.
+            built-in ``pii.patterns``, allowlist, sensitive-field sets
+            (``fields.auto_redact_patterns`` / ``fields.flag_patterns``), and
+            header sets (``headers.full_redact`` / ``headers.cookie_redact``)
+            for this call only. Both the field-pattern and header-set scopes
+            are entered at this level, so every downstream detection site —
+            request/response headers, cookies, ``queryString`` params, URL
+            query params, POST bodies (form / JSON / XML), and inline-script
+            scanners — sees the extensions. Module-global state is never
+            mutated. See ``sanitize_post_data`` for the full contract.
         collector: Optional collector for tracking redactions
         heuristics: Heuristic mode for pipe-delimited value detection
         _skip_copy: If True, skip deep copy (caller already copied). Internal use only.
@@ -1125,11 +1215,15 @@ def sanitize_entry(
         # No salt and no collector - create collector with no hashing
         collector = RedactionCollector(hasher=Hasher.create(None))

-    if "request" in result:
-        _sanitize_request(result["request"], collector.hasher, collector, custom_patterns, heuristics)
+    with (
+        _field_patterns_scope(custom_patterns),
+        _header_sets_scope(custom_patterns),
+    ):
+        if "request" in result:
+            _sanitize_request(result["request"], collector.hasher, collector, custom_patterns, heuristics)

-    if "response" in result:
-        _sanitize_response(result["response"], collector, custom_patterns, heuristics)
+        if "response" in result:
+            _sanitize_response(result["response"], collector, custom_patterns, heuristics)

     return result


src/har_capture/sanitization/html.py

Lines changed: 10 additions & 5 deletions
@@ -275,15 +275,19 @@ def sanitize_html(
     hasher = Hasher.create(salt)
     collector = RedactionCollector(hasher=hasher)

-    # Enter the field-pattern scope so is_sensitive_field() calls inside the
-    # HTML scanner honor custom_patterns. Lazy import: har imports html at
-    # module level, so we'd cycle if this were top-level.
+    # Enter the field-pattern + header-set scopes so is_sensitive_field() and
+    # sanitize_header_value() calls inside the HTML scanner honor custom_patterns.
+    # Lazy import: har imports html at module level, so we'd cycle if this were
+    # top-level.
     from har_capture.sanitization.har import (
         _FIELD_PATTERNS_CTX,
+        _HEADER_SETS_CTX,
         _resolve_field_patterns,
+        _resolve_header_sets,
     )

-    _scope_token = _FIELD_PATTERNS_CTX.set(_resolve_field_patterns(custom_patterns))
+    _field_token = _FIELD_PATTERNS_CTX.set(_resolve_field_patterns(custom_patterns))
+    _header_token = _HEADER_SETS_CTX.set(_resolve_header_sets(custom_patterns))
     try:
         return _sanitize_html_impl(
             html,
@@ -293,7 +297,8 @@ def sanitize_html(
             heuristics=heuristics,
         )
     finally:
-        _FIELD_PATTERNS_CTX.reset(_scope_token)
+        _HEADER_SETS_CTX.reset(_header_token)
+        _FIELD_PATTERNS_CTX.reset(_field_token)


 def _sanitize_html_impl(
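The `sanitize_html` change sets two `ContextVar` tokens by hand and resets them in reverse order inside `finally`. A standalone sketch of that pattern follows; the names `FIELDS`, `HEADERS`, `run_with_overrides`, and `failing_work` are hypothetical, and the string values stand in for the resolved pattern objects. It is meant only to show that both variables return to their defaults whether the wrapped call returns or raises.

```python
import contextvars

# Hypothetical stand-ins for the two context variables in the diff.
FIELDS = contextvars.ContextVar("fields", default="default-fields")
HEADERS = contextvars.ContextVar("headers", default="default-headers")


def run_with_overrides(fields, headers, work):
    """Run ``work()`` with both overrides active, then restore both."""
    field_token = FIELDS.set(fields)
    header_token = HEADERS.set(headers)
    try:
        return work()
    finally:
        # Reset in reverse order of set(), mirroring nested scopes.
        HEADERS.reset(header_token)
        FIELDS.reset(field_token)


def failing_work():
    raise ValueError("boom")


seen = run_with_overrides("f", "h", lambda: (FIELDS.get(), HEADERS.get()))
after_success = (FIELDS.get(), HEADERS.get())

try:
    run_with_overrides("f", "h", failing_work)
except ValueError:
    pass
after_failure = (FIELDS.get(), HEADERS.get())
```

Because the two variables are independent, the reset order does not change the outcome in this sketch; reversing it simply keeps the manual form equivalent to nesting two `with` blocks.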
