observability: implement redaction filter (default strict) #107

Merged
AbdelStark merged 1 commit into main from observability/redaction-filter
May 20, 2026
@AbdelStark

Owner

Problem

RFC-0013 §3.5 and docs/spec/06-security.md require a single redaction chokepoint between callers and the log sink: DNA strings ≥ 20 bp, deny-listed field names, and out-of-allowlist payloads must never leak. INV-OBS-4 ("DNA strings ≥ 20 bp never appear in any log record") is non-negotiable.

Solution

  • geno_lewm/_redaction.py — four-rule filter:

    1. Per-event allowlist (EventSpec.allowed_keys): keys not listed are soft-dropped (registry drift, not a bypass — does NOT raise even in strict mode).
    2. Type allowlist: None | bool | int | float | str and shallow containers thereof. bytes / sets / deep nesting / tensors all drop; raise in strict mode.
    3. DNA pattern: ^[ACGTNacgtn]{20,}$ matched at any depth → drop; raise in strict mode.
    4. Deny list (vcf_content, genotype, sample_id, user_email, email, phone, address, dob, birthdate) — at any depth → drop; raise in strict mode.

    Strict mode is on by default (GENO_LEWM_REDACTION_STRICT=0 disables). All raises go through geno_lewm.errors.InvariantViolation, which the linter from #22 (errors: AST linter raise_geno_lewm_error and registered_error_code) confirms.

  • geno_lewm/observability.py — extends EventSpec with allowed_keys: frozenset[str]. Populated for every v0.1 event with realistic key sets. Tightening is MAJOR; adding is MINOR.

  • The filter is wired into GenoLeWMLogger._log so every record passes through the chokepoint before JSON serialization. Unknown events fall back to an empty allowlist — every payload key is soft-dropped, no leak possible.

  • RedactionStats (counter object): dropped_keys / dropped_denied / dropped_dna / dropped_type. The metric geno_lewm.observability.redacted_keys will be exported by #25 (observability: implement metrics registry + Prometheus textfile exporter); today the counter is observable via redaction_stats().
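
For readers without the diff open, the four rules can be sketched roughly as follows. This is an illustration, not the contents of geno_lewm/_redaction.py: the function names `violation` and `redact`, the local `InvariantViolation` stand-in, the rule ordering (allowlist first), and the reading of "shallow" as one container level are all assumptions.

```python
import copy
import re
from dataclasses import dataclass

DNA_RE = re.compile(r"^[ACGTNacgtn]{20,}$")
DENY = frozenset({
    "vcf_content", "genotype", "sample_id", "user_email", "email",
    "phone", "address", "dob", "birthdate",
})
SCALARS = (type(None), bool, int, float, str)


class InvariantViolation(Exception):
    """Local stand-in for geno_lewm.errors.InvariantViolation."""


@dataclass
class RedactionStats:
    dropped_keys: int = 0
    dropped_denied: int = 0
    dropped_dna: int = 0
    dropped_type: int = 0


def violation(value, depth=0):
    """Return None if value may be logged, else 'type', 'dna' or 'denied'."""
    if isinstance(value, str):
        return "dna" if DNA_RE.match(value) else None
    if isinstance(value, SCALARS):
        return None
    if depth >= 1 or not isinstance(value, (list, dict)):
        return "type"  # bytes, sets, tensors, or a container inside a container
    if isinstance(value, dict):
        if any(not isinstance(k, str) for k in value):
            return "type"
        if any(k in DENY for k in value):
            return "denied"
        value = list(value.values())
    for item in value:
        reason = violation(item, depth + 1)
        if reason:
            return reason
    return None


def redact(data, allowed_keys, *, strict=True, stats=None):
    """Return a filtered copy of data; raise in strict mode on rules 2 to 4."""
    stats = stats if stats is not None else RedactionStats()
    clean = {}
    for key, value in data.items():
        if key not in allowed_keys:
            stats.dropped_keys += 1  # rule 1: soft-drop, never raises
            continue
        reason = "denied" if key in DENY else violation(value)
        if reason is None:
            clean[key] = copy.deepcopy(value)  # defensive copy for the caller
            continue
        counter = f"dropped_{reason}"
        setattr(stats, counter, getattr(stats, counter) + 1)
        if strict:
            raise InvariantViolation(f"redaction rule {reason!r} hit on key {key!r}")
    return clean
```

In this sketch a key outside the allowlist is dropped before any other rule is evaluated, so it never raises; whether the real filter checks the deny list before or after the allowlist is not stated in this description.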

Validation

$ python -m pytest tests/ -q
205 passed in 0.62s

$ python -m tools.lint.check_error_codes
$ echo $?
0

Tests:

  • tests/unit/test_redaction.py (16 cases): strict-mode raises on every rule including nested DNA / deny-listed keys; permissive drops + counts; type allowlist (scalars / lists / shallow dicts); defensive copy; per-event allowlist round-trip for every registered event.
  • tests/property/test_redaction.py (2 tests, 12k payloads total): permissive run over 10k random payloads — zero leaks (the acceptance criterion); strict-mode run over 2k — either raises cleanly or zero leaks.
  • Existing observability tests updated where they used keys outside the new allowlists (only one record-shape test had to change: cfg → config_path).
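
The "zero leaks" criterion needs a leak oracle that is independent of the filter under test. One way such an oracle and a random payload generator could look is sketched below, using only the standard library. The names `find_leaks`, `random_payload` and `count_leaky_outputs`, the key pool, and the value distribution are hypothetical; the actual tests in tests/property/test_redaction.py may be built quite differently.

```python
import random
import re

DNA_RE = re.compile(r"^[ACGTNacgtn]{20,}$")
DENY = frozenset({
    "vcf_content", "genotype", "sample_id", "user_email", "email",
    "phone", "address", "dob", "birthdate",
})
KEYS = ["step", "loss", "config_path", "meta", "email", "genotype"]


def find_leaks(value, path="$"):
    """Yield the path of every DNA string or deny-listed key in a record."""
    if isinstance(value, str):
        if DNA_RE.match(value):
            yield path
    elif isinstance(value, dict):
        for key, item in value.items():
            if key in DENY:
                yield f"{path}.{key}"
            yield from find_leaks(item, f"{path}.{key}")
    elif isinstance(value, (list, tuple)):
        for i, item in enumerate(value):
            yield from find_leaks(item, f"{path}[{i}]")


def random_value(rng, depth=0):
    kind = rng.choice(["int", "text", "dna", "list", "dict"])
    if kind == "int":
        return rng.randint(0, 1000)
    if kind == "text":
        return "run-" + str(rng.randint(0, 99))
    if kind == "dna":
        return "".join(rng.choice("ACGTN") for _ in range(rng.randint(20, 40)))
    if depth >= 2:
        return None  # cap the nesting of generated containers
    if kind == "list":
        return [random_value(rng, depth + 1) for _ in range(rng.randint(0, 3))]
    return {rng.choice(KEYS): random_value(rng, depth + 1)
            for _ in range(rng.randint(0, 3))}


def random_payload(rng):
    return {rng.choice(KEYS): random_value(rng) for _ in range(rng.randint(1, 4))}


def count_leaky_outputs(filter_fn, n=1000, seed=0, clean_exc=()):
    """Count filtered payloads that still leak; clean_exc raises count as clean."""
    rng = random.Random(seed)
    leaky = 0
    for _ in range(n):
        try:
            out = filter_fn(random_payload(rng))
        except clean_exc:
            continue  # strict mode: a clean raise is an acceptable outcome
        if list(find_leaks(out)):
            leaky += 1
    return leaky
```

A permissive-mode run would pass the filter as `filter_fn` and assert that the count is 0; a strict-mode run would additionally pass the expected exception type as `clean_exc`, matching the "either raises cleanly or zero leaks" criterion.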

Caveats / out of scope

Closes #24

Add geno_lewm/_redaction.py as the single chokepoint between callers
and the JSONL sink (RFC-0013 §3.5, docs/spec/05-observability.md,
docs/spec/06-security.md). Four rules:

- Per-event allowlist: keys not in EventSpec.allowed_keys are
  soft-dropped and counted (registry drift, not a bypass — does NOT
  raise even in strict mode).
- Type allowlist: only None/bool/int/float/str/list-of-scalars/
  shallow-dict-of-scalars allowed. bytes/tensors/sets/deep nesting
  drop (and raise in strict mode).
- DNA pattern: ^[ACGTNacgtn]{20,}$ matched at any depth drops
  (raise in strict mode).
- Deny-list: vcf_content / genotype / sample_id / user_email /
  email / phone / address / dob / birthdate — at any depth — drop
  (raise in strict mode).

Strict mode is on by default; GENO_LEWM_REDACTION_STRICT=0 disables.
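
A minimal sketch of how the toggle might be read, assuming that only the literal value "0" disables strict mode and that any other value (or an unset variable) leaves it on. The helper name `strict_mode_enabled` is hypothetical, and how the real module treats values such as "false" is not stated here.

```python
import os


def strict_mode_enabled(environ=os.environ):
    """Strict unless GENO_LEWM_REDACTION_STRICT is explicitly set to "0"."""
    # Assumption: unset or any non-"0" value keeps the strict default.
    return environ.get("GENO_LEWM_REDACTION_STRICT", "1").strip() != "0"
```

Defaulting to strict on anything unrecognized keeps a typo in the environment from silently weakening the filter.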

Extend EventSpec with allowed_keys (frozenset[str]). Populate sensible
defaults for every v0.1 event. Tightening a set is MAJOR; adding a key
is MINOR.

Wire the filter into GenoLeWMLogger._log so every record passes through
the filter before JSON serialization. Unknown events fall back to an
empty allowlist (everything in data soft-drops, no leak possible).
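
The wiring described above could look roughly like the sketch below. It models only the allowlist lookup and the empty-allowlist fallback: `SketchLogger`, `REGISTRY`, `allowlist_filter`, the `seed` key, and the record shape are hypothetical, and the stand-in filter applies rule 1 only.

```python
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSpec:
    name: str
    allowed_keys: frozenset[str] = frozenset()


# Hypothetical registry entry; "seed" is an invented key for illustration.
REGISTRY = {
    "training.run.start": EventSpec(
        "training.run.start", frozenset({"config_path", "seed"})
    ),
}


def allowlist_filter(data, allowed_keys):
    """Stand-in for the redaction filter: applies the per-event allowlist only."""
    return {k: v for k, v in data.items() if k in allowed_keys}


class SketchLogger:
    def __init__(self, sink, redact=allowlist_filter):
        self._sink = sink
        self._redact = redact

    def _log(self, event, data):
        spec = REGISTRY.get(event)
        # Unknown event: empty allowlist, so every payload key is dropped.
        allowed = spec.allowed_keys if spec is not None else frozenset()
        record = {"event": event, "data": self._redact(data, allowed)}
        # Serialization happens only after the payload has been filtered.
        self._sink.write(json.dumps(record, sort_keys=True) + "\n")


sink = io.StringIO()
log = SketchLogger(sink)
log._log("training.run.start", {"config_path": "cfg.yaml", "cfg": {"lr": 0.1}})
log._log("unknown.event", {"anything": 1})
print(sink.getvalue(), end="")
```

Because the filter runs inside `_log`, callers have no path to the sink that bypasses it, which is the point of a single chokepoint.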

RedactionStats tracks dropped_keys / dropped_denied / dropped_dna /
dropped_type. The metric geno_lewm.observability.redacted_keys
(RFC-0013 §4) will be exported by #25 — today the counter is
observable via redaction_stats().

Tests:
- tests/unit/test_redaction.py (16 cases): strict-mode raises on each
  rule violation including nested DNA & deny-listed keys; permissive
  mode drops + counts; type allowlist (scalars / lists / shallow
  dicts); defensive copy; per-event allowlist round-trips for every
  registered event.
- tests/property/test_redaction.py (2 tests, 12k payloads total):
  permissive run over 10k random payloads — zero leaks; strict-mode
  run over 2k — either raises or zero leaks.

Existing observability test updated: training.run.start payload now
uses config_path (in the allowlist) instead of the bare ``cfg`` key.

Closes #24