feat(audit): defensive redaction pass on log + manifest writes (#148)

eFAILution · web-flow · commit 6dc1ba364282 · 2026-05-13T10:27:26.000-04:00
Closes hardening item #5 from "Secret Handling & Credential Surface Hardening" in docs/developer/SDK-ROADMAP.md. Adds a recursive walker that masks every string in the audit trail at serialization time, plus fixes a pre-existing bug where masking record.msg corrupted %s format strings whose placeholders matched the token: pattern.

argus/audit/secrets.py:
- New mask_secrets_in_obj(obj) walker. Recurses through dicts, lists, tuples; applies mask_secrets to every string value; leaves keys and non-string scalars untouched; returns a new structure (does NOT mutate the input).

argus/audit/logger.py:
- JsonLogFormatter.format: stop mutating record.msg. Mask the rendered record.getMessage() instead. This catches secrets passed as printf-style args (logger.info("token: %s", real_token)), which the prior approach missed because record.msg held the format string, not the rendered output. Then walk the assembled JSON entry through mask_secrets_in_obj so extra fields a contributor might add to the formatter also get masked.
- ColoredConsoleFormatter.format: same fix. Mask the rendered message, not record.msg. Without this fix, "token: %s" matched the token=/token: pattern and was rewritten to "token: <REDACTED>", then record.getMessage() raised TypeError trying to substitute args into a format string with no %s placeholder. Bug had been silently masked because no test exercised the printf path.

argus/audit/manifest.py:
- AuditManifest.save: walk asdict(self) through mask_secrets_in_obj before json.dumps. Defense-in-depth: today's manifest schema doesn't include credential fields, but if a future field captures a docker_cmd, env dict, or credential-shaped argv it gets masked before hitting argus-audit.json.

Design note (vs. roadmap text):
The roadmap entry suggested reusing core/redact.redact_high_risk_patterns (the vendor-prefix-only set used by Finding.__post_init__). The existing audit/secrets.mask_secrets already covers that surface plus broader patterns appropriate for log lines (token=, password=, Bearer, URL creds, sk-keys). Extending audit/secrets keeps the redactor co-located with its callers: easier to reason about and no cross-module hop at hot-path serialization time.

Test coverage (19 new):
- argus/tests/audit/test_secrets.py::TestMaskSecretsInObj (10 tests): root scalar, dict value, nested dict, list, tuple, scalar passthrough, no-mutation guard, deeply nested mix, dict-key preservation, unknown type passthrough.
- argus/tests/audit/test_logger.py::TestJsonLogSecretLeakProtection (4 tests): format-string secret, record.args secret (the regression fix), extra-field secret, non-secret strings preserved unchanged.
- argus/tests/audit/test_manifest.py::TestManifestSecretLeakProtection (5 tests): phase error, artifact path, nested dict at depth 4, input not mutated after save, clean-manifest false-positive guard.

.ai/architecture.yaml: new audit/ entry in both SDK structure blocks documenting the redaction posture, the walker, and the rendered-message masking rationale.

Full suite: 3126 passed (+19 new), 2 skipped.

Co-authored-by: eFAILution <eFAILution@users.noreply.github.com>
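
The format-string failure described in the commit message can be reproduced in isolation. The sketch below is a minimal illustration, not the Argus code: `mask`, `make_record`, `render_old`, and `render_new` are hypothetical names, and the single regex is the one quoted in the JsonLogFormatter comment, standing in for the broader `_PATTERNS` set (assumed here to rewrite matches to `token: <REDACTED>`).

```python
import logging
import re

# Assumption: one pattern only, copied from the comment in
# JsonLogFormatter.format. The real mask_secrets applies more patterns.
_TOKEN = re.compile(r"token[=:]\s*[^\s]+")


def mask(text: str) -> str:
    """Hypothetical stand-in for argus.audit.secrets.mask_secrets."""
    return _TOKEN.sub("token: <REDACTED>", text)


def make_record(msg: str, *args) -> logging.LogRecord:
    """Build a LogRecord the way logger.info(msg, *args) would."""
    return logging.LogRecord(
        name="argus.demo", level=logging.INFO, pathname="demo.py",
        lineno=1, msg=msg, args=args, exc_info=None,
    )


def render_old(record: logging.LogRecord) -> str:
    """Pre-fix order: mask the format string, then render it."""
    record.msg = mask(str(record.msg))
    # "token: %s" has already become "token: <REDACTED>", so the
    # %-substitution inside getMessage() has no placeholder left.
    return record.getMessage()


def render_new(record: logging.LogRecord) -> str:
    """Post-fix order: render first, then mask the rendered text."""
    return mask(record.getMessage())


if __name__ == "__main__":
    try:
        render_old(make_record("token: %s", "ghp_example"))
    except TypeError as exc:
        print("old order fails:", exc)
    print("new order:", render_new(make_record("token: %s", "ghp_example")))
```

Masking after rendering also leaves the record untouched for any other handler attached to the same logger, which sees the original `msg` and `args`.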
diff --git a/.ai/architecture.yaml b/.ai/architecture.yaml
@@ -49,6 +49,7 @@ components:
       "linters/": "Linter modules implementing Scanner protocol (LINTER_REGISTRY auto-merges into SCANNER_REGISTRY)"
       "reporters/": "Output reporters (terminal, markdown, sarif, json, github, gitlab, junit). Discovered via the ``argus.reporters`` Python entry-point group (built-ins declared in pyproject.toml; third-party packages register additional formats without forking — see docs/contributing-reporters.md and ADR-023)."
       "preflight/": "CI preflight: provider detection, living issue reporting (GitHub/GitLab), network deps, scanner tool-readiness checks (tool_check.py)"
+      "audit/": "Structured audit trail for every scan run. logger.py emits JSONL log records (one per line) into ``argus-results/.../argus.log``; manifest.py writes the per-run AuditManifest summary into ``argus-audit.json``. secrets.py provides mask_secrets (regex masking for token=, password=, Bearer, URL-creds, GitHub PATs, AWS access keys, sk-keys) and the mask_secrets_in_obj recursive walker. Both write paths run mask_secrets_in_obj at serialization time as defense-in-depth — if a future contributor accidentally captures a docker_cmd / env dict / credential-shaped argv into a manifest field or log entry, the redaction pass catches it before the file lands. Both logger formatters mask the rendered ``record.getMessage()`` rather than ``record.msg`` to avoid corrupting %s format strings whose placeholders match secret-shaped patterns."
 
   - name: argus-linters
     location: "argus/linters/"
diff --git a/argus/audit/logger.py b/argus/audit/logger.py
@@ -55,11 +55,12 @@ def __init__(self, use_color: bool = True):
         self._use_color = use_color
 
     def format(self, record: logging.LogRecord) -> str:
-        record.msg = mask_secrets(str(record.msg))
-
+        # Mask the rendered output, NOT record.msg — see the note in
+        # JsonLogFormatter.format() for why masking the format string
+        # breaks ``%s`` substitution.
         ts = self.formatTime(record, "%H:%M:%S")
         name = record.name.replace("argus.", "")
-        msg = record.getMessage()
+        msg = mask_secrets(record.getMessage())
 
         if not self._use_color:
             return f"{ts} {record.levelname:<8} {name} {msg}"
@@ -78,7 +79,13 @@ class JsonLogFormatter(logging.Formatter):
     """
 
     def format(self, record: logging.LogRecord) -> str:
-        record.msg = mask_secrets(str(record.msg))
+        # NOTE: do NOT mask ``record.msg`` directly. The msg is a
+        # printf-style format string; if its non-whitespace section
+        # matches a secret-masking pattern (e.g., ``"token: %s"`` matches
+        # ``r"token[=:]\s*[^\s]+"``), masking turns the format string
+        # into a literal with no placeholders, and the subsequent
+        # ``record.getMessage()`` raises TypeError on ``%s`` substitution.
+        # Mask the *rendered* message string below instead.
         entry: dict = {
             "timestamp": datetime.fromtimestamp(
                 record.created, tz=timezone.utc
@@ -87,13 +94,25 @@ def format(self, record: logging.LogRecord) -> str:
             "module": record.name,
             "function": record.funcName,
             "line": record.lineno,
-            "message": record.getMessage(),
+            # ``getMessage()`` re-renders the format string with
+            # ``record.args`` interpolated back in. If a caller passed
+            # a secret as a printf-style arg
+            # (``logger.info("token: %s", real_token)``), that secret
+            # never touched ``record.msg``, only ``record.args``, so
+            # masking ``record.msg`` alone would not catch it. We mask
+            # the *rendered* result to cover both paths.
+            "message": mask_secrets(record.getMessage()),
         }
         # Include extra fields if present (scanner name, phase, etc.)
         for key in ("scanner", "phase", "image", "duration_ms"):
             if hasattr(record, key):
                 entry[key] = getattr(record, key)
-        return json.dumps(entry, default=str)
+        # Defense-in-depth: walk the assembled entry and re-mask any
+        # string a contributor might add to the ``extra`` set above
+        # without remembering to mask first. ``mask_secrets_in_obj``
+        # is idempotent — already-masked values pass through unchanged.
+        from argus.audit.secrets import mask_secrets_in_obj
+        return json.dumps(mask_secrets_in_obj(entry), default=str)
 
 
 # ---------------------------------------------------------------------------
diff --git a/argus/audit/manifest.py b/argus/audit/manifest.py
@@ -75,12 +75,26 @@ class AuditManifest:
     artifacts: list[dict] = field(default_factory=list)
 
     def save(self, output_dir: str | Path) -> Path:
-        """Write the manifest to ``argus-audit.json``."""
+        """Write the manifest to ``argus-audit.json``.
+
+        Every string in the manifest passes through
+        ``mask_secrets_in_obj`` before serialization — defense-in-depth
+        so a future regression that captures a ``docker_cmd``,
+        environment dict, or credential-shaped argv into a manifest
+        field can't leak. Today the manifest schema doesn't include
+        such fields; the redaction pass is here so it stays that way
+        regardless of contributor vigilance.
+        """
+        from argus.audit.secrets import mask_secrets_in_obj
+
         dest = Path(output_dir)
         dest.mkdir(parents=True, exist_ok=True)
         filepath = dest / "argus-audit.json"
         filepath.write_text(
-            json.dumps(asdict(self), indent=2, default=str),
+            json.dumps(
+                mask_secrets_in_obj(asdict(self)),
+                indent=2, default=str,
+            ),
             encoding="utf-8",
         )
         return filepath
diff --git a/argus/audit/secrets.py b/argus/audit/secrets.py
@@ -55,3 +55,36 @@ def mask_secrets(message: str) -> str:
     for pattern, replacement in _PATTERNS:
         message = pattern.sub(replacement, message)
     return message
+
+
+def mask_secrets_in_obj(obj):
+    """Recursively apply ``mask_secrets`` to every string in ``obj``.
+
+    Walks dicts, lists, and tuples; leaves non-string scalars and
+    unknown types untouched. Returns a new structure (does NOT mutate
+    the input) so the original objects remain available to callers
+    that haven't fully migrated to the masked view.
+
+    The defense-in-depth use case: callers serialize structured data
+    (audit manifests, JSON log entries, anything else that could end
+    up on disk) through this walker before writing. Even if a future
+    contributor accidentally feeds a credential into a manifest field
+    or a logger.info(..., real_secret) call, the pattern set in
+    ``_PATTERNS`` catches the value before it reaches the filesystem.
+
+    Dict keys are not masked — a key that is itself a secret is an
+    extreme outlier in practice and key-masking would force every
+    JSON consumer to re-derive the key set. Values cover the realistic
+    leak surface.
+    """
+    if isinstance(obj, str):
+        return mask_secrets(obj)
+    if isinstance(obj, dict):
+        return {k: mask_secrets_in_obj(v) for k, v in obj.items()}
+    if isinstance(obj, list):
+        return [mask_secrets_in_obj(v) for v in obj]
+    if isinstance(obj, tuple):
+        return tuple(mask_secrets_in_obj(v) for v in obj)
+    # Scalars (int, float, bool, None) and unknown types pass through
+    # unchanged; the caller's json.dumps(default=str) handles encoding.
+    return obj
diff --git a/argus/tests/audit/test_logger.py b/argus/tests/audit/test_logger.py
@@ -229,3 +229,68 @@ def test_existing_logger_honors_later_verbose(self):
             h for h in logger.handlers if isinstance(h, logging.StreamHandler)
         )
         assert console_handler.level == logging.DEBUG
+
+
+class TestJsonLogSecretLeakProtection:
+    """End-to-end: planted secrets in log calls never reach the on-disk
+    argus.log file. Defense-in-depth — closes hardening item (5).
+    """
+
+    def _read_log_lines(self, output_dir: Path) -> list[str]:
+        log_path = output_dir / "argus.log"
+        return log_path.read_text().strip().split("\n")
+
+    def test_secret_in_format_string_masked(self, tmp_path):
+        """Caller embeds the secret directly in the format string."""
+        output_dir = tmp_path / "logs"
+        logger = get_logger("argus.test.leak.fmt", output_dir=output_dir)
+        logger.info("auth header: Bearer eyJhbGc.secret-payload.signature")
+        for h in logger.handlers:
+            h.flush()
+
+        for line in self._read_log_lines(output_dir):
+            assert "eyJhbGc.secret-payload.signature" not in line
+
+    def test_secret_in_record_args_masked(self, tmp_path):
+        """Caller passes the secret as a printf-style arg.
+
+        Pre-fix, the formatter masked ``record.msg`` (the format
+        string) but ``record.getMessage()`` re-rendered the message
+        with ``record.args`` interpolated back in — secret bypassed
+        the mask entirely. The mask now runs on the rendered output.
+        """
+        output_dir = tmp_path / "logs"
+        logger = get_logger("argus.test.leak.args", output_dir=output_dir)
+        secret_token = "ghp_abcdef1234567890abcdef1234567890"
+        logger.info("registry token: %s", secret_token)
+        for h in logger.handlers:
+            h.flush()
+
+        for line in self._read_log_lines(output_dir):
+            assert secret_token not in line
+
+    def test_secret_in_extra_field_masked(self, tmp_path):
+        """Caller passes the secret via the ``extra`` kwarg."""
+        output_dir = tmp_path / "logs"
+        logger = get_logger("argus.test.leak.extra", output_dir=output_dir)
+        logger.info(
+            "scan starting",
+            extra={"scanner": "scanner-with-AKIA1234567890ABCDEF-creds"},
+        )
+        for h in logger.handlers:
+            h.flush()
+
+        for line in self._read_log_lines(output_dir):
+            assert "AKIA1234567890ABCDEF" not in line
+
+    def test_non_secret_strings_pass_through_unchanged(self, tmp_path):
+        """Confirm we're not masking ordinary strings."""
+        output_dir = tmp_path / "logs"
+        logger = get_logger("argus.test.leak.clean", output_dir=output_dir)
+        logger.info("scanning %s with %d workers", "argus.yml", 4)
+        for h in logger.handlers:
+            h.flush()
+
+        line = self._read_log_lines(output_dir)[0]
+        assert "argus.yml" in line
+        assert "4 workers" in line
diff --git a/argus/tests/audit/test_manifest.py b/argus/tests/audit/test_manifest.py
@@ -190,3 +190,92 @@ def test_no_summary_leaves_empty(self, tmp_path):
         finalize_manifest(m, summary=None, output_dir=tmp_path)
         assert m.findings_summary == {}
         assert m.scanners_executed == []
+
+
+class TestManifestSecretLeakProtection:
+    """End-to-end: planted secrets in manifest fields never reach disk.
+
+    The manifest schema today doesn't include credential fields. This
+    test guards against a future regression that adds one (a
+    ``docker_cmd`` capture, an env dict, error messages from phases
+    that quote the failing command) — the redaction pass at save()
+    time catches it before the file lands.
+    """
+
+    def test_secret_in_phase_error_redacted(self, tmp_path):
+        m = AuditManifest(scan_id="leak-test", argus_version="1.0.0")
+        # Future regression: phase capture includes the failing
+        # subprocess argv, which includes a credential.
+        m.phases.append({
+            "name": "container_pull",
+            "status": "failure",
+            "error": "docker login failed with token=ghp_secret_1234567890abcdef",
+        })
+
+        filepath = m.save(tmp_path)
+        content = filepath.read_text()
+
+        assert "ghp_secret_1234567890abcdef" not in content
+        # Phase metadata that isn't a secret survives
+        assert "container_pull" in content
+        assert "failure" in content
+
+    def test_secret_in_artifact_path_redacted(self, tmp_path):
+        m = AuditManifest(scan_id="leak-test", argus_version="1.0.0")
+        # Pathological: an artifact path that embedded a token
+        # (e.g., a URL-shaped path with creds in it).
+        m.artifacts.append({
+            "name": "remote-fetch.log",
+            "path": "https://user:AKIA1234567890ABCDEF@bucket.s3.amazonaws.com/x",
+        })
+
+        filepath = m.save(tmp_path)
+        content = filepath.read_text()
+
+        assert "AKIA1234567890ABCDEF" not in content
+
+    def test_nested_dict_redacted(self, tmp_path):
+        """Secret several levels deep — walker must recurse."""
+        m = AuditManifest(scan_id="leak-test", argus_version="1.0.0")
+        m.phases.append({
+            "name": "auth",
+            "env_snapshot": {
+                "credentials": {
+                    "registry": "Bearer eyJhbGc.realtoken.signature",
+                    "non_secret": "/usr/bin",
+                },
+            },
+        })
+
+        filepath = m.save(tmp_path)
+        content = filepath.read_text()
+
+        assert "eyJhbGc.realtoken.signature" not in content
+        # Non-secret nested value survives
+        assert "/usr/bin" in content
+
+    def test_input_dict_not_mutated_after_save(self, tmp_path):
+        """Caller's manifest object stays unmodified after save()."""
+        m = AuditManifest(scan_id="leak-test", argus_version="1.0.0")
+        m.phases.append({
+            "name": "p",
+            "error": "token=ghp_aaaa1111bbbb2222cccc3333dddd4444",
+        })
+        original_error = m.phases[0]["error"]
+
+        m.save(tmp_path)
+
+        # In-memory view of the manifest unchanged — the redaction
+        # only affects what hits disk.
+        assert m.phases[0]["error"] == original_error
+
+    def test_clean_manifest_writes_no_redaction_tokens(self, tmp_path):
+        """A manifest with zero secret-shaped data has no false-positive
+        ``<REDACTED>`` markers."""
+        m = AuditManifest(scan_id="clean-test", argus_version="1.0.0")
+        m.phases.append({"name": "init", "status": "success"})
+
+        filepath = m.save(tmp_path)
+        content = filepath.read_text()
+
+        assert "<REDACTED>" not in content
diff --git a/argus/tests/audit/test_secrets.py b/argus/tests/audit/test_secrets.py
@@ -83,3 +83,104 @@ def test_case_insensitive_token(self):
     def test_case_insensitive_bearer(self):
         result = mask_secrets("bearer abc123def456ghi789jkl")
         assert "abc123def456ghi789jkl" not in result
+
+
+class TestMaskSecretsInObj:
+    """Recursive walker — defense-in-depth for audit-trail writes."""
+
+    def test_masks_string_at_root(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj("token=ghp_supersecret123456789")
+        assert REDACTED in result
+        assert "ghp_supersecret" not in result
+
+    def test_masks_string_in_dict_value(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj({
+            "config_path": "argus.yml",
+            "auth_header": "Bearer eyJhbGc.signature",
+        })
+        assert result["config_path"] == "argus.yml"
+        assert REDACTED in result["auth_header"]
+        assert "eyJhbGc" not in result["auth_header"]
+
+    def test_masks_nested_dict(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj({
+            "phase": "scan",
+            "env": {
+                "REGISTRY_TOKEN": "AKIA1234567890ABCDEF",
+                "PATH": "/usr/bin",
+            },
+        })
+        assert REDACTED in result["env"]["REGISTRY_TOKEN"]
+        assert result["env"]["PATH"] == "/usr/bin"
+        assert result["phase"] == "scan"
+
+    def test_masks_list_of_strings(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj([
+            "docker run",
+            "password=hunter2",
+            "myapp:latest",
+        ])
+        assert result[0] == "docker run"
+        assert REDACTED in result[1]
+        assert "hunter2" not in result[1]
+        assert result[2] == "myapp:latest"
+
+    def test_masks_through_tuple(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj(("normal", "token=sk-abc123def456ghi789"))
+        assert isinstance(result, tuple)
+        assert result[0] == "normal"
+        assert REDACTED in result[1]
+
+    def test_scalars_pass_through(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        assert mask_secrets_in_obj(42) == 42
+        assert mask_secrets_in_obj(3.14) == 3.14
+        assert mask_secrets_in_obj(True) is True
+        assert mask_secrets_in_obj(None) is None
+
+    def test_does_not_mutate_input(self):
+        from argus.audit.secrets import mask_secrets_in_obj
+        original = {"creds": {"token": "ghp_supersecret123456"}}
+        result = mask_secrets_in_obj(original)
+        # Caller's original is untouched
+        assert original["creds"]["token"] == "ghp_supersecret123456"
+        # Returned copy is masked
+        assert "ghp_supersecret" not in result["creds"]["token"]
+
+    def test_deeply_nested_mix(self):
+        """Realistic shape: dict of lists of dicts of strings."""
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj({
+            "phases": [
+                {"name": "init", "command": ["argus", "scan"]},
+                {"name": "auth", "command": ["docker", "login", "-p",
+                                              "ghp_secret_1234567890abcdef"]},
+            ],
+        })
+        # Non-secret strings preserved
+        assert result["phases"][0]["command"] == ["argus", "scan"]
+        # Secret-shaped string masked even at depth 4
+        assert "ghp_secret" not in result["phases"][1]["command"][3]
+        assert REDACTED in result["phases"][1]["command"][3]
+
+    def test_dict_keys_not_masked(self):
+        """Keys pass through unchanged — masking them would break consumers."""
+        from argus.audit.secrets import mask_secrets_in_obj
+        result = mask_secrets_in_obj({"ghp_keyname": "regular value"})
+        assert "ghp_keyname" in result  # key intact
+        assert result["ghp_keyname"] == "regular value"
+
+    def test_unknown_object_passes_through(self):
+        """Custom types we don't recognize pass through unchanged."""
+        from argus.audit.secrets import mask_secrets_in_obj
+
+        class Custom:
+            pass
+
+        obj = Custom()
+        assert mask_secrets_in_obj(obj) is obj
diff --git a/docs/developer/SDK-ROADMAP.md b/docs/developer/SDK-ROADMAP.md