fix(engine): defer to scanner.scan when build_args is missing

eFAILution · eFAILution · commit 46244f06c847 · 2026-05-05T23:03:04.000-04:00
Scanners with custom ``scan()`` flows that don't fit the standard ``build_args(ScanPaths) -> list[str]`` contract (linters that walk the workspace and invoke their tool per file) used to AttributeError inside ``_run_in_container`` when the engine routed them through the container path. Combined with PR #117's silent-drop loophole, that made ``lint-dockerfile`` disappear from canonical results entirely when hadolint was not installed locally.

Engine change: in ``_run_scanner``, the auto/docker branch now checks for ``build_args`` or ``container_args`` before entering ``_run_in_container``. When neither is present:

* backend=auto: log a debug message and fall through to the local path (which calls ``scanner.scan(path, config)`` directly).
* backend=docker: raise a clear RuntimeError naming the constraint ("scanner has container_image but no build_args/container_args method") so users know to either implement build_args or relax the backend.

HadolintLinter cleanup: collapse the per-Dockerfile subprocess loop into a single ``hadolint --format json file1 file2 ...`` invocation. Hadolint accepts multiple paths and emits one combined JSON array with each finding's source ``file`` field intact, so the parser still produces correct ``location`` strings without threading the path back through the caller. Drops one process spawn per Dockerfile on every scan.

Roadmap: docs/developer/SDK-ROADMAP.md adds a FileDiscoveryScanner template entry under Known Issues. The engine fallback gives every linter a working escape hatch today, but the real abstraction would be a base class that handles workspace walks, file globbing, container vs local routing, and batched invocation centrally so the six existing linters stop reimplementing it. Deferred until a second linter contributor copy-pastes the boilerplate; trigger to revisit documented inline.
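
As a rough illustration of why the batched hadolint call needs no path threaded through the caller, the sketch below groups ``file:line`` locations straight from a combined JSON array. The payload is made up for the example and only uses fields the parser in this commit reads (``file``, ``line``, ``code``, ``level``, ``column``); ``locations_by_file`` is a hypothetical helper, not part of ``HadolintLinter``.

```python
import json
from collections import defaultdict

# Illustrative payload only: shaped like the combined array described above.
# The rule codes and paths are placeholders, not captured hadolint output.
SAMPLE_STDOUT = json.dumps([
    {"file": "api/Dockerfile", "line": 3, "code": "DL3008", "level": "warning", "column": 1},
    {"file": "web/Dockerfile", "line": 1, "code": "DL3006", "level": "warning", "column": 1},
    {"file": "api/Dockerfile", "line": 9, "code": "DL3020", "level": "error", "column": 1},
])


def locations_by_file(stdout: str) -> dict[str, list[str]]:
    """Group ``file:line`` locations by the file each finding reports."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in json.loads(stdout):
        path = item.get("file", "")
        if not path:
            # No source file on the item: nothing to build a location from.
            continue
        grouped[path].append(f"{path}:{item.get('line', 0)}")
    return dict(grouped)


if __name__ == "__main__":
    for path, locations in locations_by_file(SAMPLE_STDOUT).items():
        print(path, locations)
```

Because every item carries its own ``file`` value, one invocation over many Dockerfiles yields the same locations as the old per-file loop.
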
Two new regression tests in ``TestDockerExecutionBackend``: test_auto_backend_defers_to_scan_when_no_build_args (the lint-dockerfile fix path) and test_docker_backend_rejects_scanner_without_build_args (the loud error when the user opted into container-only). 1519 SDK tests pass.
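
The routing rule those two tests pin down can be sketched as a standalone function. This is a simplified model under stated assumptions, not the engine's code: ``choose_execution_path`` and its string return values are hypothetical, and the engine's existing failure handling for ``backend=docker`` without a usable container is reduced to a placeholder ``"unavailable"`` result.

```python
def choose_execution_path(scanner, backend: str, docker_available: bool) -> str:
    """Return "container", "local", or "unavailable" for a scanner.

    Hypothetical helper that mirrors the decision described in this commit;
    the real logic lives inline in ``_run_scanner``.
    """
    if backend not in ("auto", "docker"):
        return "local"

    container_image = getattr(scanner, "container_image", "")
    # The container path needs an argv source: build_args or container_args.
    container_capable = hasattr(scanner, "build_args") or hasattr(
        scanner, "container_args"
    )

    if container_image and container_capable and docker_available:
        return "container"

    if container_image and not container_capable and backend == "docker":
        # Container-only was requested but the scanner cannot be driven that way.
        raise RuntimeError(
            f"Scanner '{scanner.name}' has a container_image but no "
            "build_args/container_args method, and backend is 'docker'."
        )

    # Placeholder: the real engine fails explicitly for docker at this point.
    return "unavailable" if backend == "docker" else "local"


class WalksWorkspace:
    """Stand-in for a linter with a custom scan() and no build_args."""

    name = "custom"
    container_image = "example/custom:1.0"

    def scan(self, path, config=None):
        return []


class Standard(WalksWorkspace):
    """Stand-in for a scanner that follows the build_args contract."""

    def build_args(self, paths, config=None):
        return ["tool", "--json"]


if __name__ == "__main__":
    print(choose_execution_path(WalksWorkspace(), "auto", True))
    print(choose_execution_path(Standard(), "auto", True))
```

Under ``auto`` the scanner without ``build_args`` resolves to the local path even when Docker is available, which is the ``lint-dockerfile`` fix; under ``docker`` the same scanner raises instead of falling back silently.
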
diff --git a/argus/core/engine.py b/argus/core/engine.py
@@ -950,7 +950,21 @@ def _run_scanner(
         if backend in ("auto", "docker"):
             container_image = getattr(scanner, "container_image", "")
 
-            if container_image and self._is_docker_available():
+            # The engine's container path drives ``docker run`` from the
+            # scanner's argv shape — either ``build_args(paths, config)``
+            # (PR #117) or the legacy ``container_args(config)``. Scanners
+            # without either method (linters with custom scan() flows
+            # like HadolintLinter that walk the workspace and invoke
+            # their tool per file) can't be driven that way; defer to
+            # ``scanner.scan()`` and let it handle execution. ``auto``
+            # mode falls through to the local path below; ``docker``
+            # mode raises so the constraint is loud.
+            container_capable = (
+                hasattr(scanner, "build_args")
+                or hasattr(scanner, "container_args")
+            )
+
+            if container_image and container_capable and self._is_docker_available():
                 logger.debug(
                     "Backend '%s': using container for '%s' (image=%s)",
                     backend,
@@ -969,6 +983,21 @@ def _run_scanner(
                         scanner.name,
                         exc,
                     )
+            elif container_image and not container_capable:
+                if backend == "docker":
+                    raise RuntimeError(
+                        f"Scanner '{scanner.name}' has a container_image "
+                        f"but no build_args/container_args method, and "
+                        f"backend is 'docker'. Implement build_args() or "
+                        f"set backend to 'auto'/'local' to use the "
+                        f"scanner's own scan() method."
+                    )
+                logger.debug(
+                    "Backend 'auto': scanner '%s' has no build_args/"
+                    "container_args — deferring to scanner.scan() instead "
+                    "of the container path",
+                    scanner.name,
+                )
 
             # docker backend requires containers — fail explicitly
             if backend == "docker":
diff --git a/argus/linters/hadolint.py b/argus/linters/hadolint.py
@@ -20,7 +20,14 @@ class HadolintLinter:
     container_image = get_image("hadolint")
 
     def scan(self, path: str, config: dict | None = None) -> ScanResult:
-        """Find Dockerfiles under path and lint each with hadolint."""
+        """Find Dockerfiles under *path* and lint them all in one hadolint invocation.
+
+        Hadolint accepts multiple file paths on its CLI (``hadolint
+        file1 file2 ...``) and emits a single JSON array spanning every
+        file's findings. Doing one batched call beats spawning
+        ``len(dockerfiles)`` subprocesses by N startup costs and keeps
+        the per-finding ``file`` field intact in the parsed output.
+        """
         config = config or {}
         target = Path(path)
 
@@ -31,12 +38,37 @@ def scan(self, path: str, config: dict | None = None) -> ScanResult:
                 metadata={"info": "No Dockerfiles found"},
             )
 
-        all_findings: list[Finding] = []
-        for dockerfile in dockerfiles:
-            findings = self._lint_file(dockerfile, config)
-            all_findings.extend(findings)
+        cmd = self._build_command(dockerfiles, config)
+        result = subprocess.run(cmd, capture_output=True, text=True)
+
+        # hadolint exits 0 when clean, non-zero when findings exist;
+        # both are the happy path. Empty stdout means hadolint itself
+        # failed; a missing binary raises FileNotFoundError instead.
+        if not result.stdout.strip():
+            return ScanResult(
+                scanner=self.name,
+                metadata={
+                    "execution_failed": True,
+                    "execution_failure_reason": (
+                        f"hadolint produced no output (exit={result.returncode}). "
+                        f"stderr: {(result.stderr or '').strip()[:400]}"
+                    ),
+                },
+            )
+
+        try:
+            data = json.loads(result.stdout)
+        except json.JSONDecodeError as exc:
+            return ScanResult(
+                scanner=self.name,
+                metadata={
+                    "execution_failed": True,
+                    "execution_failure_reason": f"Invalid JSON from hadolint: {exc}",
+                },
+            )
 
-        return ScanResult(scanner=self.name, findings=all_findings)
+        findings = [self._parse_item(item) for item in data]
+        return ScanResult(scanner=self.name, findings=findings)
 
     def is_available(self) -> bool:
         """Check if hadolint is installed."""
@@ -58,46 +90,37 @@ def _find_dockerfiles(self, target: Path) -> list[Path]:
             return [target]
         return sorted(target.rglob("Dockerfile*"))
 
-    def _lint_file(
-        self, dockerfile: Path, config: dict
-    ) -> list[Finding]:
-        """Run hadolint on a single Dockerfile and parse results."""
-        cmd = self._build_command(dockerfile, config)
-
-        result = subprocess.run(cmd, capture_output=True, text=True)
-
-        if not result.stdout.strip():
-            return []
-
-        try:
-            data = json.loads(result.stdout)
-        except json.JSONDecodeError:
-            return []
-
-        return [self._parse_item(item, dockerfile) for item in data]
-
     def _build_command(
-        self, dockerfile: Path, config: dict
+        self, dockerfiles: list[Path], config: dict
     ) -> list[str]:
-        """Build the hadolint CLI command."""
+        """Build a single hadolint command covering every Dockerfile.
+
+        Hadolint takes multiple file arguments and emits one combined
+        JSON array — far cheaper than spawning a process per file.
+        """
         cmd = ["hadolint", "--format", "json"]
 
         config_file = config.get("config_file")
         if config_file:
             cmd.extend(["--config", config_file])
 
-        ignore_rules = config.get("ignore_rules", [])
-        for rule in ignore_rules:
+        for rule in config.get("ignore_rules", []) or []:
             cmd.extend(["--ignore", rule])
 
-        cmd.append(str(dockerfile))
+        cmd.extend(str(p) for p in dockerfiles)
         return cmd
 
-    def _parse_item(self, item: dict, dockerfile: Path) -> Finding:
-        """Convert a single hadolint JSON result into a Finding."""
+    def _parse_item(self, item: dict) -> Finding:
+        """Convert a single hadolint JSON result into a Finding.
+
+        Hadolint emits the source file as ``item["file"]`` when it ran
+        against multiple paths — we use that directly instead of
+        threading the path in via the caller.
+        """
         rule_code = item.get("code", "UNKNOWN")
         line_num = item.get("line", 0)
-        location = f"{dockerfile}:{line_num}"
+        dockerfile = item.get("file", "")
+        location = f"{dockerfile}:{line_num}" if dockerfile else None
 
         return Finding(
             id=rule_code,
@@ -109,6 +132,6 @@ def _parse_item(self, item: dict, dockerfile: Path) -> Finding:
             metadata={
                 "level": item.get("level", ""),
                 "column": item.get("column", 0),
-                "file": str(dockerfile),
+                "file": dockerfile,
             },
         )
diff --git a/argus/tests/test_engine.py b/argus/tests/test_engine.py
@@ -222,6 +222,82 @@ def test_local_backend_fails_if_unavailable(self):
         assert len(summary.results) == 1
         assert summary.results[0].metadata.get("execution_failed") is True
 
+    def test_auto_backend_defers_to_scan_when_no_build_args(self, monkeypatch):
+        """Scanners with a custom ``scan()`` flow but no ``build_args``/
+        ``container_args`` (e.g. linters that walk the workspace and
+        invoke their tool per file) should defer to ``scanner.scan()``
+        instead of the engine's container path.
+
+        Regression for the lint-dockerfile bug: HadolintLinter has
+        ``container_image`` set but no ``build_args``, so the engine
+        used to AttributeError inside ``_run_in_container`` and the
+        scanner silently disappeared from results.
+        """
+        engine = self._make_engine(backend="auto")
+        # Pretend Docker is available so the engine would have chosen
+        # the container path if it could.
+        monkeypatch.setattr(engine, "_is_docker_available", lambda: True)
+
+        captured: dict = {}
+
+        class CustomScanScanner:
+            name = "custom"
+            container_image = "example/custom:1.0"
+            # Deliberately no build_args or container_args.
+
+            def scan(self, path, config=None):
+                captured["scan_called"] = (path, config)
+                return ScanResult(
+                    scanner=self.name,
+                    findings=[Finding(
+                        id="X", severity=Severity.LOW, title="from scan()",
+                    )],
+                )
+
+            def is_available(self):
+                return True
+
+            def install_command(self):
+                return None
+
+        engine.register_scanner(CustomScanScanner())
+        summary = engine.run(scanner_names=["custom"])
+
+        # scan() was called, container path was bypassed, findings flow through.
+        assert captured.get("scan_called") is not None
+        assert len(summary.results) == 1
+        assert len(summary.results[0].findings) == 1
+        assert summary.results[0].findings[0].title == "from scan()"
+
+    def test_docker_backend_rejects_scanner_without_build_args(self, monkeypatch):
+        """``backend: docker`` must fail loudly when a scanner has a
+        container image but no way to run in one — the user explicitly
+        opted into container-only execution and silent fallback would
+        violate that contract."""
+        engine = self._make_engine(backend="docker")
+        monkeypatch.setattr(engine, "_is_docker_available", lambda: True)
+
+        class CustomScanScanner:
+            name = "custom"
+            container_image = "example/custom:1.0"
+
+            def scan(self, path, config=None):
+                return ScanResult(scanner=self.name)
+
+            def is_available(self):
+                return False
+
+            def install_command(self):
+                return None
+
+        engine.register_scanner(CustomScanScanner())
+        summary = engine.run(scanner_names=["custom"])
+        # Surfaces as a failure row with the loud error.
+        assert len(summary.results) == 1
+        meta = summary.results[0].metadata
+        assert meta.get("execution_failed") is True
+        assert "build_args" in meta.get("execution_failure_reason", "")
+
     def test_resolve_image_no_registry(self):
         engine = self._make_engine(registry="")
         scanner = MockScanner("trivy", container_image="aquasec/trivy:0.58.0")
diff --git a/docs/developer/SDK-ROADMAP.md b/docs/developer/SDK-ROADMAP.md
@@ -624,6 +624,45 @@ All engine, scanner, and testing issues from the migration have been resolved.
 
 ---
 
+## FileDiscoveryScanner Template
+
+Linters (`lint-yaml`, `lint-json`, `lint-python`, `lint-javascript`, `lint-dockerfile`, `lint-terraform`) and a few security scanners share a shape that doesn't fit the standard `build_args(ScanPaths) → list[str]` contract introduced in PR #117: they need to **discover files of a specific shape under a workspace, then run their tool against those file paths** (not against the workspace as a whole). Today each one rolls its own `_find_*` walk + per-file subprocess loop in its `scan()` method, which has three problems:
+
+1. **Multi-subprocess inefficiency.** `HadolintLinter.scan()` (pre-PR #120) ran `subprocess.run(['hadolint', dockerfile])` once per Dockerfile in a Python loop — N startup costs for N files. Most of these tools accept a list of paths in a single invocation (`hadolint file1 file2 ...`), so the loop is unnecessary.
+2. **No container-execution support.** The custom `scan()` flows hardcode `subprocess.run(['<binary>', ...])` and crash with `FileNotFoundError` when the binary isn't installed locally. The engine's container backend was added later and never extended to cover the discovery shape.
+3. **Discovery patterns are duplicated.** Every linter implements its own `_find_dockerfiles` / `_find_yaml_files` / etc. with subtly different exclusion logic.
+
+**Proposed shape**: a `FileDiscoveryScanner` mixin or template (analogous to `argus.core.scanner_template.run_subprocess_scan`). A linter built on it might look like this:
+
+```python
+class HadolintLinter(FileDiscoveryScanner):
+    name = "lint-dockerfile"
+    file_glob = "Dockerfile*"            # workspace-relative pattern
+    container_image = get_image("hadolint")
+
+    def build_args(self, files: list[str], output: str) -> list[str]:
+        # Tool that accepts multiple file paths in one invocation.
+        return ["hadolint", "--format", "json", *files]
+
+    def parse_results(self, output_path) -> list[Finding]:
+        ...
+```
+
+The shared template handles:
+- Workspace walk + glob matching with the standard exclusion set
+- Single subprocess call (or `docker run`) with all matched files
+- Container vs local routing via the existing `is_available()` / `container_image` mechanism
+- Output file lifecycle + `parse_results` dispatch
+- Empty-discovery case (return clean ScanResult with `no <files> found` info, not a failure row)
+
+**Why deferred**: PR #120's engine fallback (`scanner.scan()` is honored when `build_args` is missing) gives every scanner a working escape hatch today, and PR #119's failure-row contract makes any remaining edge case visible. The template is a quality-of-life improvement for adding new linters that don't fit the standard shape — worth the design conversation but not load-bearing for any current functionality.
+
+**Trigger to revisit**: when the second new linter contributor copy-pastes the file-discovery boilerplate from `HadolintLinter`. At that point the duplication has earned the abstraction.
+
+**If/when we ship it**: likely `argus/core/file_discovery_scanner.py` exporting the template, plus migrations for the existing 6 linters + any security scanner with a similar shape (e.g. clamav's recursive directory scan). Each migration is a self-contained PR.
+
+---
+
 ## Secret Redaction Hardening
 
 The current redaction model (per-scanner, at the parser) is documented in [`docs/mcp.md` → Secrets handling](../mcp.md). Each scanner that emits potentially-sensitive content audits its own output and replaces secret-bearing fields with the `<redacted>` placeholder before the `Finding` is built. Downstream consumers (terminal reporter, JSON / Markdown / SARIF exports, MCP tool responses, the LLM context window) therefore never see raw values.