Skip to content

Commit 46244f0

Browse files
committed
fix(engine): defer to scanner.scan when build_args is missing
Scanners with custom ``scan()`` flows that don't fit the standard ``build_args(ScanPaths) -> list[str]`` contract (linters that walk the workspace and invoke their tool per file) used to AttributeError inside ``_run_in_container`` when the engine routed them through the container path. Combined with PR #117's silent-drop loophole, that made ``lint-dockerfile`` disappear from canonical results entirely when hadolint was not installed locally. Engine change: in ``_run_scanner``, the auto/docker branch now checks for ``build_args`` or ``container_args`` before entering ``_run_in_container``. When neither is present: * backend=auto: log a debug message and fall through to the local path (which calls ``scanner.scan(path, config)`` directly). * backend=docker: raise a clear RuntimeError naming the constraint ("scanner has container_image but no build_args/container_args method") so users know to either implement build_args or relax the backend. HadolintLinter cleanup: collapse the per-Dockerfile subprocess loop into a single ``hadolint --format json file1 file2 ...`` invocation. Hadolint accepts multiple paths and emits one combined JSON array with each finding's source ``file`` field intact, so the parser still produces correct ``location`` strings without threading the path back through the caller. Drops one process spawn per Dockerfile on every scan. Roadmap: docs/developer/SDK-ROADMAP.md adds a FileDiscoveryScanner template entry under Known Issues. The engine fallback gives every linter a working escape hatch today, but the real abstraction would be a base class that handles workspace walks, file globbing, container vs local routing, and batched invocation centrally so the six existing linters stop reimplementing it. Deferred until a second linter contributor copy-pastes the boilerplate; trigger to revisit documented inline. Two new regression tests in ``TestDockerExecutionBackend``: test_auto_backend_defers_to_scan_when_no_build_args (the lint-dockerfile fix path) and test_docker_backend_rejects_scanner_without_build_args (the loud error when the user opted into container-only). 1519 SDK tests pass.
1 parent a8c54c9 commit 46244f0

4 files changed

Lines changed: 201 additions & 34 deletions

File tree

argus/core/engine.py

Lines changed: 30 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -950,7 +950,21 @@ def _run_scanner(
950950
if backend in ("auto", "docker"):
951951
container_image = getattr(scanner, "container_image", "")
952952

953-
if container_image and self._is_docker_available():
953+
# The engine's container path drives ``docker run`` from the
954+
# scanner's argv shape — either ``build_args(paths, config)``
955+
# (PR #117) or the legacy ``container_args(config)``. Scanners
956+
# without either method (linters with custom scan() flows
957+
# like HadolintLinter that walk the workspace and invoke
958+
# their tool per file) can't be driven that way; defer to
959+
# ``scanner.scan()`` and let it handle execution. ``auto``
960+
# mode falls through to the local path below; ``docker``
961+
# mode raises so the constraint is loud.
962+
container_capable = (
963+
hasattr(scanner, "build_args")
964+
or hasattr(scanner, "container_args")
965+
)
966+
967+
if container_image and container_capable and self._is_docker_available():
954968
logger.debug(
955969
"Backend '%s': using container for '%s' (image=%s)",
956970
backend,
@@ -969,6 +983,21 @@ def _run_scanner(
969983
scanner.name,
970984
exc,
971985
)
986+
elif container_image and not container_capable:
987+
if backend == "docker":
988+
raise RuntimeError(
989+
f"Scanner '{scanner.name}' has a container_image "
990+
f"but no build_args/container_args method, and "
991+
f"backend is 'docker'. Implement build_args() or "
992+
f"set backend to 'auto'/'local' to use the "
993+
f"scanner's own scan() method."
994+
)
995+
logger.debug(
996+
"Backend 'auto': scanner '%s' has no build_args/"
997+
"container_args — deferring to scanner.scan() instead "
998+
"of the container path",
999+
scanner.name,
1000+
)
9721001

9731002
# docker backend requires containers — fail explicitly
9741003
if backend == "docker":

argus/linters/hadolint.py

Lines changed: 56 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,14 @@ class HadolintLinter:
2020
container_image = get_image("hadolint")
2121

2222
def scan(self, path: str, config: dict | None = None) -> ScanResult:
23-
"""Find Dockerfiles under path and lint each with hadolint."""
23+
"""Find Dockerfiles under *path* and lint them all in one hadolint invocation.
24+
25+
Hadolint accepts multiple file paths on its CLI (``hadolint
26+
file1 file2 ...``) and emits a single JSON array spanning every
27+
file's findings. Doing one batched call beats spawning
28+
``len(dockerfiles)`` subprocesses by N startup costs and keeps
29+
the per-finding ``file`` field intact in the parsed output.
30+
"""
2431
config = config or {}
2532
target = Path(path)
2633

@@ -31,12 +38,37 @@ def scan(self, path: str, config: dict | None = None) -> ScanResult:
3138
metadata={"info": "No Dockerfiles found"},
3239
)
3340

34-
all_findings: list[Finding] = []
35-
for dockerfile in dockerfiles:
36-
findings = self._lint_file(dockerfile, config)
37-
all_findings.extend(findings)
41+
cmd = self._build_command(dockerfiles, config)
42+
result = subprocess.run(cmd, capture_output=True, text=True)
43+
44+
# hadolint exits 0 when clean, non-zero when findings exist —
45+
# both are the happy path. Empty stdout means a real error
46+
# (binary missing, parse failure inside hadolint, etc.).
47+
if not result.stdout.strip():
48+
return ScanResult(
49+
scanner=self.name,
50+
metadata={
51+
"execution_failed": True,
52+
"execution_failure_reason": (
53+
f"hadolint produced no output (exit={result.returncode}). "
54+
f"stderr: {(result.stderr or '').strip()[:400]}"
55+
),
56+
},
57+
)
58+
59+
try:
60+
data = json.loads(result.stdout)
61+
except json.JSONDecodeError as exc:
62+
return ScanResult(
63+
scanner=self.name,
64+
metadata={
65+
"execution_failed": True,
66+
"execution_failure_reason": f"Invalid JSON from hadolint: {exc}",
67+
},
68+
)
3869

39-
return ScanResult(scanner=self.name, findings=all_findings)
70+
findings = [self._parse_item(item) for item in data]
71+
return ScanResult(scanner=self.name, findings=findings)
4072

4173
def is_available(self) -> bool:
4274
"""Check if hadolint is installed."""
@@ -58,46 +90,37 @@ def _find_dockerfiles(self, target: Path) -> list[Path]:
5890
return [target]
5991
return sorted(target.rglob("Dockerfile*"))
6092

61-
def _lint_file(
62-
self, dockerfile: Path, config: dict
63-
) -> list[Finding]:
64-
"""Run hadolint on a single Dockerfile and parse results."""
65-
cmd = self._build_command(dockerfile, config)
66-
67-
result = subprocess.run(cmd, capture_output=True, text=True)
68-
69-
if not result.stdout.strip():
70-
return []
71-
72-
try:
73-
data = json.loads(result.stdout)
74-
except json.JSONDecodeError:
75-
return []
76-
77-
return [self._parse_item(item, dockerfile) for item in data]
78-
7993
def _build_command(
80-
self, dockerfile: Path, config: dict
94+
self, dockerfiles: list[Path], config: dict
8195
) -> list[str]:
82-
"""Build the hadolint CLI command."""
96+
"""Build a single hadolint command covering every Dockerfile.
97+
98+
Hadolint takes multiple file arguments and emits one combined
99+
JSON array — far cheaper than spawning a process per file.
100+
"""
83101
cmd = ["hadolint", "--format", "json"]
84102

85103
config_file = config.get("config_file")
86104
if config_file:
87105
cmd.extend(["--config", config_file])
88106

89-
ignore_rules = config.get("ignore_rules", [])
90-
for rule in ignore_rules:
107+
for rule in config.get("ignore_rules", []) or []:
91108
cmd.extend(["--ignore", rule])
92109

93-
cmd.append(str(dockerfile))
110+
cmd.extend(str(p) for p in dockerfiles)
94111
return cmd
95112

96-
def _parse_item(self, item: dict, dockerfile: Path) -> Finding:
97-
"""Convert a single hadolint JSON result into a Finding."""
113+
def _parse_item(self, item: dict) -> Finding:
114+
"""Convert a single hadolint JSON result into a Finding.
115+
116+
Hadolint emits the source file as ``item["file"]`` when it ran
117+
against multiple paths — we use that directly instead of
118+
threading the path in via the caller.
119+
"""
98120
rule_code = item.get("code", "UNKNOWN")
99121
line_num = item.get("line", 0)
100-
location = f"{dockerfile}:{line_num}"
122+
dockerfile = item.get("file", "")
123+
location = f"{dockerfile}:{line_num}" if dockerfile else None
101124

102125
return Finding(
103126
id=rule_code,
@@ -109,6 +132,6 @@ def _parse_item(self, item: dict, dockerfile: Path) -> Finding:
109132
metadata={
110133
"level": item.get("level", ""),
111134
"column": item.get("column", 0),
112-
"file": str(dockerfile),
135+
"file": dockerfile,
113136
},
114137
)

argus/tests/test_engine.py

Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -222,6 +222,82 @@ def test_local_backend_fails_if_unavailable(self):
222222
assert len(summary.results) == 1
223223
assert summary.results[0].metadata.get("execution_failed") is True
224224

225+
def test_auto_backend_defers_to_scan_when_no_build_args(self, monkeypatch):
226+
"""Scanners with a custom ``scan()`` flow but no ``build_args``/
227+
``container_args`` (e.g. linters that walk the workspace and
228+
invoke their tool per file) should defer to ``scanner.scan()``
229+
instead of the engine's container path.
230+
231+
Regression for the lint-dockerfile bug: HadolintLinter has
232+
``container_image`` set but no ``build_args``, so the engine
233+
used to AttributeError inside ``_run_in_container`` and the
234+
scanner silently disappeared from results.
235+
"""
236+
engine = self._make_engine(backend="auto")
237+
# Pretend Docker is available so the engine would have chosen
238+
# the container path if it could.
239+
monkeypatch.setattr(engine, "_is_docker_available", lambda: True)
240+
241+
captured: dict = {}
242+
243+
class CustomScanScanner:
244+
name = "custom"
245+
container_image = "example/custom:1.0"
246+
# Deliberately no build_args or container_args.
247+
248+
def scan(self, path, config=None):
249+
captured["scan_called"] = (path, config)
250+
return ScanResult(
251+
scanner=self.name,
252+
findings=[Finding(
253+
id="X", severity=Severity.LOW, title="from scan()",
254+
)],
255+
)
256+
257+
def is_available(self):
258+
return True
259+
260+
def install_command(self):
261+
return None
262+
263+
engine.register_scanner(CustomScanScanner())
264+
summary = engine.run(scanner_names=["custom"])
265+
266+
# scan() was called, container path was bypassed, findings flow through.
267+
assert captured.get("scan_called") is not None
268+
assert len(summary.results) == 1
269+
assert len(summary.results[0].findings) == 1
270+
assert summary.results[0].findings[0].title == "from scan()"
271+
272+
def test_docker_backend_rejects_scanner_without_build_args(self, monkeypatch):
273+
"""``backend: docker`` must fail loudly when a scanner has a
274+
container image but no way to run in one — the user explicitly
275+
opted into container-only execution and silent fallback would
276+
violate that contract."""
277+
engine = self._make_engine(backend="docker")
278+
monkeypatch.setattr(engine, "_is_docker_available", lambda: True)
279+
280+
class CustomScanScanner:
281+
name = "custom"
282+
container_image = "example/custom:1.0"
283+
284+
def scan(self, path, config=None):
285+
return ScanResult(scanner=self.name)
286+
287+
def is_available(self):
288+
return False
289+
290+
def install_command(self):
291+
return None
292+
293+
engine.register_scanner(CustomScanScanner())
294+
summary = engine.run(scanner_names=["custom"])
295+
# Surfaces as a failure row with the loud error.
296+
assert len(summary.results) == 1
297+
meta = summary.results[0].metadata
298+
assert meta.get("execution_failed") is True
299+
assert "build_args" in meta.get("execution_failure_reason", "")
300+
225301
def test_resolve_image_no_registry(self):
226302
engine = self._make_engine(registry="")
227303
scanner = MockScanner("trivy", container_image="aquasec/trivy:0.58.0")

docs/developer/SDK-ROADMAP.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -624,6 +624,45 @@ All engine, scanner, and testing issues from the migration have been resolved.
624624

625625
---
626626

627+
## FileDiscoveryScanner Template
628+
629+
Linters (`lint-yaml`, `lint-json`, `lint-python`, `lint-javascript`, `lint-dockerfile`, `lint-terraform`) and a few security scanners share a shape that doesn't fit the standard `build_args(ScanPaths) → list[str]` contract introduced in PR #117: they need to **discover files of a specific shape under a workspace, then run their tool against those file paths** (not against the workspace as a whole). Today each one rolls its own `_find_*` walk + per-file subprocess loop in its `scan()` method, which has three problems:
630+
631+
1. **Multi-subprocess inefficiency.** `HadolintLinter.scan()` (pre-PR #120) ran `subprocess.run(['hadolint', dockerfile])` once per Dockerfile in a Python loop — N startup costs for N files. Most of these tools accept a list of paths in a single invocation (`hadolint file1 file2 ...`), so the loop is unnecessary.
632+
2. **No container-execution support.** The custom `scan()` flows hardcode `subprocess.run(['<binary>', ...])` and crash with `FileNotFoundError` when the binary isn't installed locally. The engine's container backend was added later and never extended to cover the discovery shape.
633+
3. **Discovery patterns are duplicated.** Every linter implements its own `_find_dockerfiles` / `_find_yaml_files` / etc. with subtly different exclusion logic.
634+
635+
**Proposed shape**: a `FileDiscoveryScanner` mixin or template (analogous to `argus.core.scanner_template.run_subprocess_scan`) that:
636+
637+
```python
638+
class HadolintLinter(FileDiscoveryScanner):
639+
name = "lint-dockerfile"
640+
file_glob = "Dockerfile*" # workspace-relative pattern
641+
container_image = get_image("hadolint")
642+
643+
def build_args(self, files: list[str], output: str) -> list[str]:
644+
# Tool that accepts multiple file paths in one invocation.
645+
return ["hadolint", "--format", "json", *files]
646+
647+
def parse_results(self, output_path) -> list[Finding]:
648+
...
649+
```
650+
651+
The shared template handles:
652+
- Workspace walk + glob matching with the standard exclusion set
653+
- Single subprocess call (or `docker run`) with all matched files
654+
- Container vs local routing via the existing `is_available()` / `container_image` mechanism
655+
- Output file lifecycle + `parse_results` dispatch
656+
- Empty-discovery case (return clean ScanResult with `no <files> found` info, not a failure row)
657+
658+
**Why deferred**: PR #120's engine fallback (`scanner.scan()` is honored when `build_args` is missing) gives every scanner a working escape hatch today, and PR #119's failure-row contract makes any remaining edge case visible. The template is a quality-of-life improvement for adding new linters that don't fit the standard shape — worth the design conversation but not load-bearing for any current functionality.
659+
660+
**Trigger to revisit**: when the second new linter contributor copy-pastes the file-discovery boilerplate from `HadolintLinter`. At that point the duplication has earned the abstraction.
661+
662+
**If/when we ship it**: likely `argus/core/file_discovery_scanner.py` exporting the template, plus migrations for the existing 6 linters + any security scanner with a similar shape (e.g. clamav's recursive directory scan). Each migration is a self-contained PR.
663+
664+
---
665+
627666
## Secret Redaction Hardening
628667

629668
The current redaction model (per-scanner, at the parser) is documented in [`docs/mcp.md` → Secrets handling](../mcp.md). Each scanner that emits potentially-sensitive content audits its own output and replaces secret-bearing fields with the `<redacted>` placeholder before the `Finding` is built. Downstream consumers (terminal reporter, JSON / Markdown / SARIF exports, MCP tool responses, the LLM context window) therefore never see raw values.

0 commit comments

Comments
 (0)