📝 Walkthrough

This pull request adds an Agent Eval Harness: package scaffolding, CLI entrypoints, runner abstractions (`EvalRunner`, `RunResult`, `ClaudeCodeRunner`), YAML-based `EvalConfig`, MLflow utilities, a filesystem state CLI, skill specs and scripts for eval-setup, eval-analyze, eval-run, eval-mlflow, and eval-optimize, plus supporting scripts for workspace prep, execution, collection, and scoring. Documentation and packaging metadata (README.md, CLAUDE.md, pyproject.toml, .gitignore, eval.yaml, plugin manifest) are included.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Security issues:
- CWE-94: Improper Control of Generation of Code (Code Injection)
- CWE-78: Improper Neutralization of Special Elements used in an OS Command (OS Command Injection)
- CWE-22: Improper Limitation of a Pathname to a Restricted Directory (Path Traversal)
- CWE-502: Deserialization of Untrusted Data

Code quality & logic gaps: input validation, error handling, hard-coded limits, path handling, test coverage.

🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed (inconclusive)
Actionable comments posted: 9
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (19)
agent_eval/__main__.py-116-123 (1)

116-123: ⚠️ Potential issue | 🟠 Major: Validate the `--data` payload shape and fail fast when `SKILL.md` is missing.

Line 116 can parse to a scalar or `None`, after which Line 130 raises `AttributeError`. Also, when no `SKILL.md` is found (lines 118-123), an empty `skill_hash` is written, creating invalid eval metadata.

Proposed fix:

```diff
- data = yaml.safe_load(data_yaml)
+ data = yaml.safe_load(data_yaml) or {}
+ if not isinstance(data, dict):
+     print("Invalid --data: expected YAML mapping", file=sys.stderr)
+     sys.exit(1)
@@
  skill_hash = ""
@@
      if skill_path.exists():
-         skill_hash = hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]
+         skill_hash = hashlib.sha256(skill_path.read_bytes()).hexdigest()[:12]
          break
+ if not skill_hash:
+     print(f"STALE: skill {skill} not found", file=sys.stderr)
+     sys.exit(1)
@@
- fm.update(data.get("frontmatter", data))
+ frontmatter = data.get("frontmatter", data)
+ if not isinstance(frontmatter, dict):
+     print("Invalid --data.frontmatter: expected mapping", file=sys.stderr)
+     sys.exit(1)
+ fm.update(frontmatter)
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 130-131

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/__main__.py` around lines 116-123, the parsed YAML payload in `data` (from `yaml.safe_load(data_yaml)`) can be scalar or `None` and leads to `AttributeError`; validate that `data` is a dict containing a non-empty "skill" string (e.g., check `isinstance(data, dict)` and `"skill" in data`) and fail fast with a clear error if not. For the `SKILL.md` lookup, compute the hash only when `SKILL.md` exists, and instead of leaving `skill_hash` as an empty string, raise an explicit error when no `SKILL.md` is found for the given skill name so invalid eval metadata is never written.
agent_eval/state.py-50-51 (1)

50-51: ⚠️ Potential issue | 🟠 Major: Guard positional arguments before indexing `sys.argv`.

Line 51, Line 58, Line 68, Line 80, and Line 86 assume `sys.argv[2]` exists; malformed invocations crash with `IndexError` instead of emitting a controlled CLI error.

Proposed fix:

```diff
+def _require_args(min_len: int, usage: str) -> None:
+    if len(sys.argv) < min_len:
+        print(f"Usage: {usage}", file=sys.stderr)
+        sys.exit(1)
+
 def main():
@@
     if cmd == "init":
+        _require_args(3, "state.py init <path> [key=value ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "set":
+        _require_args(3, "state.py set <path> [key=value ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "read":
+        _require_args(3, "state.py read <path>")
         path = sys.argv[2]
@@
     elif cmd == "write-ids":
+        _require_args(3, "state.py write-ids <path> [ID ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "read-ids":
+        _require_args(3, "state.py read-ids <path>")
         path = Path(sys.argv[2])
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 57-58, 67-68, 79-80, 85-86

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/state.py` around lines 50-51, the command-handling branches index `sys.argv[2]` directly (e.g., `Path(sys.argv[2])` when `cmd == "init"`), which crashes on malformed invocations; add guards that verify enough positional args exist before indexing (check `len(sys.argv) > 2`, or use argparse for required positionals) and emit a clear usage message instead of letting `IndexError` propagate; update every branch that reads `sys.argv[2]` to perform the guard and exit with a controlled error on missing args.
agent_eval/mlflow/experiment.py-53-56 (1)

53-56: ⚠️ Potential issue | 🟠 Major: Add a timeout to the MLflow subprocess invocation.

Line 53 can block forever if `mlflow` hangs, stalling the eval pipeline and causing cascading failures in orchestration.

Proposed fix:

```diff
-    result = subprocess.run(cmd, capture_output=True, text=True)
+    try:
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
+    except subprocess.TimeoutExpired as exc:
+        print(f"MLflow autolog timed out: {exc}", file=sys.stderr)
+        return False
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/mlflow/experiment.py` around lines 53-56, the subprocess call that runs the MLflow setup (`result = subprocess.run(cmd, capture_output=True, text=True)`) can hang; add a `timeout` argument to `subprocess.run` and wrap the call in try/except for `subprocess.TimeoutExpired`: on timeout, log an error including the timeout duration and return False, and keep the existing stderr handling for non-zero return codes so the function returns True only on successful completion.
skills/eval-analyze/scripts/discover.py-106-106 (1)

106-106: ⚠️ Potential issue | 🟠 Major: Replace MD5 with SHA-256 for file freshness hashing (CWE-327).

MD5 is cryptographically broken, with practical collision attacks feasible in seconds to minutes on modern hardware. An attacker can craft two different skill files with identical MD5 hashes, allowing malicious skill content to bypass freshness detection. This violates NIST and IETF (RFC 6151) guidance against using MD5 where collision resistance is required, which applies to file integrity verification against tampering. Recent exploitation: CVE-2024-3596 (Blast-RADIUS) demonstrated real-world MD5 collision exploits in production systems.

Fix:

```diff
- current_hash = hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]
+ current_hash = hashlib.sha256(skill_path.read_bytes()).hexdigest()[:12]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-analyze/scripts/discover.py` at line 106, the line computing `current_hash` uses `hashlib.md5`; replace it with a SHA-256 digest via `hashlib.sha256(skill_path.read_bytes()).hexdigest()` (or a safe-length slice) so the freshness comparison continues to work with collision-resistant hashing.
agent_eval/config.py-124-127 (1)

124-127: ⚠️ Potential issue | 🟠 Major: `project_root` returns the wrong directory for non-CWD config loads.

Line 127 returns `Path.cwd()`, but the docstring says "where eval.yaml lives." This breaks relative path resolution whenever `--config` points outside the current working directory.

Remediation patch:

```diff
 @property
 def project_root(self) -> Path:
     """Project root (where eval.yaml lives)."""
-    return Path.cwd()
+    return self.config_dir
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/config.py` around lines 124-127, the `project_root` property always returns `Path.cwd()`, which is wrong when the config was loaded from a file outside the CWD; derive the directory from the loaded config file path (e.g., return `Path(self.config_path).parent` when a config path is recorded), falling back to `Path.cwd()` only when no config path is available, so relative paths resolve relative to where eval.yaml actually lives.
skills/eval-run/scripts/workspace.py-111-123 (1)

111-123: ⚠️ Potential issue | 🟠 Major: Handle YAML/JSON parse errors per case instead of aborting the run.

Lines 113-120 can raise parser exceptions and terminate `workspace.py` on one malformed case file.

Remediation patch:

```diff
 def _read_input(case_dir):
@@
         if name.is_file() and name.suffix in (".yaml", ".yml"):
-            with open(name) as f:
-                data = yaml.safe_load(f)
+            try:
+                with open(name, encoding="utf-8") as f:
+                    data = yaml.safe_load(f)
+            except yaml.YAMLError:
+                continue
             if isinstance(data, dict):
                 return data
         elif name.is_file() and name.suffix == ".json":
             import json
-            with open(name) as f:
-                data = json.load(f)
+            try:
+                with open(name, encoding="utf-8") as f:
+                    data = json.load(f)
+            except json.JSONDecodeError:
+                continue
             if isinstance(data, dict):
                 return data
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/scripts/workspace.py` around lines 111-123, the loop that reads case files (the `for name in sorted(case_dir.iterdir())` block) can raise YAML/JSON parser exceptions and abort the run; wrap each file's parsing in try/except catching `yaml.YAMLError` and `json.JSONDecodeError` (with a broad `Exception` fallback), log a warning including the filename and error, and continue to the next file instead of letting the exception propagate; only return `data` when `isinstance(data, dict)` and skip non-dict or failed parses.
agent_eval/config.py-81-117 (1)

81-117: ⚠️ Potential issue | 🟠 Major: Harden YAML shape validation to avoid runtime crashes.

Lines 85/98/105 assume mapping/list item types. Malformed YAML (e.g., a scalar `outputs`, non-dict judge entries) will crash with `AttributeError` instead of a controlled config error.

Remediation patch:

```diff
-    with open(path) as f:
+    with open(path, encoding="utf-8") as f:
         raw = yaml.safe_load(f) or {}
+    if not isinstance(raw, dict):
+        raise ValueError("Config root must be a mapping/object")
@@
-    dataset = raw.get("dataset", {})
+    dataset = raw.get("dataset", {})
+    if not isinstance(dataset, dict):
+        raise ValueError("'dataset' must be a mapping/object")
@@
-    for o in raw.get("outputs", []):
+    outputs = raw.get("outputs", [])
+    if not isinstance(outputs, list):
+        raise ValueError("'outputs' must be a list")
+    for o in outputs:
+        if not isinstance(o, dict):
+            raise ValueError("Each outputs entry must be a mapping/object")
         config.outputs.append(OutputConfig(
@@
-    for j in raw.get("judges", []):
+    judges = raw.get("judges", [])
+    if not isinstance(judges, list):
+        raise ValueError("'judges' must be a list")
+    for j in judges:
+        if not isinstance(j, dict):
+            raise ValueError("Each judges entry must be a mapping/object")
         config.judges.append(JudgeConfig(
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/config.py` around lines 81-117, the YAML parsing assumes specific shapes and raises `AttributeError` on malformed values; validate types before iterating or accessing mapping keys: ensure `raw.get("dataset", {})` is a dict (otherwise raise a clear ValueError), ensure `raw.get("outputs", [])` is a list and each item is a dict before creating `OutputConfig`, and ensure `raw.get("judges", [])` is a list and each item is a dict before creating `JudgeConfig`; add concise error messages that include the offending key or path to help debugging, and bail early instead of letting `AttributeError` propagate.
skills/eval-run/scripts/collect.py-43-45 (1)

43-45: ⚠️ Potential issue | 🟠 Major: Validate the `case_order.yaml` shape before use.

Line 44 can load `None`, a scalar, or an object; later `len(case_order)` and index access then fail unpredictably.

Remediation patch:

```diff
     with open(order_path) as f:
         case_order = yaml.safe_load(f)
+    if not isinstance(case_order, list):
+        print("ERROR: case_order.yaml must contain a list", file=sys.stderr)
+        sys.exit(1)
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/scripts/collect.py` around lines 43-45, the YAML load of `case_order` (from `order_path`) can return `None` or a non-list, which later causes `len(case_order)` and index access to fail; after the `yaml.safe_load` call, validate that `case_order` is a list (`isinstance(case_order, list)`) and raise a clear `ValueError` or exit with an explanatory message including `order_path` when it is not, so downstream uses of `case_order` are safe.
skills/eval-run/SKILL.md-23-27 (1)

23-27: ⚠️ Potential issue | 🟠 Major: Respect the parsed config path; avoid hardcoded `eval.yaml`.

Line 26 and Line 44 bypass the user-supplied config path, so runs can validate one file and execute another.

Proposed fix:

```diff
-Check if eval.yaml exists:
+Check if <config> exists:
@@
-test -f eval.yaml && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
+test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
@@
-Read `dataset.path` from eval.yaml, verify the directory exists and contains case subdirectories.
+Read `dataset.path` from <config>, verify the directory exists and contains case subdirectories.
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 44-44

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/SKILL.md` around lines 23-27, the check hardcodes "eval.yaml" (on the test command lines around lines 26 and 44), which ignores the user-supplied config path; update the commands to use the variable that holds the supplied config path instead of the literal "eval.yaml", ensuring both the existence check and any subsequent uses reference that variable so validation and execution use the same file.
skills/eval-optimize/SKILL.md-42-43 (1)

42-43: ⚠️ Potential issue | 🟠 Major: Remove unsupported `--score` from `/eval-run` invocations.

The workflow calls `/eval-run` with a flag that is not defined in the eval-run skill interface, so this step is brittle/failing by construction.

Proposed fix:

```diff
-Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N> --score
+Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N>
@@
-Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N+1> --score --baseline <id>-iter-<N>
+Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N+1> --baseline <id>-iter-<N>
```

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 60-61

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-optimize/SKILL.md` around lines 42-43, the `/eval-run` example invocation includes an unsupported `--score` flag; remove the flag from that command and any other occurrences so the listed Skill invocation matches the eval-run skill interface.
skills/eval-run/scripts/execute.py-45-45 (1)

45-45: ⚠️ Potential issue | 🟠 Major: High: `--disallowed-tools` is silently ignored, enabling policy bypass (CWE-693).

At Line 45 the flag is accepted, but it is never enforced before the Line 71-79 execution. Exploit scenario: an operator believes risky tools are blocked, but the run executes without those restrictions.

Proposed fix:

```diff
@@
     parser.add_argument("--disallowed-tools", default=None)
     args = parser.parse_args()
+
+    if args.disallowed_tools:
+        parser.error(
+            "--disallowed-tools is not enforced by execute.py yet; refusing insecure no-op"
+        )
```

As per coding guidelines, "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)".

Also applies to: 71-79

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/scripts/execute.py` at line 45, the CLI flag `--disallowed-tools` is parsed but never enforced, allowing policy bypass; parse `args.disallowed_tools` into a normalized list (split on comma/whitespace), validate the entries, and before the execution block that runs tools (around lines 71-79), either remove those tools from the tool list or abort with an error if a disallowed tool is requested; update any call sites that select tools to consult `args.disallowed_tools` and enforce the restriction.
agent_eval/agent/base.py-35-45 (1)

35-45: ⚠️ Potential issue | 🟠 Major: Use structured argv instead of raw `args: str` to reduce injection risk (CWE-88).

A raw string argument encourages downstream shell-style composition. If any runner implementation interpolates this into a command, crafted input can alter execution semantics.

Proposed refactor:

```diff
-from typing import Optional
+from typing import Optional, Sequence
@@
-        args: str,
+        args: Sequence[str],
```

As per coding guidelines, "REVIEW PRIORITIES: 1. Security vulnerabilities" and "2. Architectural issues and anti-patterns".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/agent/base.py` around lines 35-45, the `run_skill` signature accepts a raw string `args`, which risks command injection; change the parameter to a structured sequence (e.g., `Sequence[str]`) in `run_skill` and all overrides so callers pass individual argument tokens, update call sites to supply a list instead of a joined string, and ensure runner logic (e.g., subprocess invocations) uses the argv list directly with no `shell=True` or string concatenation, documenting the new signature in `RunResult`-related usage.
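To make the argv recommendation concrete, here is a minimal sketch of passing a structured token list straight to `subprocess.run`. The `run_tokens` helper and the payload below are illustrative only, not part of the reviewed code; the point is that with a list and no `shell=True`, shell metacharacters in arguments stay inert:

```python
import subprocess
import sys

def run_tokens(argv: list[str]) -> subprocess.CompletedProcess:
    """Run a command from a structured argv list. No shell parses the
    arguments, so metacharacters arrive as literal text."""
    return subprocess.run(argv, capture_output=True, text=True)

# A payload full of shell metacharacters is delivered as one literal argument:
result = run_tokens(
    [sys.executable, "-c", "import sys; print(sys.argv[1])",
     "$(rm -rf /tmp/x); && echo pwned"]
)
```

The child process simply prints the payload back; nothing is expanded or executed by a shell.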
skills/eval-setup/scripts/check_env.py-64-74 (1)

64-74: ⚠️ Potential issue | 🟠 Major: Don't report MLflow as OK when it was never validated.

A non-empty `MLFLOW_TRACKING_URI` is accepted without any connectivity check, and the unset branch still appends `("mlflow_server", True, ...)`. That turns typos, dead remote URIs, and the no-server case into a green preflight result, which then lets `/eval-setup` skip actual tracking configuration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-setup/scripts/check_env.py` around lines 64-74, the code treats any non-empty `MLFLOW_TRACKING_URI` as "OK" without validating connectivity; when the variable is present, still call `_check_mlflow_server` (or otherwise validate the URI) to determine `mlflow_ok` and set `mlflow_detail` from the result (the actual URI on success, an error/reachability message on failure); also change the `checks.append` call to use the computed `mlflow_ok` rather than a hardcoded `True`, and include the validation result in the message so typos or unreachable URIs are reported correctly.
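A minimal connectivity probe can back this check. The sketch below is an assumption-laden stand-in for `_check_mlflow_server`: it assumes an HTTP(S) tracking URI and the `/health` endpoint served by `mlflow server`; file-based or other backends would need their own validation:

```python
import urllib.error
import urllib.request

def check_mlflow_server(tracking_uri: str, timeout_s: float = 5.0):
    """Return (ok, detail) after actually probing the tracking server,
    instead of trusting a non-empty MLFLOW_TRACKING_URI."""
    if not tracking_uri.startswith(("http://", "https://")):
        # Non-HTTP backends (file stores, databases) need a different check.
        return False, f"unvalidated scheme in {tracking_uri!r}"
    try:
        url = tracking_uri.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200, f"HTTP {resp.status} from {tracking_uri}"
    except (urllib.error.URLError, OSError) as exc:
        return False, f"unreachable: {exc}"
```

With this in place, `checks.append(("mlflow_server", ok, detail))` reports the real probe result rather than a hardcoded `True`.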
skills/eval-mlflow/SKILL.md-86-99 (1)

86-99: ⚠️ Potential issue | 🟠 Major: Persist and query a stable harness run identifier before touching MLflow.

This flow always starts a fresh MLflow run, searches traces only by experiment, and caps the trace search at 100 results. Reruns will duplicate MLflow runs, and feedback attachment can hit the wrong traces or miss later ones in larger evals. Tag runs and traces with the harness `run_id`/`case_id`, reuse the existing run when present, and paginate/filter the trace query.

Also applies to: 115-123, 139-140

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-mlflow/SKILL.md` around lines 86-99, before calling `mlflow.start_run`, persist and look up the harness run identifier (`run_id` and `case_id`) and use it to find an existing MLflow run (query by a tag on the harness `run_id` and use paginated, filtered trace results rather than a fixed 100-result limit); if a matching run exists, reuse it with `mlflow.start_run(run_id=existing_run_id)`, otherwise start a new run and immediately `mlflow.set_tag("harness_run_id", run_id)`, tagging each trace with `harness_case_id` where appropriate; ensure all `mlflow.log_param`/`mlflow.log_metric`/`mlflow.log_artifact` calls reference that stable run so reruns attach to the correct existing run and traces are discovered via paginated, filtered queries.
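The pagination half of this fix reduces to one generic loop. In the sketch below, `fetch_page` is a hypothetical stand-in for whatever paginated search call the tracking client exposes (a `page_token`-style API is assumed); the loop keeps requesting pages until the backend stops returning a next-page token, instead of capping at 100 results:

```python
def iter_all_traces(fetch_page):
    """Exhaustively iterate a paginated query.

    fetch_page(token) -> (items, next_token); next_token is falsy on the
    last page. Yields every item across all pages.
    """
    token = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if not token:
            break

# Stubbed three-page backend to show the traversal:
pages = {None: (["t1", "t2"], "p2"), "p2": (["t3"], "p3"), "p3": (["t4"], None)}
all_traces = list(iter_all_traces(lambda tok: pages[tok]))
```

The same shape works for any client whose search call returns a continuation token alongside each page.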
skills/eval-setup/SKILL.md-16-18 (1)

16-18: ⚠️ Potential issue | 🟠 Major: Don't convert an invalid setup state into a false "ready".

`check_env.py` does not validate the tracing hook, so the Step 1 shortcut can skip Steps 3-5 while `.claude/settings.json` is still missing. These later snippets also suppress `EvalConfig.from_yaml()` failures and retry without `--config`, which turns a malformed `eval.yaml` into a green no-config result. Fail fast on invalid config, and only short-circuit once tracing is already configured.

Also applies to: 97-107, 114-115

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-setup/SKILL.md` around lines 16-18, `check_env.py` currently short-circuits to "ready" without verifying the tracing hook or a valid config; make the Step 1 fast-path validate the tracing setup (presence and validity of `.claude/settings.json` and the tracing hook) before skipping Steps 3-5, and in the code paths that call `EvalConfig.from_yaml()`, do not catch and ignore parsing errors and silently retry with no `--config`; propagate the error to surface a malformed `eval.yaml` and treat it as a non-ready state.
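The readiness gate described above can be sketched as a small predicate. The `.claude/settings.json` location comes from the comment; the `"hooks"` key is an assumption about the settings schema, flagged as such in the code:

```python
import json
from pathlib import Path

def tracing_configured(project_root) -> bool:
    """Return True only when the tracing hook is actually wired up.

    Assumption: the harness records its tracing hook under a "hooks" key
    in .claude/settings.json; adjust the key to the real schema.
    """
    settings = Path(project_root) / ".claude" / "settings.json"
    if not settings.is_file():
        return False  # missing settings file: setup is not ready
    try:
        data = json.loads(settings.read_text())
    except json.JSONDecodeError:
        return False  # unparseable settings: treat as not configured
    return isinstance(data, dict) and bool(data.get("hooks"))
```

Step 1 would short-circuit only when this returns True, and config errors from `EvalConfig.from_yaml()` would still propagate rather than downgrade to a no-config run.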
skills/eval-mlflow/SKILL.md-76-97 (1)

76-97: ⚠️ Potential issue | 🟠 Major: `regressions_detected` is not available in `summary.yaml`.

`skills/eval-run/scripts/score.py` only merges `run_id`, `judges`, `per_case`, and `pairwise` into the summary; regression detection is printed to stdout and never persisted. This step needs to recompute regressions from `config.thresholds`, or have the scorer write them explicitly, before it can set this tag correctly.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-mlflow/SKILL.md` around lines 76-97, the MLflow logging step cannot set the `regressions_detected` tag because `summary.yaml` lacks that field; update either the scorer or the MLflow step: modify the summary-building function in `skills/eval-run/scripts/score.py` to compute regressions by comparing each judge's metrics against `config.thresholds` and persist a boolean `regressions_detected` into the summary before writing `summary.yaml`, or keep `score.py` as-is and recompute regressions in the MLflow logging block by loading `config.thresholds` and the judges' mean/pass_rate from `summary.yaml` and deriving the same boolean to pass to `mlflow.set_tag("regressions_detected", ...)`. Ensure you reference the same judge metric names used in the summary and `config.thresholds` so the computed tag matches the scorer's logic.
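If the recompute-in-the-logging-step route is taken, the derivation is a few lines. This sketch assumes the summary shape described in the comment (a `judges` mapping with per-judge `mean` values) and thresholds as per-judge floors; both names should be checked against the real `summary.yaml`:

```python
def detect_regressions_from_summary(summary: dict, thresholds) -> bool:
    """Recompute the regression flag that score.py prints but never persists.

    Assumed shapes: summary["judges"][name]["mean"] holds the judge score,
    thresholds maps judge name -> minimum acceptable mean.
    """
    judges = summary.get("judges") or {}
    for judge_name, floor in (thresholds or {}).items():
        mean = (judges.get(judge_name) or {}).get("mean")
        if mean is not None and mean < floor:
            return True  # at least one judge fell below its threshold
    return False
```

The result can then feed `mlflow.set_tag("regressions_detected", ...)` directly, and a missing or empty thresholds config degrades to `False` rather than crashing.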
skills/eval-setup/SKILL.md-78-82 (1)

78-82: ⚠️ Potential issue | 🟠 Major: [CWE-94] Stop embedding shell-expanded values inside `python3 -c`.

`$(pwd)` and `MLFLOW_TRACKING_URI` are injected into single-quoted Python literals here. A quote or newline in either value breaks the command, and a crafted tracking URI can execute arbitrary Python on the operator machine. Read both values from Python instead.

Suggested fix:

````diff
-```bash
-python3 -c "
-from agent_eval.mlflow.experiment import setup_autolog
-setup_autolog('$(pwd)', tracking_uri='${MLFLOW_TRACKING_URI:-http://127.0.0.1:5000}')
-"
-```
+```bash
+python3 - <<'PY'
+import os
+from agent_eval.mlflow.experiment import setup_autolog
+
+setup_autolog(
+    os.getcwd(),
+    tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000"),
+)
+PY
+```
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-setup/SKILL.md` around lines 78-82, the shell snippet injects `$(pwd)` and `MLFLOW_TRACKING_URI` into a single-quoted Python literal, which is unsafe; change the invocation to run a Python script that reads the values inside Python (`os.getcwd()` for the working dir, `os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000")` for the tracking URI) and then calls `setup_autolog` from `agent_eval.mlflow.experiment` with those values; use a here-doc or a small Python file instead of `python3 -c` with shell-expanded values.
skills/eval-run/scripts/score.py-538-556 (1)

538-556: ⚠️ Potential issue | 🟠 Major: Handle missing thresholds before calling `detect_regressions()`.

`cmd_regression()` passes `config.thresholds` straight through. When `eval.yaml` omits `thresholds`, this crashes with `'NoneType' object has no attribute 'items'` instead of returning a clean "no thresholds configured" result.

Suggested fix:

```diff
 def cmd_regression(args):
     config = EvalConfig.from_yaml(args.config)
+    if not config.thresholds:
+        print("No thresholds configured.")
+        return
     summary_path = RUNS_DIR / args.run_id / "summary.yaml"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/scripts/score.py` around lines 538-556, `cmd_regression` passes `config.thresholds` directly to `detect_regressions`, which breaks when thresholds is None; after `config = EvalConfig.from_yaml(...)`, check whether `config.thresholds` is falsy and either set `thresholds = {}` before calling `detect_regressions` or short-circuit with a clear "no thresholds configured" message, so `detect_regressions` is never invoked with None.
agent_eval/agent/claude_code.py-79-111 (1)

79-111: ⚠️ Potential issue | 🟠 Major: Subprocess deadlock risk; timeout not enforced on process lifetime.

This pattern creates two critical issues:

1. Stderr deadlock (CWE-667): stdout is read in a blocking loop while stderr is drained only after the loop completes. If the subprocess fills the stderr pipe buffer (default ~64KB on Linux) before closing stdout, it blocks writing while the parent is blocked reading stdout: a classic deadlock.
2. Timeout ineffective: the hard-coded `wait(timeout=30)` applies only after stdout iteration finishes. If stdout flow stops while the process is hung, the loop blocks indefinitely. The `timeout_s` parameter is never applied to the process itself.

Fix: drain stdout and stderr concurrently using `threading` or `concurrent.futures`, and apply `timeout_s` to `proc.wait()`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/agent/claude_code.py` around lines 79-111, the current subprocess handling reads `proc.stdout` in a blocking loop and only reads `proc.stderr` afterwards, which risks deadlock and never applies the provided `timeout_s` to the process lifetime; change the logic around the `subprocess.Popen(...)` call so stdout and stderr are drained concurrently (spawn threads, use `concurrent.futures`, or use `proc.communicate` with a timeout), call `proc.wait` or `proc.communicate` with the `timeout_s` parameter, and preserve the existing behavior of parsing JSON lines from stdout into `result_obj`.
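A minimal sketch of the concurrent-drain pattern the fix calls for, independent of the runner's JSON-line parsing (the command and helper name are illustrative): two daemon threads read each pipe to EOF so neither buffer can fill while the parent is blocked elsewhere, and `timeout_s` bounds the process itself.

```python
import subprocess
import threading

def run_with_timeout(cmd, timeout_s):
    """Run cmd, draining stdout/stderr concurrently, killing on timeout.

    Returns (returncode, stdout_text, stderr_text).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    captured = {}

    def _drain(stream, key):
        captured[key] = stream.read()  # blocks in its own thread until EOF

    threads = [
        threading.Thread(target=_drain, args=(proc.stdout, "out"), daemon=True),
        threading.Thread(target=_drain, args=(proc.stderr, "err"), daemon=True),
    ]
    for t in threads:
        t.start()
    try:
        proc.wait(timeout=timeout_s)  # timeout bounds the whole process
    except subprocess.TimeoutExpired:
        proc.kill()   # closing the pipes unblocks both drain threads
        proc.wait()
    for t in threads:
        t.join(timeout=5)
    return proc.returncode, captured.get("out", ""), captured.get("err", "")
```

Per-line JSON parsing can then run over the captured stdout text after the process exits, or the stdout drain thread can parse lines as it reads them.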
🟡 Minor comments (4)
agent_eval/state.py-88-88 (1)

88-88: ⚠️ Potential issue | 🟡 Minor: Rename ambiguous loop variable `l`.

Line 88 uses `l`, which is ambiguous and hurts readability. Use `line` (or similar) to avoid confusion and satisfy linting.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `agent_eval/state.py` at line 88, rename the ambiguous comprehension variable `l` to `line` in the expression that builds `ids` from `path.read_text().splitlines()`, changing `[l.strip() for l in ... if l.strip()]` to `[line.strip() for line in ... if line.strip()]` to improve readability and satisfy linters.
README.md-7-7 (1)

7-7: ⚠️ Potential issue | 🟡 Minor: Add language identifiers to fenced blocks (MD040).

These fences are missing language tags, which keeps markdownlint failing. Use explicit tags (`text`, `bash`, `yaml`, `json`) for each block.

Also applies to: 60-60, 68-68, 79-79, 294-294, 303-303, 314-314, 324-324, 330-330

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `README.md` at line 7, the fenced code blocks are missing language identifiers and failing markdownlint (MD040); update each triple-backtick fence at the reported locations to include an explicit language tag (e.g., `text`, `bash`, `yaml`, `json`) appropriate to the block contents so markdownlint passes.
skills/eval-run/SKILL.md-31-31 (1)

31-31: ⚠️ Potential issue | 🟡 Minor: Add language tags to fenced blocks (MD040).

These code fences should declare a language (for example `text`) to satisfy markdownlint.

Also applies to: 141-141

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-run/SKILL.md` at line 31, the unnamed fenced code blocks trigger markdownlint MD040; append an appropriate language tag (e.g., `text` or `bash`) to each opening fence so all code fences declare a language and satisfy the linter.
skills/eval-optimize/SKILL.md-12-12 (1)

12-12: ⚠️ Potential issue | 🟡 Minor: Add language identifiers to fenced blocks (MD040).

Use explicit fence languages (`text` or `bash`) for these blocks to clear markdownlint warnings.

Also applies to: 41-41, 59-59

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `skills/eval-optimize/SKILL.md` at line 12, replace the bare triple-backtick fences with language-tagged fences (e.g., `text` or `bash`) so markdownlint MD040 is satisfied; update the fence at line 12 and the other occurrences around lines 41 and 59 with the appropriate language identifier for each block.
🧹 Nitpick comments (2)
.gitignore (1)
1-9: Consider adding common Python development artifacts.

Standard development tools produce artifacts that should be ignored: `.pytest_cache/`, `.coverage`, `.mypy_cache/`, `.ruff_cache/`, `.tox/`, `htmlcov/`.

📦 Proposed additions

```diff
 # Python
 __pycache__/
 *.py[cod]
 *.egg-info/
 *.egg
 dist/
 build/
 .venv/
 venv/
+.pytest_cache/
+.coverage
+.mypy_cache/
+.ruff_cache/
+.tox/
+htmlcov/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 1 - 9, Update the project .gitignore to include common Python dev artifact entries that are missing: add .pytest_cache/, .coverage, .mypy_cache/, .ruff_cache/, .tox/, and htmlcov/ alongside the existing entries (__pycache__/, *.py[cod], *.egg-info/, *.egg, dist/, build/, .venv/, venv/); ensure these exact patterns are appended so test, coverage, type-checker and formatter caches and test reports are ignored by VCS.agent_eval/agent/__init__.py (1)
10-10: Sort `__all__` to satisfy Ruff RUF022.

Line 10 is currently unsorted and will keep lint noise if Ruff is enforced.

Remediation patch

```diff
-__all__ = ["EvalRunner", "RunResult", "ClaudeCodeRunner", "RUNNERS"]
+__all__ = ["ClaudeCodeRunner", "EvalRunner", "RUNNERS", "RunResult"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/agent/__init__.py` at line 10, The module-level __all__ list is unsorted (currently contains "EvalRunner", "RunResult", "ClaudeCodeRunner", "RUNNERS") which triggers Ruff RUF022; reorder the entries in __all__ to be alphabetically sorted (e.g., "ClaudeCodeRunner", "EvalRunner", "RUNNERS", "RunResult") so the symbol list is deterministic and lint-clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Around line 1-25: Update the .gitignore to explicitly exclude common secret
and credential files so keys and envs never get committed: add patterns for .env
and .env.* files, .secrets, credentials.json, secrets.json,
service_account.json, api_key* and *-key.*, *.pem, *.key, .netrc, .aws/
(including credentials), .gcp/, .azure/, .ssh/* (except known_hosts if desired),
and any local IDE/OS-specific secret files; also ensure mlflow.db and mlruns/
remain ignored and add any project-specific secret filenames used by the Claude
evaluation harness (e.g., claude_api_key.txt or eval_secrets/) so these unique
filenames are covered in the .gitignore.
In `@agent_eval/agent/claude_code.py`:
- Around line 24-30: The constructor currently defaults to forwarding the entire
parent environment and only supports a denylist via _env_strip; change this to
build child process env from a minimal allowlist instead of os.environ.copy():
define a small default ALLOWED_ENV_KEYS (e.g., PATH, HOME, LANG or the explicit
vars the harness needs) and construct the environment by selecting only those
keys from os.environ and then applying any explicit opt-ins; update the places
that currently use _env_strip and any other env-building code (the constructor
field _env_strip and the similar env-construction later in the file) to accept
an allowlist/opt-in mechanism and never copy the full os.environ into child
skill runs.
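A minimal sketch of the allowlist construction this item calls for. The names `DEFAULT_ALLOWED_ENV` and `build_child_env` are illustrative, not the harness's actual API:

```python
import os

# Hypothetical helper: build a child-process environment from an allowlist
# instead of os.environ.copy(), so API keys and tokens never leak into
# skill runs. Keys listed here are an assumed minimal set.
DEFAULT_ALLOWED_ENV = ("PATH", "HOME", "LANG", "TMPDIR")

def build_child_env(allowed=DEFAULT_ALLOWED_ENV, extra=None):
    """Select only allowlisted keys from os.environ, plus explicit opt-ins."""
    env = {k: os.environ[k] for k in allowed if k in os.environ}
    env.update(extra or {})  # explicit opt-ins only
    return env
```

Anything not on the allowlist (or passed explicitly via `extra`) simply never reaches the subprocess, which is the inversion of the current denylist behavior.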
In `@agent_eval/config.py`:
- Around line 98-102: The loop that appends OutputConfig entries from
raw.get("outputs") currently accepts uncontrolled paths (see config.outputs and
OutputConfig construction), enabling path traversal; validate and sanitize each
o.get("path") before appending by (1) rejecting absolute paths and any path
containing parent-directory references like "..", (2) normalizing the path with
os.path.normpath, (3) resolving it against the allowed workspace root (join with
the workspace directory) and ensuring the resulting path stays inside that
workspace using os.path.commonpath, and (4) raise/log an error or skip the
output entry if validation fails so only safe, workspace-contained relative
paths are stored in config.outputs.
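The containment check described above can be sketched as follows; `validate_output_path` is a hypothetical helper, not part of `EvalConfig`'s real API, and it assumes POSIX-style paths:

```python
import os

# Illustrative sketch: reject absolute paths and ".." traversal, then
# confirm the normalized path stays inside the workspace root.
def validate_output_path(raw: str, workspace: str) -> str:
    if os.path.isabs(raw) or ".." in raw.split("/"):
        raise ValueError(f"unsafe output path: {raw}")
    norm = os.path.normpath(os.path.join(workspace, raw))
    ws = os.path.abspath(workspace)
    if os.path.commonpath([ws, os.path.abspath(norm)]) != ws:
        raise ValueError(f"output path escapes workspace: {raw}")
    return norm
```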
In `@skills/eval-run/scripts/collect.py`:
- Around line 51-53: Validate and canonicalize the user-controlled path
fragments before joining: ensure case_id and output_cfg.path (out_path) are not
absolute and do not contain path traversal (..); resolve the joined path (e.g.,
workspace / out_path and workspace / "cases" / case_id) to an absolute path and
verify it is a child of the intended base (workspace.resolve() or
output_root.resolve()) before any exists()/mkdir()/copy operations; reject or
sanitize inputs that start with "/" or contain ".." or produce a resolved path
outside the base, and enforce a whitelist of allowed characters/length for
case_id to prevent tricks — update the logic around variables src_dir, out_path,
and any code that constructs case-specific paths to perform these checks before
proceeding.
In `@skills/eval-run/scripts/score.py`:
- Around line 207-223: Ensure prompt_file and each context path are resolved
under project_root and reject symlinks/escapes: when handling jc.prompt_file
(prompt_path) and each ctx_path (path) call Path.resolve() against
project_root.resolve(), verify the resolved path is inside project_root (use
Path.is_relative_to(project_root.resolve()) or equivalent), reject and raise
ValueError/FileNotFoundError if the resolved path is outside project_root or if
path.is_symlink() is True, and only then read_text(); keep the existing
variables prompt_path, path, jc.prompt_file, jc.context, project_root and kwargs
("name"/"instructions") unchanged.
- Around line 51-63: The loop is vulnerable to path traversal and symlink escape
via config.outputs/ output.path and rglob; before iterating, resolve and
validate artifact_dir against case_dir (use case_dir_resolved =
case_dir.resolve() and artifact_dir_resolved = (case_dir /
out_path).resolve(strict=False)) and skip if artifact_dir_resolved is not inside
case_dir_resolved; then for each discovered file f, skip non-files and any
symlinks (f.is_symlink()) and resolve f (f_resolved = f.resolve(strict=False))
and ensure f_resolved is inside case_dir_resolved (use relative_to or equivalent
check) before reading; only then read_text() or mark as binary, and record using
the path relative to case_dir_resolved.
- Around line 179-186: The _make_inline_check function currently compiles and
execs untrusted jc.check source and must be removed or replaced; instead, change
_make_inline_check to reject inline Python code and accept only trusted
references (e.g., "module:function" strings) from jc.check, validate the format,
whitelist allowed module names, use importlib.import_module to import the module
and getattr to fetch the function (e.g., check_fn = getattr(module, func_name)),
verify the resolved object is callable, and raise a clear error if validation
fails; do not use compile/exec or pass __builtins__ into exec anywhere in
_make_inline_check or related helpers.
In `@skills/eval-run/scripts/workspace.py`:
- Around line 91-99: Reject any absolute or path-traversal symlink entries
before creating links: validate entries from args.symlinks (and resulting
symlink_names) by rejecting strings that are absolute (Path.is_absolute()) or
contain parent segments (".." or normalized path that escapes) and ensure
resolved target path is within project_root (e.g., verify
target.resolve().is_relative_to(project_root) or equivalent check); if
validation fails, skip creating the symlink and emit a clear warning/error; also
validate that the link path will reside inside workspace before calling
link.symlink_to(target.resolve()) to prevent creating links outside allowed
directories.
- Around line 55-58: The code builds a workspace Path with untrusted args.run_id
and blindly rmtree's it (variables: workspace, args.run_id, Path,
shutil.rmtree); to fix, validate/sanitize run_id (allow only a strict safe regex
e.g. alphanumerics, hyphen, underscore) or reject otherwise, derive the
workspace under a fixed canonical base like Path("/tmp/agent-eval").resolve(),
construct the candidate path with base.joinpath(safe_run_id) and call .resolve()
then verify the resolved path is a subpath of the base (reject if not); create
the directory atomically (use tempfile.mkdtemp or Path.mkdir(parents=True,
exist_ok=False) with safe permissions) instead of predicting names, and replace
the unconditional shutil.rmtree with a safe removal that ensures the path is
inside the base (and is not a symlink) before deleting.
---
Major comments:
In `@agent_eval/__main__.py`:
- Around line 116-123: The parsed YAML payload in data (from
yaml.safe_load(data_yaml)) can be scalar/None and leads to AttributeError when
accessing data["skill"]; update the code to validate that data is a dict and
contains a non-empty "skill" string (e.g., check isinstance(data, dict) and
"skill" in data) and fail fast with a clear error if not. For the SKILL.md
lookup, change the skill discovery logic that uses skill_hash/skill_path so it
only computes hashlib.md5 when SKILL.md exists and instead of leaving skill_hash
as an empty string, raise an explicit error if no SKILL.md is found for the
given skill name (referencing variables skill_hash, skill_path, and the skill
value) so invalid eval metadata is never written.
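The fail-fast shape check can be sketched like this; `validate_eval_data` is an illustrative helper mirroring the variables named above, not `__main__.py`'s actual code:

```python
# Hypothetical sketch: validate the parsed --data payload before any
# attribute access, so a scalar/None YAML document fails with a clear
# message instead of an AttributeError.
def validate_eval_data(data):
    if not isinstance(data, dict):
        raise SystemExit("Invalid --data: expected a YAML mapping")
    skill = data.get("skill")
    if not isinstance(skill, str) or not skill:
        raise SystemExit("Invalid --data: missing non-empty 'skill'")
    return skill
```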
In `@agent_eval/agent/base.py`:
- Around line 35-45: The run_skill signature currently accepts a raw string args
which risks command injection; change the parameter to a structured sequence
(e.g., argv: Sequence[str] or List[str]) in the run_skill method and any
overrides/implementations so callers pass individual argument tokens, update all
call sites to supply a list of args instead of a joined string, and ensure any
runner logic (e.g., subprocess invocations) uses the argv list directly (no
shell=True or string concatenation) and documents the new signature in
RunResult-related usage.
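A sketch of the structured-argv signature, under the assumption that the real `EvalRunner`/`RunResult` in `agent_eval/agent/base.py` may differ; the `echo` command stands in for a real runner invocation:

```python
from dataclasses import dataclass
from typing import Sequence
import subprocess

@dataclass
class RunResult:
    exit_code: int
    stdout: str = ""

class EvalRunner:
    def run_skill(self, skill: str, argv: Sequence[str]) -> RunResult:
        """Each token is its own list element: no shell=True, no string
        concatenation, so metacharacters in argv cannot inject commands."""
        proc = subprocess.run(
            ["echo", skill, *argv],  # placeholder command for illustration
            capture_output=True, text=True,
        )
        return RunResult(exit_code=proc.returncode, stdout=proc.stdout)
```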
In `@agent_eval/agent/claude_code.py`:
- Around line 79-111: The current subprocess handling reads proc.stdout in a
blocking loop and only reads proc.stderr after, which risks deadlock and never
applies the provided timeout_s to the process lifetime; change the logic around
the subprocess.Popen(...) call and the subsequent I/O so that stdout and stderr
are drained concurrently (e.g., spawn threads or use concurrent.futures to read
proc.stdout and proc.stderr in parallel or use proc.communicate with a timeout),
ensure proc.wait or proc.communicate is called with the timeout_s parameter, and
preserve existing behavior of parsing JSON lines and setting result_obj from
proc.stdout (references: the subprocess.Popen(...) call, proc.stdout loop,
proc.stderr.read(), proc.wait(timeout=30), and the timeout_s parameter).
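One deadlock-safe shape for this is `Popen.communicate(timeout=...)`, which drains stdout and stderr concurrently and bounds the whole process lifetime. A minimal sketch (`run_with_timeout` is illustrative, not the runner's real method):

```python
import json
import subprocess

def run_with_timeout(argv, timeout_s):
    """Drain both pipes concurrently and enforce timeout_s on the process."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    try:
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, err = proc.communicate()
        return None, err, True
    # Parse JSON lines from stdout, keeping the last object seen.
    result_obj = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("{"):
            try:
                result_obj = json.loads(line)
            except json.JSONDecodeError:
                pass
    return result_obj, err, False
```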
In `@agent_eval/config.py`:
- Around line 124-127: The project_root property currently always returns
Path.cwd(), which is wrong when the config was loaded from a file outside CWD;
change project_root to derive the directory from the loaded config file path
(e.g., return Path(self.config_path).parent if self.config_path or
self._config_path exists), falling back to Path.cwd() when no config path is
available; update the property in config.py (project_root) to use that attribute
so relative paths resolve relative to where eval.yaml actually lives.
- Around line 81-117: The YAML parsing assumes specific shapes and will
AttributeError on malformed values; validate types before iterating or accessing
mapping keys: ensure dataset = raw.get("dataset", {}) is a dict (otherwise raise
a clear ConfigError/ValueError), ensure raw.get("outputs", []) is a list and
each output item is a dict before creating OutputConfig in the outputs loop, and
ensure raw.get("judges", []) is a list and each judge item is a dict before
creating JudgeConfig in the judges loop; add concise error messages that include
the top-level config name (from cls(...)) or path to help debugging and bail
early instead of letting AttributeError propagate.
In `@agent_eval/mlflow/experiment.py`:
- Around line 53-56: The subprocess call that runs the MLflow setup (the line
using result = subprocess.run(cmd, capture_output=True, text=True)) can hang;
add a timeout argument (e.g. timeout=some_seconds) to subprocess.run and wrap
the call in a try/except for subprocess.TimeoutExpired to handle timeouts: on
timeout, print or log an error including the timeout duration and return False,
and keep the existing stderr handling for non-zero return codes so the function
returns True only on successful completion.
In `@agent_eval/state.py`:
- Around line 50-51: The code indexes sys.argv[2] directly in the
command-handling branches (e.g., when cmd == "init" calling Path(sys.argv[2]))
which crashes on malformed invocations; add guards that verify enough positional
args before indexing (for example check len(sys.argv) > 2 or use argparse to
parse required positional arguments) and emit a clear CLI error/usage message
instead of letting IndexError propagate; update every branch that reads
sys.argv[2] (the branches handling "init" and the other commands referencing
sys.argv[2]) to perform the guard and return a controlled error/exit on missing
args.
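The guard pattern can be sketched with a testable dispatcher; `dispatch` is an illustrative stand-in for the CLI in `agent_eval/state.py`:

```python
# Hypothetical sketch: check argv length before indexing, returning a
# usage/error message instead of letting IndexError propagate.
def dispatch(argv):
    if len(argv) < 2:
        return "usage: state <cmd> <path>"
    cmd = argv[1]
    if cmd == "init":
        if len(argv) < 3:  # guard before touching argv[2]
            return "error: 'init' requires a path argument"
        return f"init {argv[2]}"
    return f"unknown command: {cmd}"
```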
In `@skills/eval-analyze/scripts/discover.py`:
- Line 106: The line computing current_hash uses hashlib.md5, which is
vulnerable; replace MD5 with a SHA-256 digest by using hashlib.sha256 on the
file bytes and take an appropriate prefix (or full hex) for freshness checks.
Update the expression that sets current_hash (currently using
hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]) to use
hashlib.sha256(skill_path.read_bytes()).hexdigest() (or a safer-length slice) so
the freshness comparison continues to work but with collision-resistant hashing.
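The swap itself is one line; `skill_hash` is an illustrative wrapper name:

```python
import hashlib

# SHA-256 instead of MD5 for the skill-freshness hash; use the full hex
# digest (or a longer slice than 12 chars if brevity matters).
def skill_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
```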
In `@skills/eval-mlflow/SKILL.md`:
- Around line 86-99: Before calling mlflow.start_run, persist and look up the
harness run identifier (run_id and case_id) and use it to find an existing
MLflow run (querying by tag on the harness run_id and using paginated/filtering
trace results rather than a fixed 100 limit); if a matching run exists reuse it
with mlflow.start_run(run_id=existing_run_id) otherwise start a new run and
immediately mlflow.set_tag('harness_run_id', run_id) and tag each trace with
'harness_case_id' where appropriate; ensure all
mlflow.log_param/mlflow.log_metric/mlflow.log_artifact calls reference that
stable run and remove hard-coded placeholders so reruns attach to the correct
existing run and traces are discovered via paginated/filtered queries.
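The reuse-or-create decision can be sketched decoupled from the MLflow client so it is unit-testable; `find_run_id_by_tag` stands in for a tag-filtered `mlflow.search_runs` lookup, and the names are illustrative:

```python
# Hypothetical sketch of the decision only; the caller would pass the
# result to mlflow.start_run(run_id=...) on reuse, or start a fresh run
# and immediately set_tag('harness_run_id', ...) on create.
def resolve_mlflow_run(harness_run_id, find_run_id_by_tag):
    existing = find_run_id_by_tag("harness_run_id", harness_run_id)
    if existing:
        return existing, False  # reuse the existing run
    return None, True           # create a new run, then tag it
```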
- Around line 76-97: The MLflow logging step cannot set the
'regressions_detected' tag because summary.yaml lacks that field; update either
the scorer or the MLflow step: modify skills/eval-run/scripts/score.py (the
function that builds the summary) to compute regressions by comparing each
judge's metric(s) against config.thresholds and persist a boolean/enum
'regressions_detected' into the summary before writing summary.yaml, or keep
score.py as-is and recompute regressions in SKILL.md's MLflow logging block by
loading config.thresholds and the judges' mean/pass_rate from summary.yaml and
deriving the same boolean to pass to mlflow.set_tag('regressions_detected',
...). Ensure you reference the same judge metric names used in the summary and
config.thresholds so the computed tag matches the scorer's logic.
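The recomputation described above can be sketched as follows; the key names (`judges`, `mean`, and a thresholds mapping of judge name to minimum) are assumptions about the summary schema, not the scorer's confirmed layout:

```python
# Illustrative regression flag: any judge whose mean falls below its
# configured threshold counts as a regression.
def regressions_detected(summary: dict, thresholds: dict) -> bool:
    for judge, minimum in (thresholds or {}).items():
        score = summary.get("judges", {}).get(judge, {}).get("mean")
        if score is not None and score < minimum:
            return True
    return False
```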
In `@skills/eval-optimize/SKILL.md`:
- Around line 42-43: The /eval-run example invocation in SKILL.md includes an
unsupported --score flag (e.g., "Use the Skill tool to invoke /eval-run --config
<config> --model <model> --run-id <id>-iter-<N> --score"); remove the --score
flag from that command and any other occurrences (the other instance noted
around the same block) so the listed Skill invocation matches the eval-run skill
interface, and update the example text to use "/eval-run --config <config>
--model <model> --run-id <id>-iter-<N>" instead.
In `@skills/eval-run/scripts/collect.py`:
- Around line 43-45: The YAML load of case_order (from order_path) can return
None or a non-list which later causes len(case_order) and index access to fail;
update the code in collect.py after the yaml.safe_load call to validate that
case_order is a list (e.g., isinstance(case_order, list)) and raise a clear
ValueError or exit with an explanatory message including order_path when it is
None or not a list so downstream uses of case_order (len(case_order), indexing)
are safe.
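The shape check itself is small; `validate_case_order` is a hypothetical wrapper around the `yaml.safe_load` result in collect.py:

```python
# Illustrative sketch: reject None or non-list payloads with a message
# that names the offending file, before any len()/index access.
def validate_case_order(loaded, order_path="cases/_order.yaml"):
    if not isinstance(loaded, list):
        raise ValueError(
            f"{order_path}: expected a YAML list of case ids, "
            f"got {type(loaded).__name__}")
    return loaded
```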
In `@skills/eval-run/scripts/execute.py`:
- Line 45: The CLI flag "--disallowed-tools" is parsed
(parser.add_argument("--disallowed-tools")) but never enforced, allowing policy
bypass; parse args.disallowed_tools into a normalized list (split
comma/whitespace), validate entries, and before the execution block that runs
tools (the code around the current Line 71-79 execution), check and either
remove those tools from the tool list or abort with an error if a disallowed
tool is requested; update any call sites that invoke tool execution (e.g., the
function or block that uses parsed args to select tools) to consult
args.disallowed_tools and enforce the restriction.
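A sketch of the enforcement step; the function names and the strict/filter split are illustrative, not execute.py's actual variables:

```python
# Hypothetical helpers: normalize the flag value, then either strip
# disallowed tools or abort when one is explicitly requested.
def parse_disallowed(raw: str):
    return {t.strip() for t in raw.replace(",", " ").split() if t.strip()}

def filter_tools(requested, disallowed, strict=False):
    blocked = [t for t in requested if t in disallowed]
    if blocked and strict:
        raise ValueError(f"disallowed tools requested: {blocked}")
    return [t for t in requested if t not in disallowed]
```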
In `@skills/eval-run/scripts/score.py`:
- Around line 538-556: cmd_regression currently passes config.thresholds
directly to detect_regressions which breaks when thresholds is None; update
cmd_regression (after config = EvalConfig.from_yaml(...)) to check if
config.thresholds is falsy and either set thresholds = {} before calling
detect_regressions or short-circuit with a clear message/exit indicating "no
thresholds configured" so detect_regressions is never invoked with None;
reference config.thresholds, cmd_regression, and detect_regressions when making
the change.
In `@skills/eval-run/scripts/workspace.py`:
- Around line 111-123: The loop that reads case files (the for name in
sorted(case_dir.iterdir()) block) can raise YAML/JSON parser exceptions and
abort the run; wrap each file's parsing in a try/except that catches
yaml.YAMLError, json.JSONDecodeError (and a broad Exception fallback), log a
warning including the filename (name) and error, and continue to the next file
instead of letting the exception propagate; only return data when
isinstance(data, dict) and skip non-dict or failed parses.
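The tolerant per-file loop can be sketched like this; `load_case_files` is illustrative and handles only JSON here (workspace.py also loads YAML, which would get the analogous `yaml.YAMLError` branch):

```python
import json

# Hypothetical sketch: a parse failure on one case file logs a warning
# and skips it rather than aborting the whole run; non-dict payloads
# are skipped too.
def load_case_files(files):
    data = {}
    for name, text in files.items():
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError as e:
            print(f"warning: skipping {name}: {e}")
            continue
        if isinstance(parsed, dict):
            data[name] = parsed
    return data
```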
In `@skills/eval-run/SKILL.md`:
- Around line 23-27: The check currently hardcodes "eval.yaml" (seen on the test
command lines around line 26 and 44) which ignores the user-supplied config
path; update the commands to use the parsed config path variable instead of the
literal "eval.yaml" (e.g., replace occurrences of "eval.yaml" with the variable
that holds the supplied config path such as
config_path/CONFIG_PATH/parsedConfigPath in the script or template), ensuring
both the existence check and any subsequent uses reference that variable so
validation and execution use the same file.
In `@skills/eval-setup/scripts/check_env.py`:
- Around line 64-74: The code currently treats any non-empty MLFLOW_TRACKING_URI
as "OK" without validating connectivity; update the logic so that when
MLFLOW_TRACKING_URI is present you still call _check_mlflow_server (or otherwise
validate the URI) to determine mlflow_ok and set mlflow_detail based on the
result (e.g., actual URI on success or an error/reachability message on
failure); also change the checks.append call to use the computed mlflow_ok (not
the hardcoded True) and include the validation result in the message so typos or
unreachable URIs are reported correctly.
In `@skills/eval-setup/SKILL.md`:
- Around line 16-18: check_env.py currently short-circuits to "ready" without
verifying the tracing hook or valid config; update the readiness shortcut so it
only returns success after confirming tracing is configured (presence and
validity of .claude/settings.json or the tracing hook) and do not swallow
EvalConfig.from_yaml() exceptions — let from_yaml raise and fail fast instead of
falling back to retry without --config. Specifically: in check_env.py ensure the
Step 1 fast-path validates tracing setup (the tracing hook check and
.claude/settings.json) before skipping Steps 3–5, and in the code paths that
call EvalConfig.from_yaml() do not catch/ignore parsing errors and retry
silently with no --config; propagate the error to surface malformed eval.yaml
and treat it as a non-ready state.
- Around line 78-82: The current shell snippet injects $(pwd) and
MLFLOW_TRACKING_URI into a single-quoted Python literal which is unsafe; change
the invocation to run a Python script that reads values inside Python (use
os.getcwd() for the working dir and os.environ.get("MLFLOW_TRACKING_URI",
"http://127.0.0.1:5000") for the tracking URI) and then call setup_autolog from
agent_eval.mlflow.experiment with those values (refer to setup_autolog to locate
the call); use a here-doc or a small Python file instead of python3 -c with
shell-expanded values.
---
Minor comments:
In `@agent_eval/state.py`:
- Line 88: Rename the ambiguous list comprehension variable `l` to a clearer
name like `line` in the assignment that builds `ids` (the expression using
path.read_text().splitlines()); update the comprehension from `[l.strip() for l
in ... if l.strip()]` to use `line` so it becomes `[line.strip() for line in ...
if line.strip()]` to improve readability and satisfy linters.
In `@README.md`:
- Line 7: The fenced code blocks in README.md are missing language identifiers
and failing markdownlint (MD040); update each triple-backtick fence at the
reported locations to include an explicit language tag (e.g., ```text, ```bash,
```yaml, ```json) appropriate to the block contents so markdownlint passes —
search for the unnamed fenced code blocks and add the correct language
identifier for each occurrence.
In `@skills/eval-optimize/SKILL.md`:
- Line 12: Replace the bare triple-backtick fenced code blocks in SKILL.md with
language-tagged fences (e.g., change ``` to ```text or ```bash) so markdownlint
MD040 is satisfied; specifically update the fenced blocks at the reported
locations (the ``` fence at line 12 and the other occurrences around lines 41
and 59) to use the appropriate language identifier for each block.
In `@skills/eval-run/SKILL.md`:
- Line 31: The markdown file SKILL.md contains unnamed fenced code blocks that
trigger markdownlint MD040; update each fenced block by appending an appropriate
language tag (e.g., ```text, ```bash, ```javascript) to the opening fence so all
code fences declare a language and satisfy the linter.
---
Nitpick comments:
In @.gitignore:
- Around line 1-9: Update the project .gitignore to include common Python dev
artifact entries that are missing: add .pytest_cache/, .coverage, .mypy_cache/,
.ruff_cache/, .tox/, and htmlcov/ alongside the existing entries (__pycache__/,
*.py[cod], *.egg-info/, *.egg, dist/, build/, .venv/, venv/); ensure these exact
patterns are appended so test, coverage, type-checker and formatter caches and
test reports are ignored by VCS.
In `@agent_eval/agent/__init__.py`:
- Line 10: The module-level __all__ list is unsorted (currently contains
"EvalRunner", "RunResult", "ClaudeCodeRunner", "RUNNERS") which triggers Ruff
RUF022; reorder the entries in __all__ to be alphabetically sorted (e.g.,
"ClaudeCodeRunner", "EvalRunner", "RUNNERS", "RunResult") so the symbol list is
deterministic and lint-clean.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9269d9b3-c24c-4275-90ed-df28d5bd0b14
📒 Files selected for processing (30)
.claude-plugin/plugin.json
.gitignore
CLAUDE.md
README.md
agent_eval/__init__.py
agent_eval/__main__.py
agent_eval/agent/__init__.py
agent_eval/agent/base.py
agent_eval/agent/claude_code.py
agent_eval/config.py
agent_eval/mlflow/__init__.py
agent_eval/mlflow/experiment.py
agent_eval/state.py
eval.yaml
pyproject.toml
skills/eval-analyze/SKILL.md
skills/eval-analyze/prompts/analyze-skill.md
skills/eval-analyze/prompts/generate-eval-md.md
skills/eval-analyze/scripts/discover.py
skills/eval-mlflow/SKILL.md
skills/eval-optimize/SKILL.md
skills/eval-run/SKILL.md
skills/eval-run/prompts/analyze-results.md
skills/eval-run/prompts/comparison-judge.md
skills/eval-run/scripts/collect.py
skills/eval-run/scripts/execute.py
skills/eval-run/scripts/score.py
skills/eval-run/scripts/workspace.py
skills/eval-setup/SKILL.md
skills/eval-setup/scripts/check_env.py
- Add secrets patterns to .gitignore (.env, *.key, *.pem)
- Use env allowlist in ClaudeCodeRunner instead of forwarding entire environment
- Validate relative paths in config.py (reject absolute and .. traversal)
- Validate case_id and output paths in collect.py
- Add _resolve_under() in score.py to prevent symlink/path escapes
- Validate prompt_file and context paths against project root
- Validate run-id format and secure temp path creation in workspace.py
- Validate symlink entries in workspace.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 13
♻️ Duplicate comments (5)
skills/eval-run/scripts/workspace.py (2)
91-99: ⚠️ Potential issue | 🔴 Critical

Critical: unvalidated symlink entries allow workspace/path escape (CWE-22, CWE-73).

Lines 91-99 accept raw `--symlinks` entries and use them directly for both source and link paths. Exploit scenario: entries like `../../outside` or absolute paths can create links outside the workspace or expose unintended host files.

Remediation patch

```diff
+def _validate_relative_entry(name: str) -> str:
+    p = Path(name)
+    if p.is_absolute() or ".." in p.parts:
+        raise ValueError(f"Invalid symlink entry: {name}")
+    return name
@@
     for name in symlink_names:
-        target = project_root / name
-        link = workspace / name
+        try:
+            safe_name = _validate_relative_entry(name)
+        except ValueError as e:
+            print(f"ERROR: {e}", file=sys.stderr)
+            sys.exit(1)
+
+        target = (project_root / safe_name).resolve()
+        link = (workspace / safe_name).resolve()
+        if project_root.resolve() not in target.parents and target != project_root.resolve():
+            print(f"ERROR: target escapes project root: {safe_name}", file=sys.stderr)
+            sys.exit(1)
+        if workspace.resolve() not in link.parents and link != workspace.resolve():
+            print(f"ERROR: link escapes workspace: {safe_name}", file=sys.stderr)
+            sys.exit(1)
         if target.exists():
+            link.parent.mkdir(parents=True, exist_ok=True)
             link.symlink_to(target.resolve())
```

As per coding guidelines, `**/*.py`: "Validate file paths (prevent path traversal)" and `**`: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/workspace.py` around lines 91 - 99, The code creates symlinks from raw args.symlinks without validating paths, allowing path traversal or absolute paths to escape the workspace; update the logic that builds symlink_names and creates target/link (referencing args.symlinks, symlink_names, target, link, project_root, workspace, and link.symlink_to) to validate each entry: reject or normalize entries containing absolute paths or parent-traversal segments (".." or starting with "/"), resolve the candidate target with project_root.resolve() and ensure target.resolve().is_relative_to(project_root) (or catch ValueError from relative_to) and similarly ensure the final link would reside under workspace before calling link.symlink_to; log or raise and skip invalid entries.
55-58: ⚠️ Potential issue | 🔴 Critical

Critical: path traversal + unsafe recursive delete in workspace setup (CWE-22, CWE-377, CWE-59).

Line 55 interpolates untrusted `--run-id` into a filesystem path, and Line 57 recursively deletes it. Exploit scenario: `--run-id ../../target` (or symlink tricks) can delete paths outside the intended temp root.

Remediation patch

```diff
 import argparse
+import re
 import shutil
 import sys
+import tempfile
 from pathlib import Path
@@
-    workspace = Path(f"/tmp/agent-eval/{args.run_id}")
+    if not re.fullmatch(r"[A-Za-z0-9._-]+", args.run_id):
+        print("ERROR: invalid run-id", file=sys.stderr)
+        sys.exit(1)
+
+    base_dir = (Path(tempfile.gettempdir()) / "agent-eval").resolve()
+    workspace = (base_dir / args.run_id).resolve()
+    if base_dir not in workspace.parents:
+        print("ERROR: run-id escapes base dir", file=sys.stderr)
+        sys.exit(1)
+    if workspace.exists():
+        if workspace.is_symlink():
+            print("ERROR: refusing to delete symlink workspace", file=sys.stderr)
+            sys.exit(1)
         shutil.rmtree(workspace)
-    workspace.mkdir(parents=True)
+    workspace.mkdir(parents=True, mode=0o700)
```

As per coding guidelines, `**/*.py`: "Validate file paths (prevent path traversal)" and `**`: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/workspace.py` around lines 55 - 58, The current workspace setup uses args.run_id directly and recursively deletes the resolved path (workspace), allowing path traversal and unsafe deletes; fix by deriving the workspace under a fixed base (e.g., base = Path("/tmp/agent-eval")), sanitize and validate args.run_id (reject path separators, .., empty or absolute values) or better generate a safe unique name (use tempfile or uuid) instead of trusting input, resolve and confirm workspace.resolve().is_relative_to(base.resolve()) before any rmtree, and only call shutil.rmtree(workspace) when that check passes; update references to workspace and args.run_id accordingly and ensure directory creation uses workspace.mkdir(parents=True, exist_ok=False).

skills/eval-run/scripts/score.py (3)
207-223: ⚠️ Potential issue | 🔴 Critical

Critical: judge prompt/context file paths are unconstrained (CWE-22, CWE-73, CWE-200).

Paths from `jc.prompt_file` and `jc.context` are read directly. Exploit scenario: `../../.env` or absolute paths can exfiltrate host secrets to external judge models.

Remediation patch

```diff
+def _resolve_project_file(project_root: Path, raw_path: str) -> Path:
+    p = Path(raw_path)
+    p = p if p.is_absolute() else project_root / p
+    if p.is_symlink():
+        raise ValueError(f"Symlink not allowed: {raw_path}")
+    resolved = p.resolve()
+    root = project_root.resolve()
+    if root not in resolved.parents and resolved != root:
+        raise ValueError(f"Path escapes project root: {raw_path}")
+    return resolved
@@
     if not prompt and jc.prompt_file:
-        prompt_path = Path(jc.prompt_file)
-        if project_root and not prompt_path.is_absolute():
-            prompt_path = project_root / prompt_path
+        prompt_path = _resolve_project_file(project_root or Path.cwd(), jc.prompt_file)
         if not prompt_path.exists():
             raise FileNotFoundError(f"Judge prompt not found: {prompt_path}")
         prompt = prompt_path.read_text()
@@
     for ctx_path in jc.context:
-        path = Path(ctx_path)
-        if project_root and not path.is_absolute():
-            path = project_root / path
+        path = _resolve_project_file(project_root or Path.cwd(), ctx_path)
         if path.exists():
             prompt += f"\n\n## Context: {path.name}\n\n{path.read_text()}"
```

As per coding guidelines, `**/*.py`: "Validate file paths (prevent path traversal)" and `**`: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` around lines 207 - 223, The prompt and context file handling (variables prompt_path, path; logic around jc.prompt_file and jc.context) currently reads arbitrary paths; fix by validating and sandboxing paths before reading: resolve each Path with .resolve() and ensure the resolved path is inside a permitted base (e.g., project_root.resolve() or a specific judges directory) — reject absolute or traversing paths that fall outside that base, raise an error/log and skip or abort; additionally enforce an allowlist of allowed filename patterns or extensions if appropriate and avoid reading dotfiles or known secret files. Use the same checks for prompt_path and each ctx_path and only call read_text() after validation, updating any error messages to reference jc.name.
179-186: ⚠️ Potential issue | 🔴 Critical

Critical: inline judge code execution enables arbitrary code execution (CWE-94).

Lines 179-186 compile and execute config-supplied Python (`jc.check`) with builtins available. Exploit scenario: a malicious eval config can execute OS commands, read secrets, and tamper with scorer outputs.

Remediation patch

```diff
 def _make_inline_check(jc):
-    """Create a scorer from an inline check script."""
-    source = jc.check
-    wrapped = f"def _check(outputs):\n{textwrap.indent(source, '    ')}"
-    code = compile(wrapped, f"<check:{jc.name}>", "exec")
-    ns = {"__builtins__": __builtins__}
-    exec(code, ns)
-    check_fn = ns["_check"]
+    """Load a trusted check callable from module:function reference."""
+    ref = jc.check.strip()
+    if ":" not in ref:
+        raise ValueError(f"Judge '{jc.name}' check must be module:function")
+    module_name, fn_name = ref.split(":", 1)
+    mod = importlib.import_module(module_name)
+    check_fn = getattr(mod, fn_name, None)
+    if not callable(check_fn):
+        raise ValueError(f"Judge '{jc.name}' function not callable: {ref}")
```

As per coding guidelines, `**/*.py`: "No eval() or exec() with untrusted input (CWE-94)" and `**`: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` around lines 179 - 186, The current _make_inline_check compiles and execs untrusted jc.check with full builtins, enabling arbitrary code execution; replace this by disallowing direct exec of jc.check and instead run checks in a hardened sandbox (e.g., execute the check in a separate, isolated process with no inherited secrets, restricted environment, resource/time limits, and no access to host builtins) or validate/transform jc.check into a safe subset via AST whitelisting before compilation; specifically, remove the direct exec(code, ns) flow in _make_inline_check and either (a) invoke a sandboxed runner that receives the wrapped function string and returns results, or (b) replace ns with a minimal, explicit safe API and perform AST checks on jc.check to block dangerous nodes (Import, Exec, Eval, Attribute access to os/sys, etc.) before creating check_fn.
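Option (b) in this prompt, AST whitelisting, can be sketched roughly as follows. The node list and the `validate_check` name are illustrative assumptions, not the harness's real API:

```python
import ast

# Node types a simple boolean/arithmetic check expression may use; anything
# else (imports, attribute access, function calls) is rejected up front.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.BinOp, ast.UnaryOp, ast.Compare,
    ast.IfExp, ast.Constant, ast.Name, ast.Load, ast.Subscript,
    ast.Tuple, ast.List, ast.Dict,
    ast.And, ast.Or, ast.Not, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.In, ast.NotIn,
)

def validate_check(source: str) -> ast.Expression:
    """Parse source as a single expression and reject disallowed syntax."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"Disallowed syntax: {type(node).__name__}")
    return tree

validate_check("outputs['score'] > 0.5 and 'error' not in outputs")  # accepted
try:
    validate_check("__import__('os').system('id')")
except ValueError as e:
    print(e)  # Disallowed syntax: Call
```

Compiling the validated tree still warrants an empty `__builtins__` in the eval globals; the whitelist only bounds syntax, not the evaluation environment.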
51-79: ⚠️ Potential issue | 🔴 Critical

Critical: artifact loading allows path/symlink escape and data exfiltration (CWE-22, CWE-59, CWE-200).

`output.path` is config-controlled, and file traversal uses `rglob` without root containment checks. Exploit scenario: malicious paths/symlinks can pull host files into `summary.yaml` and downstream logging.

Remediation patch

```diff
+def _resolve_under(root: Path, candidate: Path) -> Path:
+    root_resolved = root.resolve()
+    resolved = candidate.resolve()
+    if root_resolved not in resolved.parents and resolved != root_resolved:
+        raise ValueError(f"Path escapes case dir: {candidate}")
+    return resolved
@@
     for output in config.outputs:
         out_path = output.path or "."
-        artifact_dir = case_dir / out_path
+        artifact_dir = _resolve_under(case_dir, case_dir / out_path)
         if not artifact_dir.exists():
             continue
         for f in sorted(artifact_dir.rglob("*")):
-            if not f.is_file():
+            if not f.is_file() or f.is_symlink():
                 continue
-            rel = str(f.relative_to(case_dir))
+            safe_f = _resolve_under(case_dir, f)
+            rel = str(f.relative_to(case_dir))
             try:
-                record["files"][rel] = f.read_text()
+                record["files"][rel] = safe_f.read_text()
```

As per coding guidelines, `**/*.py`: "Validate file paths (prevent path traversal)" and `**`: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` around lines 51 - 79, The artifact loading allows untrusted output.path to escape the intended case_dir via path traversal or symlinks; update the loops that use config.outputs / output.path / artifact_dir and the rglob/iterdir traversal to validate and constrain file access: normalize and reject absolute or parent-traversal paths (resolve output.path against case_dir), compute resolved_artifact_dir = (case_dir / out_path).resolve() and ensure resolved_artifact_dir is within case_dir.resolve() (skip and log if not), and when iterating files check each file's resolved path (f.resolve()) is also within case_dir.resolve() before reading into record["files"] or setting {key}_content/_file; also avoid following symlinks if you intend to disallow them (use lstat or skip f.is_symlink()), and skip any files that fail containment checks to prevent exfiltration.
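The containment-checked walk described here can be tried in isolation. A sketch assuming Python 3.9+ for `is_relative_to`; `safe_files` is a hypothetical helper, not the actual collector code:

```python
import os
import tempfile
from pathlib import Path

def safe_files(case_dir: Path):
    """Yield regular files under case_dir, skipping symlinks and escapees."""
    root = case_dir.resolve()
    for f in sorted(case_dir.rglob("*")):
        if not f.is_file() or f.is_symlink():
            continue
        if not f.resolve().is_relative_to(root):
            continue  # defense in depth against mount/link surprises
        yield f

with tempfile.TemporaryDirectory() as d:
    case = Path(d) / "case-001"
    case.mkdir()
    (case / "out.txt").write_text("ok")
    os.symlink("/no/such/target", case / "leak")  # symlink is skipped
    print([p.name for p in safe_files(case)])  # ['out.txt']
```

The symlink check happens before any `read_text()`, so a planted link never reaches the summary.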
🧹 Nitpick comments (5)
agent_eval/mlflow/experiment.py (2)
27-39: Inconsistent health-check endpoint vs. `check_env.py`.

`ensure_server` probes `/api/2.0/mlflow/experiments/search` while `skills/eval-setup/scripts/check_env.py:140` uses `/health`. The `/health` endpoint is more appropriate for a simple liveness check (lighter, dedicated health endpoint).

```diff
 def ensure_server(port: int = 5000) -> bool:
-    """Check if MLflow server is running, optionally start it.
+    """Check if MLflow server is running.

     Returns:
         True if server is available.
     """
     import urllib.request
     try:
-        urllib.request.urlopen(f"http://127.0.0.1:{port}/api/2.0/mlflow/experiments/search",
+        urllib.request.urlopen(f"http://127.0.0.1:{port}/health",
                                timeout=2)
         return True
     except Exception:
         return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/mlflow/experiment.py` around lines 27 - 39, The ensure_server function is using the heavyweight experiments/search endpoint for health checks; change it to probe the lightweight /health endpoint instead by calling urllib.request.urlopen against f"http://127.0.0.1:{port}/health" (keep the existing timeout and try/except semantics), i.e., update the URL used in ensure_server so the health probe matches the /health check used in check_env.py and still returns True on success and False on exception.
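The probe pattern above, sketched on its own; `mlflow_alive` is a hypothetical name, and the sketch assumes a local MLflow server may or may not be running:

```python
import urllib.request
import urllib.error

def mlflow_alive(port: int = 5000, timeout: float = 2.0) -> bool:
    """Best-effort liveness probe against MLflow's /health endpoint."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(mlflow_alive(port=59999))  # False unless something answers on that port
```

Catching `URLError`/`OSError` rather than bare `Exception` keeps genuine programming errors visible.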
72-74: Silent exception swallowing loses diagnostics.

`log_feedback` catches and discards all exceptions. A trace-level log would aid debugging without disrupting the caller.

```diff
     except Exception:
-        pass
+        pass  # Feedback logging is best-effort; failures are non-fatal
```

Or better, add optional logging:

```python
except Exception as e:
    import logging
    logging.debug("Failed to log feedback for trace %s: %s", trace_id, e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/mlflow/experiment.py` around lines 72 - 74, In log_feedback, don't silently swallow all exceptions; replace the bare except/pass with an except Exception as e that logs the failure including trace_id and the exception details at debug/trace level (e.g., use logging.getLogger(__name__).debug with a message like "Failed to log feedback for trace %s: %s" and include exc_info=True or the exception string) so diagnostics are preserved without changing caller behavior.

agent_eval/__main__.py (1)
19-25: Mutating `sys.argv` is fragile.

Slicing `sys.argv` in-place works but is non-obvious. Consider passing arguments explicitly to subcommand handlers.

```diff
 if command == "config":
-    sys.argv = sys.argv[1:]
-    _config_command()
+    _config_command(sys.argv[2:])
 elif command == "state":
-    sys.argv = sys.argv[1:]
     from .state import main as state_main
-    state_main()
+    state_main(sys.argv[2:])
```

This requires updating `_config_command` and `state.main` to accept args.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/__main__.py` around lines 19 - 25, The current code mutates sys.argv in-place before delegating to subcommands (_config_command and state.main), which is fragile; instead, change the call sites to pass the slice explicitly (e.g., args = sys.argv[1:]) and update _config_command and the state.main function (imported as state_main) to accept an args parameter (e.g., def _config_command(args): and def main(args):) and use that passed-in list for argument parsing; provide a sensible default signature (args=None) if needed for backwards compatibility and update internal parsing to use the provided args rather than reading sys.argv directly.

agent_eval/state.py (1)
85-89: Rename ambiguous loop variable `l` to `line`.

Static analysis flags `l` as ambiguous (easily confused with `1` or `I`).

```diff
 elif cmd == "read-ids":
     path = Path(sys.argv[2])
     if path.exists():
-        ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+        ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
         print(" ".join(ids))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/state.py` around lines 85 - 89, In the "read-ids" branch (the cmd == "read-ids" block) the list comprehension that builds ids uses the ambiguous loop variable named `l`; rename that variable to `line` in the comprehension (i.e., change `[l.strip() for l in path.read_text().splitlines() if l.strip()]` to use `line`) so the variable is clearer and avoid confusion with `1`/`I`; ensure the variable name change is only within the list comprehension that assigns to `ids`.

agent_eval/agent/claude_code.py (1)
124-125: Clarify the hardcoded 30s timeout in `proc.wait()`.

The function accepts `timeout_s` (default 600), but `proc.wait(timeout=30)` uses a hardcoded 30s. This is for draining stderr after stdout completes, not the main execution timeout; consider adding a brief comment to clarify intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/agent/claude_code.py` around lines 124 - 125, The code uses a magic literal proc.wait(timeout=30) while the function accepts timeout_s (default 600); replace the hardcoded 30 with a named drain timeout variable (e.g., drain_timeout = 30) or derive it from timeout_s (e.g., min(30, timeout_s)) and add a brief inline comment next to proc.wait explaining this is a short drain wait to read stderr after stdout completion—not the main process execution timeout governed by timeout_s; reference the proc.wait call and the timeout_s parameter to locate and update the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent_eval/__main__.py`:
- Line 138: The module calls main() unconditionally which breaks importability;
change the unconditional invocation to a guarded entry point by removing the
bare main() call and invoking main() only inside a module guard like if __name__
== "__main__": main() so imports don't execute the main() function when the
module (and symbols like main) are imported for tests or other code.
In `@agent_eval/state.py`:
- Around line 50-55: The code accesses sys.argv[2] without checking length for
the "init" (and similarly "set", "read", "write-ids", "read-ids") commands which
causes IndexError; update the command handling around cmd (the variable) to
validate len(sys.argv) before reading sys.argv[2] (and any other required
positional args), and on missing args call a clear error/usage path (e.g., print
a usage message or raise SystemExit) instead of indexing directly; locate the
blocks using sys.argv[2] and _parse_kwargs and add the bounds checks and
appropriate error handling for each command.
In `@skills/eval-analyze/scripts/discover.py`:
- Line 106: The freshness check uses a weak collision-prone MD5 hash: replace
the MD5 computation at current_hash =
hashlib.md5(skill_path.read_bytes()).hexdigest()[:12] with a secure hash (e.g.,
hashlib.sha256) and avoid truncating the digest (or at minimum use a much longer
slice) so the check becomes current_hash =
hashlib.sha256(skill_path.read_bytes()).hexdigest() (or a sufficiently long
prefix); update any dependent comparisons that expect the 12-char MD5 prefix
accordingly to use the new SHA-256 value (references: current_hash, skill_path
in discover.py).
- Around line 35-36: The bare "except Exception: pass" in discover.py hides I/O
and frontmatter parse errors; replace it with explicit exception handling that
catches specific failures (e.g., OSError/IOError, UnicodeDecodeError, and
yaml.YAMLError) and log a warning that includes the file path and the exception
(use the module's logger or stderr) so discovery continues but failures are
visible; do NOT swallow all Exceptions—only handle expected parse/read errors
and let truly unexpected exceptions propagate.
In `@skills/eval-optimize/SKILL.md`:
- Around line 12-14: The fenced code blocks (e.g., the command example
containing "/eval-optimize [config_file] --model <model> [--max-iterations <N>]"
and the other blocks at the noted locations) are untyped and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language identifier such as text or bash (for example change ``` to ```text or
```bash) so all code fences are explicitly typed (apply the same change to the
blocks around lines 41-43 and 59-61).
In `@skills/eval-run/scripts/execute.py`:
- Line 71: The code dereferences args.settings when computing settings_path
causing an AttributeError because the CLI parser never defines --settings;
either add a proper CLI option (e.g., parser.add_argument('--settings',
help=..., type=str, default=None)) to the argument parser that produces the args
object, or defensively access the attribute (e.g., use getattr(args, 'settings',
None)) before creating settings_path; update the settings_path assignment that
references args.settings accordingly so run_skill() no longer crashes.
- Around line 55-57: The code at the agent resolution (where agent = args.agent
or config.runner and runner_cls = RUNNERS[agent]) can raise an unhandled
KeyError for invalid agent names; update the logic to validate the chosen agent
before indexing RUNNERS: check whether the resolved agent value is in RUNNERS
(or use RUNNERS.get(agent)) and if not raise a clear ValueError (or SystemExit)
that includes the invalid value and a sorted/listed set of valid runner keys
(i.e., reference the agent variable and RUNNERS mapping) so callers see
available options and the program fails closed with an explanatory message.
In `@skills/eval-run/scripts/score.py`:
- Line 489: Remove the unnecessary f-string prefixes on static print calls in
score.py: replace print(f"\n REGRESSIONS: 0") and the similar print(...) at the
other occurrence (line referenced by the review) with plain string prints (e.g.,
print("\n REGRESSIONS: 0")) to fix F541 lint errors; locate these in the same
scope where the "REGRESSIONS" print appears in the scoring logic and update both
print calls accordingly.
- Around line 117-118: The code computes parallelism using parallelism =
min(len(case_dirs), os.cpu_count() or 4) and then creates a ThreadPoolExecutor
which fails when case_dirs is empty; add an explicit check on case_dirs (e.g.,
if not case_dirs) before computing parallelism/instantiating ThreadPoolExecutor
and handle it by returning early or skipping work with an appropriate
message/exit code; update the area around the parallelism and lock variables
(parallelism, case_dirs, lock) and any ThreadPoolExecutor creation to use this
early-exit guard so a zero-length case_dirs never leads to worker-count = 0.
In `@skills/eval-run/scripts/workspace.py`:
- Around line 113-121: The YAML/JSON parsing in workspace.py currently calls
yaml.safe_load(name) and json.load(name) without exception handling, so a
malformed case file can abort the run; wrap the YAML branch (the block using
yaml.safe_load on variable name) in a try/except that catches yaml.YAMLError
(and general Exception/ValueError as fallback), log or skip the file and
continue instead of returning/raising, and likewise wrap the JSON branch (the
block using json.load on variable name) in a try/except that catches
json.JSONDecodeError (and ValueError/Exception fallback), skip the invalid file
and continue; ensure you reference the same control flow around the variable
name and preserve the existing behavior of returning a dict only when parsing
succeeds.
In `@skills/eval-run/SKILL.md`:
- Around line 31-33: The two fenced code blocks in SKILL.md (one containing "Use
the Skill tool to invoke /eval-analyze [--skill <skill>]" and the other at the
block around lines 141-143) lack language identifiers and trigger MD040; add the
appropriate fenced code language specifier (e.g., bash, sh, or text as
appropriate for the snippet) after the opening ``` so each fence becomes ```bash
(or another correct language) to satisfy the linter.
- Around line 25-27: The config-existence check currently hardcodes "eval.yaml"
and ignores the CLI --config argument; update the check to use the parsed config
variable used by the script (the same variable populated from --config, e.g.,
CONFIG or CONFIG_PATH) instead of the literal "eval.yaml", ensure you quote the
variable when testing (to handle spaces) and fall back to "eval.yaml" only if
the parsed config variable is empty so the documented --config <path> flow is
respected.
In `@skills/eval-setup/SKILL.md`:
- Around line 97-108: The current shell snippet silences stderr (2>/dev/null)
which masks errors from EvalConfig.from_yaml and makes the fallback message
misleading; update the snippet to either remove the stderr redirection so
exceptions surface, or explicitly check for file existence and catch errors from
EvalConfig.from_yaml to log the real error before printing the fallback.
Specifically adjust the block that calls EvalConfig.from_yaml and
setup_experiment (referencing EvalConfig.from_yaml and setup_experiment and the
config.mlflow_experiment branch) to surface or log parse/validation exceptions
instead of redirecting stderr to /dev/null.
---
Duplicate comments:
In `@skills/eval-run/scripts/score.py`:
- Around line 207-223: The prompt and context file handling (variables
prompt_path, path; logic around jc.prompt_file and jc.context) currently reads
arbitrary paths; fix by validating and sandboxing paths before reading: resolve
each Path with .resolve() and ensure the resolved path is inside a permitted
base (e.g., project_root.resolve() or a specific judges directory) — reject
absolute or traversing paths that fall outside that base, raise an error/log and
skip or abort; additionally enforce an allowlist of allowed filename patterns or
extensions if appropriate and avoid reading dotfiles or known secret files. Use
the same checks for prompt_path and each ctx_path and only call read_text()
after validation, updating any error messages to reference jc.name.
- Around line 179-186: The current _make_inline_check compiles and execs
untrusted jc.check with full builtins, enabling arbitrary code execution;
replace this by disallowing direct exec of jc.check and instead run checks in a
hardened sandbox (e.g., execute the check in a separate, isolated process with
no inherited secrets, restricted environment, resource/time limits, and no
access to host builtins) or validate/transform jc.check into a safe subset via
AST whitelisting before compilation; specifically, remove the direct exec(code,
ns) flow in _make_inline_check and either (a) invoke a sandboxed runner that
receives the wrapped function string and returns results, or (b) replace ns with
a minimal, explicit safe API and perform AST checks on jc.check to block
dangerous nodes (Import, Exec, Eval, Attribute access to os/sys, etc.) before
creating check_fn.
- Around line 51-79: The artifact loading allows untrusted output.path to escape
the intended case_dir via path traversal or symlinks; update the loops that use
config.outputs / output.path / artifact_dir and the rglob/iterdir traversal to
validate and constrain file access: normalize and reject absolute or
parent-traversal paths (resolve output.path against case_dir), compute
resolved_artifact_dir = (case_dir / out_path).resolve() and ensure
resolved_artifact_dir is within case_dir.resolve() (skip and log if not), and
when iterating files check each file's resolved path (f.resolve()) is also
within case_dir.resolve() before reading into record["files"] or setting
{key}_content/_file; also avoid following symlinks if you intend to disallow
them (use lstat or skip f.is_symlink()), and skip any files that fail
containment checks to prevent exfiltration.
In `@skills/eval-run/scripts/workspace.py`:
- Around line 91-99: The code creates symlinks from raw args.symlinks without
validating paths, allowing path traversal or absolute paths to escape the
workspace; update the logic that builds symlink_names and creates target/link
(referencing args.symlinks, symlink_names, target, link, project_root,
workspace, and link.symlink_to) to validate each entry: reject or normalize
entries containing absolute paths or parent-traversal segments (".." or starting
with "/"), resolve the candidate target with project_root.resolve() and ensure
target.resolve().is_relative_to(project_root) (or catch ValueError from
relative_to) and similarly ensure the final link would reside under workspace
before calling link.symlink_to; log or raise and skip invalid entries.
- Around line 55-58: The current workspace setup uses args.run_id directly and
recursively deletes the resolved path (workspace), allowing path traversal and
unsafe deletes; fix by deriving the workspace under a fixed base (e.g., base =
Path("/tmp/agent-eval")), sanitize and validate args.run_id (reject path
separators, .., empty or absolute values) or better generate a safe unique name
(use tempfile or uuid) instead of trusting input, resolve and confirm
workspace.resolve().is_relative_to(base.resolve()) before any rmtree, and only
call shutil.rmtree(workspace) when that check passes; update references to
workspace and args.run_id accordingly and ensure directory creation uses
workspace.mkdir(parents=True, exist_ok=False).
---
Nitpick comments:
In `@agent_eval/__main__.py`:
- Around line 19-25: The current code mutates sys.argv in-place before
delegating to subcommands (_config_command and state.main), which is fragile;
instead, change the call sites to pass the slice explicitly (e.g., args =
sys.argv[1:]) and update _config_command and the state.main function (imported
as state_main) to accept an args parameter (e.g., def _config_command(args): and
def main(args):) and use that passed-in list for argument parsing; provide a
sensible default signature (args=None) if needed for backwards compatibility and
update internal parsing to use the provided args rather than reading sys.argv
directly.
In `@agent_eval/agent/claude_code.py`:
- Around line 124-125: The code uses a magic literal proc.wait(timeout=30) while
the function accepts timeout_s (default 600); replace the hardcoded 30 with a
named drain timeout variable (e.g., drain_timeout = 30) or derive it from
timeout_s (e.g., min(30, timeout_s)) and add a brief inline comment next to
proc.wait explaining this is a short drain wait to read stderr after stdout
completion—not the main process execution timeout governed by timeout_s;
reference the proc.wait call and the timeout_s parameter to locate and update
the behavior.
In `@agent_eval/mlflow/experiment.py`:
- Around line 27-39: The ensure_server function is using the heavyweight
experiments/search endpoint for health checks; change it to probe the
lightweight /health endpoint instead by calling urllib.request.urlopen against
f"http://127.0.0.1:{port}/health" (keep the existing timeout and try/except
semantics), i.e., update the URL used in ensure_server so the health probe
matches the /health check used in check_env.py and still returns True on success
and False on exception.
- Around line 72-74: In log_feedback, don't silently swallow all exceptions;
replace the bare except/pass with an except Exception as e that logs the failure
including trace_id and the exception details at debug/trace level (e.g., use
logging.getLogger(__name__).debug with a message like "Failed to log feedback
for trace %s: %s" and include exc_info=True or the exception string) so
diagnostics are preserved without changing caller behavior.
In `@agent_eval/state.py`:
- Around line 85-89: In the "read-ids" branch (the cmd == "read-ids" block) the
list comprehension that builds ids uses the ambiguous loop variable named `l`;
rename that variable to `line` in the comprehension (i.e., change `[l.strip()
for l in path.read_text().splitlines() if l.strip()]` to use `line`) so the
variable is clearer and avoid confusion with `1`/`I`; ensure the variable name
change is only within the list comprehension that assigns to `ids`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 1d9d3416-66d7-4ee6-82e5-0823d93777b4
📒 Files selected for processing (30)
- .claude-plugin/plugin.json
- .gitignore
- CLAUDE.md
- README.md
- agent_eval/__init__.py
- agent_eval/__main__.py
- agent_eval/agent/__init__.py
- agent_eval/agent/base.py
- agent_eval/agent/claude_code.py
- agent_eval/config.py
- agent_eval/mlflow/__init__.py
- agent_eval/mlflow/experiment.py
- agent_eval/state.py
- eval.yaml
- pyproject.toml
- skills/eval-analyze/SKILL.md
- skills/eval-analyze/prompts/analyze-skill.md
- skills/eval-analyze/prompts/generate-eval-md.md
- skills/eval-analyze/scripts/discover.py
- skills/eval-mlflow/SKILL.md
- skills/eval-optimize/SKILL.md
- skills/eval-run/SKILL.md
- skills/eval-run/prompts/analyze-results.md
- skills/eval-run/prompts/comparison-judge.md
- skills/eval-run/scripts/collect.py
- skills/eval-run/scripts/execute.py
- skills/eval-run/scripts/score.py
- skills/eval-run/scripts/workspace.py
- skills/eval-setup/SKILL.md
- skills/eval-setup/scripts/check_env.py
✅ Files skipped from review due to trivial changes (13)
- agent_eval/mlflow/__init__.py
- agent_eval/__init__.py
- .gitignore
- .claude-plugin/plugin.json
- skills/eval-run/prompts/comparison-judge.md
- skills/eval-run/prompts/analyze-results.md
- skills/eval-analyze/prompts/generate-eval-md.md
- pyproject.toml
- CLAUDE.md
- eval.yaml
- skills/eval-analyze/prompts/analyze-skill.md
- skills/eval-mlflow/SKILL.md
- agent_eval/config.py
🚧 Files skipped from review as they are similar to previous changes (2)
- agent_eval/agent/base.py
- skills/eval-run/scripts/collect.py
- Guard __main__.py main() with if __name__ check
- Add bounds check for sys.argv in state.py commands
- Log parse failures in discover.py instead of swallowing exceptions
- Replace MD5 with SHA256 for skill hash (discover.py, __main__.py)
- Add language identifiers to fenced code blocks (eval-optimize, eval-run)
- Handle unknown runner with clear error in execute.py
- Remove stale args.settings reference in execute.py
- Handle empty case dirs in score_cases() to prevent ThreadPoolExecutor crash
- Remove extraneous f-string prefix in score.py
- Catch malformed YAML/JSON in workspace.py _read_input()
- Use parsed config path instead of hardcoded eval.yaml in eval-run SKILL.md
- Remove stderr suppression in eval-setup SKILL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 6
♻️ Duplicate comments (1)
skills/eval-run/scripts/score.py (1)
327-327: ⚠️ Potential issue | 🟡 Minor

Empty `case_ids` causes `ThreadPoolExecutor(max_workers=0)` failure.

`compare_runs` lacks the early-return guard present in `score_cases` (line 125). When `case_ids` is empty, `parallelism` becomes 0.

Proposed fix

```diff
 def compare_runs(run_a_dir, run_b_dir, config, case_ids,
                  prompt=None, prompt_file=None, model="claude-sonnet-4-6"):
     """Compare two runs using position-swapped LLM judge."""
+    if not case_ids:
+        return {"run_a": run_a_dir.name, "run_b": run_b_dir.name,
+                "cases_compared": 0, "wins_a": 0, "wins_b": 0,
+                "ties": 0, "errors": 0, "per_case": []}
     comparison_prompt = prompt
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` at line 327, compare_runs computes parallelism as min(len(case_ids), os.cpu_count() or 4) which yields 0 when case_ids is empty and causes ThreadPoolExecutor(max_workers=0) to fail; add the same early-return guard used in score_cases (or otherwise ensure parallelism is at least 1) so compare_runs returns immediately when case_ids is empty (or sets parallelism = max(1, min(len(case_ids), os.cpu_count() or 4))). Locate the comparison in the compare_runs function and implement the empty-case guard or the min-1 clamp accordingly.
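The guard-plus-clamp pattern from this prompt, as a standalone sketch (the `run_parallel` name is illustrative, not the harness's API):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks, worker):
    """Map worker over tasks without ever requesting zero threads."""
    if not tasks:
        # ThreadPoolExecutor(max_workers=0) raises ValueError, so return early.
        return []
    parallelism = max(1, min(len(tasks), os.cpu_count() or 4))
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(worker, tasks))

print(run_parallel([], str))                     # []
print(run_parallel([1, 2, 3], lambda x: x * 2))  # [2, 4, 6]
```

The `max(1, ...)` clamp is belt-and-braces: the early return already covers the empty case, and the clamp keeps any future refactor from reintroducing a zero worker count.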
🧹 Nitpick comments (10)
agent_eval/agent/claude_code.py (1)
178-183: Class attribute should be immutable.
`_SAFE_ENV_KEYS` is a mutable `set`. Use `frozenset` to prevent accidental modification.

Proposed fix

```diff
     # Environment keys safe to forward to evaluated skills
-    _SAFE_ENV_KEYS = {
+    _SAFE_ENV_KEYS = frozenset({
         "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL", "TERM",
         "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION",
         "MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME",
         "CLAUDE_CODE_SUBAGENT_MODEL",
-    }
+    })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/agent/claude_code.py` around lines 178 - 183, The class attribute _SAFE_ENV_KEYS is currently a mutable set; change it to an immutable frozenset to prevent accidental modification by replacing the set literal with a frozenset(...) construction (e.g., _SAFE_ENV_KEYS = frozenset({ "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL", "TERM", "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION", "MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "CLAUDE_CODE_SUBAGENT_MODEL", })) ensuring all existing references to _SAFE_ENV_KEYS continue to work.

skills/eval-run/scripts/collect.py (2)
112-112: Remove redundant `import re`; it is already imported at module level (line 16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/collect.py` at line 112, There is a redundant duplicate "import re" in the module; remove the second occurrence (the extra import statement seen in the diff) so only the single module-level "import re" remains (the original import is already at the top of collect.py), leaving no duplicate imports in the file.
122-124: Document the 50% threshold heuristic for prefix grouping.

When `len(prefixes) >= num_cases * 0.5`, prefix-based grouping activates even if nearly half the files lack a matching prefix. This could silently drop files or produce unexpected groupings. Consider requiring an exact match (`len(prefixes) == num_cases`) or logging when files are excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/collect.py` around lines 122 - 124, The current heuristic in the prefix grouping block uses len(prefixes) >= num_cases * 0.5 which can silently exclude up to half the files; update the logic in collect.py around the prefixes / num_cases check to either require an exact match (use len(prefixes) == num_cases) or, if you want to keep a relaxed threshold, add explicit logging that lists which files were excluded and why before returning (use the existing prefixes dict and sorted(prefixes.keys()) to compute excluded entries), so callers can see when files are dropped and why.

agent_eval/state.py (2)
92-93: Rename ambiguous loop variable `l` to `line`.

Ruff E741 flags single-letter `l` as visually confusable with `1`.

Proposed fix

```diff
-        ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+        ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/state.py` around lines 92 - 93, Rename the ambiguous list-comprehension loop variable `l` to a clearer name like `line` to avoid confusable single-letter identifiers; update the expression to ids = [line.strip() for line in path.read_text().splitlines() if line.strip()] and leave the subsequent print(" ".join(ids)) unchanged so behavior remains identical.
95-97: `clean` removes relative `tmp/` from CWD; document or parameterize.

If the script is invoked from an unexpected working directory, `shutil.rmtree("tmp")` could delete unintended data. Consider accepting an explicit path argument or resolving against a known base.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent_eval/state.py` around lines 95 - 97, The "clean" branch currently calls shutil.rmtree("tmp") which deletes a relative tmp/ from the current working directory (in function/branch where cmd == "clean"), risking removal of unintended files; change it to accept or resolve an explicit path instead: add a parameter or CLI arg for the tmp directory (or compute a safe base like the repository/app root using __file__ or a BASE_DIR constant) and call shutil.rmtree on the resolved absolute path (use os.path.abspath/os.path.join) and validate it is inside the intended base before deleting; reference the cmd == "clean" branch and the shutil.rmtree("tmp", ignore_errors=True) call when making the change.

skills/eval-analyze/scripts/discover.py (1)
32-34: Frontmatter parsing fails on files with multiple `---` separators in body.

`content.split("---")[1]` grabs text between the first two `---` markers, but if the body also contains `---` (e.g., horizontal rules), the split behavior is correct. However, if frontmatter itself is malformed with extra `---`, `yaml.safe_load` will silently parse partial content. Consider using `content.split("---", 2)` and validating `len(parts) >= 3` as done in `check_eval_md`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-analyze/scripts/discover.py` around lines 32 - 34, Frontmatter parsing in discover.py uses content.split("---")[1] which can mis-handle multiple separators; change it to split with a max of 2 (use content.split("---", 2)), verify that parts length is >= 3 (matching the pattern used in check_eval_md), then pass parts[1] to yaml.safe_load and only extract desc = fm.get("description", "")[:80] when parsing succeeds; if validation fails, skip or handle as a non-frontmatter file to avoid silent partial YAML parsing.

skills/eval-run/scripts/workspace.py (1)
134-138: Move `import json` to module level.

Importing inside the loop on each `.json` file iteration is wasteful.

Proposed fix
 import argparse
+import json
 import re
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/workspace.py` around lines 134 - 138, The inline import inside the file-iteration block should be moved to the module level: remove the "import json" from inside the elif branch in workspace.py (the block that checks name.is_file() and name.suffix == ".json") and add a single "import json" at the top of the module; keep the rest of the logic that opens and json.load(f) unchanged (refer to the branch handling name.is_file() / name.suffix == ".json").

skills/eval-run/scripts/score.py (2)
557-557: Remove extraneous `f` prefix on static string (F541).

- print(f"No summary found. Run judges first.", file=sys.stderr)
+ print("No summary found. Run judges first.", file=sys.stderr)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` at line 557, The print call currently uses an unnecessary f-string: replace print(f"No summary found. Run judges first.", file=sys.stderr) with a plain string print("No summary found. Run judges first.", file=sys.stderr) to remove the extraneous `f` prefix (fix F541) in the scoring script.
217-217: Chain exception with `raise ... from err` (B904).

- raise ImportError("mlflow[genai] required for LLM judges")
+ raise ImportError("mlflow[genai] required for LLM judges") from None

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-run/scripts/score.py` at line 217, The ImportError raised with the message "mlflow[genai] required for LLM judges" should preserve the original exception context by chaining it; modify the except block that detects the missing/failed import so that you capture the original exception (e.g., "except Exception as err:" or the specific caught exception) and replace the plain "raise ImportError(\"mlflow[genai] required for LLM judges\")" with "raise ImportError(\"mlflow[genai] required for LLM judges\") from err" so the original error is propagated (locate the raise in score.py where that exact message is used).

skills/eval-optimize/SKILL.md (1)
63-63: Define a deterministic rollback procedure before edits; "revert" is currently non-actionable.

Line 63 requires revert-on-regression, but the steps never require creating a backup/snapshot before editing `SKILL.md`. In iterative loops, this can make regressions irreversible and contaminate subsequent iterations.

Use a mandatory pre-edit checkpoint step (copy original file or patch snapshot) and a concrete rollback action tied to that checkpoint.
As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/eval-optimize/SKILL.md` at line 63, The "f. Check for regressions" step mentions "revert" but lacks a deterministic rollback procedure; before any edits to SKILL.md add an explicit pre-edit checkpoint (e.g., create a timestamped backup copy or create a git branch/patch snapshot) and update the checklist to require creating that checkpoint; then replace the vague "revert" action with a concrete rollback instruction that restores the checkpoint (restore the backup file or checkout the branch/apply the patch) tied to the checkpoint name so regressions can be undone reliably.
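A concrete checkpoint step makes the "revert" action above deterministic. A minimal sketch, assuming nothing about the harness (a temp file stands in for the real `SKILL.md`; all paths are illustrative):

```python
# Minimal checkpoint/rollback sketch for the optimize loop.
# A temp directory stands in for .claude/skills/<skill>/; paths are illustrative.
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
skill = workdir / "SKILL.md"
skill.write_text("original skill body\n")

checkpoint = skill.parent / (skill.name + ".bak")  # checkpoint BEFORE any edit
shutil.copy2(skill, checkpoint)

skill.write_text("edited skill body\n")            # the optimize loop edits the file

# regression detected -> restore the checkpoint deterministically
shutil.copy2(checkpoint, skill)
print(skill.read_text().strip())
```

Recording the checkpoint path alongside each iteration's outputs (e.g., next to `summary.yaml`) would make the rollback target unambiguous across iterations.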
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent_eval/__main__.py`:
- Line 116: yaml.safe_load(data_yaml) can return None which later causes
data.get("frontmatter", data) to raise AttributeError; after calling
yaml.safe_load in the module (the variable `data`), coerce None (or non-dict
results) to a dict before using .get — e.g., replace the direct assignment with
a guarded one such as assigning data = yaml.safe_load(data_yaml) or {} and if
not isinstance(data, dict): data = {'__raw__': data} (or simply {}) so
subsequent calls to data.get("frontmatter", data) are safe.
In `@agent_eval/agent/claude_code.py`:
- Around line 51-53: The timeout_s parameter is declared but never used: replace
the hardcoded 30s timeout in the subprocess invocation with the timeout_s
argument and update the subprocess timeout handling/error message to reference
timeout_s; specifically, locate the subprocess.run call (where timeout=30 is
currently hardcoded) and change it to use timeout=timeout_s, and ensure the
exception logging that currently mentions timeout_s actually uses the same
timeout_s variable so the message matches the runtime behavior.
- Around line 137-141: The current except Exception as e block in claude_code.py
(the handler that builds a RunResult using start/duration and stderr=str(e)) is
too broad; update it to catch only the expected exceptions (e.g.,
subprocess.CalledProcessError, TimeoutError, OSError) and return a RunResult for
those, and ensure KeyboardInterrupt and SystemExit (or BaseException) are
re-raised rather than swallowed—alternatively add a dedicated except
(KeyboardInterrupt, SystemExit): raise before the narrower except so only
intended errors are caught while preserving proper signal/interrupt propagation.
In `@agent_eval/config.py`:
- Around line 140-143: The project_root property currently returns Path.cwd(),
which breaks relative-path resolution when a config is loaded from a file in a
different directory; update the config loader (e.g., EvalConfig.from_yaml) to
capture the YAML file's parent directory into a stored attribute (e.g.,
self.config_dir) and change project_root to return that stored config_dir (the
parent of the loaded YAML path) instead of Path.cwd(); ensure project_root falls
back to Path.cwd() only if no file path was provided.
In `@skills/eval-optimize/SKILL.md`:
- Around line 45-53: Validate and constrain any dynamic identifiers before using
them in filesystem paths: enforce an allowlist regex (e.g., ^[A-Za-z0-9._-]+$)
for config.skill and run-id, reject values containing path separators or "..",
canonicalize with realpath/resolution, and verify the resolved paths remain
under the expected root directories (e.g., `.claude/skills/` for config.skill
and `eval/runs/` for run-id) before performing any read/write (references:
config.skill, run-id, `.claude/skills/<skill>/SKILL.md`,
`eval/runs/<id>-iter-<N>/summary.yaml`, stdout.log).
In `@skills/eval-run/scripts/score.py`:
- Around line 191-203: The function _make_inline_check currently compiles and
execs jc.check with full __builtins__, enabling arbitrary code execution; remove
the direct exec-based approach and either (A) disallow inline checks by raising
an error when jc.check is present (so callers must supply safe, pre-defined
check functions), or (B) run the inline check inside a sandboxed subprocess:
spawn a separate Python process (not exec in-process) that receives the outputs
via stdin (e.g., JSON), executes the provided jc.check string in a restricted
environment with no __builtins__/empty env, resource limits, and no
network/filesystem access, and returns the result via stdout; update
_make_inline_check to create a scorer that communicates with that subprocess
instead of calling check_fn directly, and remove passing __builtins__ into exec
to avoid in-process evaluation.
---
Duplicate comments:
In `@skills/eval-run/scripts/score.py`:
- Line 327: compare_runs computes parallelism as min(len(case_ids),
os.cpu_count() or 4) which yields 0 when case_ids is empty and causes
ThreadPoolExecutor(max_workers=0) to fail; add the same early-return guard used
in score_cases (or otherwise ensure parallelism is at least 1) so compare_runs
returns immediately when case_ids is empty (or sets parallelism = max(1,
min(len(case_ids), os.cpu_count() or 4))). Locate the comparison in the
compare_runs function and implement the empty-case guard or the min-1 clamp
accordingly.
---
Nitpick comments:
In `@agent_eval/agent/claude_code.py`:
- Around line 178-183: The class attribute _SAFE_ENV_KEYS is currently a mutable
set; change it to an immutable frozenset to prevent accidental modification by
replacing the set literal with a frozenset(...) construction (e.g.,
_SAFE_ENV_KEYS = frozenset({ "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL",
"TERM", "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION",
"MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "CLAUDE_CODE_SUBAGENT_MODEL",
})) ensuring all existing references to _SAFE_ENV_KEYS continue to work.
In `@agent_eval/state.py`:
- Around line 92-93: Rename the ambiguous list-comprehension loop variable `l`
to a clearer name like `line` to avoid confusable single-letter identifiers;
update the expression to ids = [line.strip() for line in
path.read_text().splitlines() if line.strip()] and leave the subsequent print("
".join(ids)) unchanged so behavior remains identical.
- Around line 95-97: The "clean" branch currently calls shutil.rmtree("tmp")
which deletes a relative tmp/ from the current working directory (in
function/branch where cmd == "clean"), risking removal of unintended files;
change it to accept or resolve an explicit path instead: add a parameter or CLI
arg for the tmp directory (or compute a safe base like the repository/app root
using __file__ or a BASE_DIR constant) and call shutil.rmtree on the resolved
absolute path (use os.path.abspath/os.path.join) and validate it is inside the
intended base before deleting; reference the cmd == "clean" branch and the
shutil.rmtree("tmp", ignore_errors=True) call when making the change.
In `@skills/eval-analyze/scripts/discover.py`:
- Around line 32-34: Frontmatter parsing in discover.py uses
content.split("---")[1] which can mis-handle multiple separators; change it to
split with a max of 2 (use content.split("---", 2)), verify that parts length is
>= 3 (matching the pattern used in check_eval_md), then pass parts[1] to
yaml.safe_load and only extract desc = fm.get("description", "")[:80] when
parsing succeeds; if validation fails, skip or handle as a non-frontmatter file
to avoid silent partial YAML parsing.
In `@skills/eval-optimize/SKILL.md`:
- Line 63: The "f. Check for regressions" step mentions "revert" but lacks a
deterministic rollback procedure; before any edits to SKILL.md add an explicit
pre-edit checkpoint (e.g., create a timestamped backup copy or create a git
branch/patch snapshot) and update the checklist to require creating that
checkpoint; then replace the vague "revert" action with a concrete rollback
instruction that restores the checkpoint (restore the backup file or checkout
the branch/apply the patch) tied to the checkpoint name so regressions can be
undone reliably.
In `@skills/eval-run/scripts/collect.py`:
- Line 112: There is a redundant duplicate "import re" in the module; remove the
second occurrence (the extra import statement seen in the diff) so only the
single module-level "import re" remains (the original import is already at the
top of collect.py), leaving no duplicate imports in the file.
- Around line 122-124: The current heuristic in the prefix grouping block uses
len(prefixes) >= num_cases * 0.5 which can silently exclude up to half the
files; update the logic in collect.py around the prefixes / num_cases check to
either require an exact match (use len(prefixes) == num_cases) or, if you want
to keep a relaxed threshold, add explicit logging that lists which files were
excluded and why before returning (use the existing prefixes dict and
sorted(prefixes.keys()) to compute excluded entries), so callers can see when
files are dropped and why.
In `@skills/eval-run/scripts/score.py`:
- Line 557: The print call currently uses an unnecessary f-string: replace
print(f"No summary found. Run judges first.", file=sys.stderr) with a plain
string print("No summary found. Run judges first.", file=sys.stderr) to remove
the extraneous `f` prefix (fix F541) in the scoring script.
- Line 217: The ImportError raised with the message "mlflow[genai] required for
LLM judges" should preserve the original exception context by chaining it;
modify the except block that detects the missing/failed import so that you
capture the original exception (e.g., "except Exception as err:" or the specific
caught exception) and replace the plain "raise ImportError(\"mlflow[genai]
required for LLM judges\")" with "raise ImportError(\"mlflow[genai] required for
LLM judges\") from err" so the original error is propagated (locate the raise in
score.py where that exact message is used).
In `@skills/eval-run/scripts/workspace.py`:
- Around line 134-138: The inline import inside the file-iteration block should
be moved to the module level: remove the "import json" from inside the elif
branch in workspace.py (the block that checks name.is_file() and name.suffix ==
".json") and add a single "import json" at the top of the module; keep the rest
of the logic that opens and json.load(f) unchanged (refer to the branch handling
name.is_file() / name.suffix == ".json").
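The sandboxed-subprocess option (B) suggested above for `_make_inline_check` could be sketched roughly as follows. The `run_inline_check` helper and the JSON payload shape are invented for illustration, not the harness's real API, and real hardening (resource limits, filesystem/network isolation) is deliberately omitted:

```python
# Evaluate an untrusted inline check in a separate interpreter instead of
# an in-process exec(). Helper name and payload shape are illustrative.
import json
import subprocess
import sys

RUNNER = """
import json, sys
payload = json.load(sys.stdin)
scope = {"__builtins__": {}}          # the check sees no builtins
exec(payload["check_src"], scope)     # defines check(outputs)
print(json.dumps(scope["check"](payload["outputs"])))
"""

def run_inline_check(check_src, outputs, timeout_s=5.0):
    proc = subprocess.run(
        [sys.executable, "-I", "-c", RUNNER],  # -I: isolated interpreter
        input=json.dumps({"check_src": check_src, "outputs": outputs}),
        capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"inline check failed: {proc.stderr.strip()}")
    return json.loads(proc.stdout)

result = run_inline_check(
    "def check(outputs):\n    return outputs['answer'] == 42",
    {"answer": 42},
)
print(result)  # True
```

Because the check runs in its own process, a hung check hits the timeout instead of blocking the scorer; full isolation would still need OS-level sandboxing.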
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0b3fcd89-3363-4b9f-80ec-21470da4a1db
📒 Files selected for processing (13)
- .gitignore
- agent_eval/__main__.py
- agent_eval/agent/claude_code.py
- agent_eval/config.py
- agent_eval/state.py
- skills/eval-analyze/scripts/discover.py
- skills/eval-optimize/SKILL.md
- skills/eval-run/SKILL.md
- skills/eval-run/scripts/collect.py
- skills/eval-run/scripts/execute.py
- skills/eval-run/scripts/score.py
- skills/eval-run/scripts/workspace.py
- skills/eval-setup/SKILL.md
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (2)
- skills/eval-run/scripts/execute.py
- skills/eval-setup/SKILL.md
/approve
Generic evaluation framework for agents and skills: analyze, run, score, and improve skills across agent harnesses (Claude Code, OpenCode, ...) and models.
Features
- Pluggable runners (`EvalRunner` ABC). Ships with Claude Code; extensible to OpenCode, Agent SDK, or any CLI-based agent. Generic permissions (allow/deny) plus runner-specific options.
- Optional MLflow integration via the `/eval-mlflow` skill. Sync datasets, log metrics, attach judge feedback to traces. The eval pipeline works without MLflow.
- Automated optimization: `/eval-optimize` reads failures, edits the skill, re-runs, and checks for regressions in a loop.

Skills

- `/eval-setup`
- `/eval-analyze`
- `/eval-run`
- `/eval-mlflow`
- `/eval-optimize`
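As a sketch of the runner extensibility claimed above: the class names `EvalRunner` and `RunResult` come from this repo, but every field and method signature below is assumed for illustration, with `echo` standing in for a real agent CLI:

```python
# Hedged sketch of plugging a new CLI agent into the runner abstraction.
# All fields/signatures are assumed; `echo` stands in for a real agent binary.
import subprocess
import time
from dataclasses import dataclass

@dataclass
class RunResult:                 # stand-in for agent_eval's RunResult
    stdout: str
    stderr: str
    exit_code: int
    duration_s: float

class MyCliRunner:               # would subclass agent_eval's EvalRunner ABC
    def __init__(self, binary="echo"):
        self.binary = binary

    def run(self, prompt, timeout_s=60.0):
        start = time.monotonic()
        proc = subprocess.run(
            [self.binary, prompt],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return RunResult(proc.stdout, proc.stderr, proc.returncode,
                         time.monotonic() - start)

result = MyCliRunner().run("hello")
print(result.exit_code, result.stdout.strip())  # 0 hello
```

A real runner would also map the generic allow/deny permissions onto harness-specific flags.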