benchmarks/arteval_bench/README.md

Using WASABI's [agent evaluator](data/benchmark/sosp24_wasabi/wasabi/_agent_eval

1. An `_agent_eval/` package which contains all benchmark-specific code and does *not* modify your original artifact logic.

2. One oracle module per stage. In this benchmark, each stage is typically implemented as a **derived oracle class** that overrides `requirements()` and returns an ordered list of programmatic checks (requirements). The base oracle handles running requirements, producing a structured report, printing a PASS/FAIL summary, and returning `True`/`False` from `run(verbose=...)`.

A typical `_agent_eval/` layout looks like:

```text
_agent_eval/
├── main.py
├── oracle_env_setup.py
├── oracle_build_install.py
├── oracle_prep_benchmark.py
├── oracle_run_experiments.py
└── refs/
    ├── datasets.ref.json
    └── results.ref.json
```

The `refs/` directory stores machine-checkable ground truth (e.g., dataset manifests/checksums, expected metric tables, or summaries of deterministic outputs) used by benchmark-prep and experiment-runs checks.
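
For intuition, a benchmark-prep check built on such ground truth might boil down to verifying a dataset manifest, as in the sketch below. This is illustration only: it assumes `datasets.ref.json` maps relative file paths to SHA-256 digests, while the shipped checks use the `evaluator` primitives and may use a different schema.

```python
# Sketch only: the real checks use the `evaluator` primitives, and the exact
# schema of the .ref.json files may differ from what is assumed here.
import hashlib
import json
from pathlib import Path

def verify_datasets(repo_root: Path, ref_path: Path) -> bool:
    """Pass if every file in the manifest exists and matches its digest."""
    manifest = json.loads(ref_path.read_text())  # {relative path: sha256 hex digest}
    for rel_path, expected_sha256 in manifest.items():
        target = repo_root / rel_path
        if not target.is_file():
            return False
        if hashlib.sha256(target.read_bytes()).hexdigest() != expected_sha256:
            return False
    return True
```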

Here is a simplified environment setup oracle (one dependency/version requirement):

```python
# _agent_eval/oracle_env_setup.py
import sys
from collections.abc import Sequence

from evaluator.oracle_env_setup_primitives import (
    DependencyVersionRequirement,
    OracleEnvSetupBase,
    VersionCompare,
)

class OracleEnvSetup(OracleEnvSetupBase):
    def __init__(self, *, config, logger):
        super().__init__(logger=logger)
        self._config = config

    def requirements(self) -> Sequence[DependencyVersionRequirement]:
        # Require Python >= 3.10.0 from the interpreter running the oracle.
        return (
            DependencyVersionRequirement(
                name="python_version",
                cmd=(sys.executable, "--version"),
                required_version=(3, 10, 0),
                compare=VersionCompare.GEQ,
                timeout_seconds=5.0,
            ),
        )
```
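
Conceptually, the base oracle's `run()` iterates over these requirements and turns them into a single pass/fail verdict. The sketch below is illustrative only, assuming each requirement exposes `name`, `cmd`, and `timeout_seconds` as above; the actual base classes live in the `evaluator` package and also handle version parsing, comparison modes, and structured reporting, which are omitted here.

```python
# Illustrative sketch of a base oracle's run() loop; not the shipped implementation.
import subprocess

class SketchOracleBase:
    def __init__(self, *, logger):
        self._logger = logger

    def requirements(self):
        raise NotImplementedError  # subclasses return an ordered tuple of checks

    def run(self, *, verbose: bool = False) -> bool:
        passed = True
        for req in self.requirements():
            try:
                # Non-interactive and time-bounded: no stdin, hard timeout per command.
                proc = subprocess.run(
                    list(req.cmd),
                    capture_output=True,
                    text=True,
                    stdin=subprocess.DEVNULL,
                    timeout=req.timeout_seconds,
                )
                ok = proc.returncode == 0  # version comparison omitted in this sketch
            except (OSError, subprocess.TimeoutExpired) as exc:
                self._logger.error("%s raised %s", req.name, exc)
                ok = False
            if verbose:
                self._logger.info("[%s] %s", req.name, "PASS" if ok else "FAIL")
            passed = passed and ok
        return passed
```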

Also, note that each oracle should be:
- Non-interactive, meaning not expecting input or prompt interactions.
- Idempotent, meaning safe to run multiple times without side-effects.
- Time-bounded, meaning every command has a timeout.
- Binary, meaning it returns pass/fail (as `True`/`False`) for the stage.

For more details, check out this [how-to guide](src/evaluator/HOWTO.md).

3. A single `main.py` orchestrator, the entrypoint used by ArtEvalBench, which constructs an `EntryConfig`, invokes the four oracles in order, and returns an overall score (an integer between 0 and 4):

```python
# _agent_eval/main.py
import os
import sys
from pathlib import Path

from evaluator.utils import EntryConfig, LoggerConfig, get_logger, record_result

from oracle_env_setup import OracleEnvSetup
from oracle_build_install import OracleBuildInstall
from oracle_prep_benchmark import OraclePrepBenchmark
from oracle_run_experiments import OracleRunExperiments

CONFIG = EntryConfig(
    name="my-artifact",
    home_dir=Path.home() / "artevalbench" / "my-artifact",
    repository_paths={
        "my-artifact": Path.home() / "artevalbench" / "my-artifact" / "repo",
    },
    results_paths={
        "results": Path.home() / "artevalbench" / "my-artifact" / "repo" / "outputs" / "results.json",
    },
    # Machine-checkable ground truth used by the prep and experiment oracles.
    ground_truth_paths={
        "datasets": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "datasets.ref.json",
        "results": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "results.ref.json",
    },
    similarity_ratio=0.75,
)

def main(argv: list[str]) -> int:
    verbose = "--verbose" in argv
    logger = get_logger(
        LoggerConfig(root_name=os.environ.get("EVAL_LOGGER_NAME", "ARTEVAL-EVAL"))
    )

    results: dict[str, int] = {}
    score = 0

    # Each oracle's run() returns True/False; record_result stores the outcome
    # and contributes 1 to the score on a pass, 0 otherwise.
    score += record_result(
        results, "env_setup",
        OracleEnvSetup(config=CONFIG, logger=logger).run(verbose=verbose),
    )
    score += record_result(
        results, "build_install",
        OracleBuildInstall(config=CONFIG, logger=logger).run(verbose=verbose),
    )
    score += record_result(
        results, "prep_benchmark",
        OraclePrepBenchmark(config=CONFIG, logger=logger).run(verbose=verbose),
    )
    score += record_result(
        results, "run_experiments",
        OracleRunExperiments(config=CONFIG, logger=logger).run(verbose=verbose),
    )

    logger.info("Stage scores: %s", results)
    logger.info("FINAL_SCORE %d/4", score)
    return score

if __name__ == "__main__":
    # Forward CLI arguments so --verbose is honored when run directly.
    raise SystemExit(main(sys.argv[1:]))
```

Note that the `ArtEvalBench` framework will invoke `main.py` to run the oracles in order, compute the agent's score for this particular artifact, and store it in a JSON file that aggregates these outcomes for the entire benchmark.
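
For context, the harness side of this contract can be pictured roughly as follows. This is a sketch with illustrative names, paths, and timeout; the real framework's invocation and output schema may differ.

```python
# Sketch: invoke each artifact's _agent_eval/main.py and aggregate the scores.
# File names, locations, and the timeout are illustrative assumptions.
import json
import subprocess
import sys
from pathlib import Path

def evaluate_all(artifact_dirs: list[Path], out_file: Path) -> None:
    summary: dict[str, int] = {}
    for artifact in artifact_dirs:
        entry = artifact / "_agent_eval" / "main.py"
        proc = subprocess.run(
            [sys.executable, str(entry), "--verbose"],
            cwd=artifact,
            timeout=3600,
        )
        # main.py exits via SystemExit(score), so the return code is the 0-4 score.
        summary[artifact.name] = proc.returncode
    out_file.write_text(json.dumps(summary, indent=2))
```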

## Benchmark Setup

To run the benchmark:

1. Execute the `run.sh` script with your model:

   ```sh
   ./run.sh <model_name>
   # Example: ./run.sh claude-sonnet-4-5-20250929
   ```

2. Configure your LLM endpoint in `env.toml`:
   * For Azure/OpenAI models: Set `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION` (a sketch of loading these keys follows this list)

3. Results will be saved to `outputs/` with timestamp and model information
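
As an illustration only, the presence of these keys can be sanity-checked with Python's built-in `tomllib` (available from Python 3.11). The sketch assumes the keys sit at the top level of `env.toml`; the actual file may use a different layout.

```python
# Sketch: load env.toml and verify the Azure/OpenAI keys named above are set.
# Assumes the keys live at the top level of env.toml; adjust if they are nested.
import tomllib
from pathlib import Path

REQUIRED_AZURE_KEYS = ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION")

def check_env_config(path: Path = Path("env.toml")) -> bool:
    with path.open("rb") as f:
        config = tomllib.load(f)
    missing = [key for key in REQUIRED_AZURE_KEYS if not config.get(key)]
    if missing:
        print(f"env.toml is missing: {', '.join(missing)}")
        return False
    return True
```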


#### » Supported Agents

The benchmark supports multiple AI agents:
# Agent Evaluator Primitives

This utility provides building blocks for four validation oracles that check whether an AI agent can evaluate artifacts end-to-end: set up the environment, build or install the main code modules, download and prepare datasets/benchmarks, and run the experiments. Each oracle corresponds to one canonical stage of the artifact evaluation (AE) process and defines simple, objective checks that can be verified programmatically. The oracles are idempotent (i.e., safe to run multiple times), non-interactive (i.e., no prompts or manual steps), and return a binary result (i.e., "pass" or "fail").

The four canonical stages of the AE process that these oracles validate are as follows:

1. Environment setup: check required tools/dependencies exist and meet version constraints; confirm key environment variables and required files/directories are present.
2. Artifact build: run build/install commands and fail if they do not complete successfully.
3. Benchmark preparation: check datasets/benchmarks/tools are present and usable; optionally run quick commands and check for expected output signatures.
4. Experiment runs: compare observed to reference values using similarity or elementwise checks within customizable tolerance thresholds (see the sketch after this list).
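
For example, the elementwise variant of the stage-4 check can be pictured as the following sketch; the tolerance names and defaults here are assumptions, not the primitives' actual API.

```python
# Sketch of an elementwise check; the shipped primitives expose richer options.
from collections.abc import Sequence

def elementwise_close(observed: Sequence[float], reference: Sequence[float],
                      abs_tol: float = 1e-6, rel_tol: float = 0.05) -> bool:
    """Pass if every observed value is within tolerance of its reference value."""
    if len(observed) != len(reference):
        return False
    return all(
        abs(o - r) <= max(abs_tol, rel_tol * abs(r))
        for o, r in zip(observed, reference)
    )
```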

Each artifact includes self-contained oracles in a `_agent_eval/` directory. These modules extend the base primitives described above to create specialized oracles that assert success criteria at each AE stage.

## Implementing agent evaluators
