(Served as both AGENTS.md and CLAUDE.md — the latter is a symlink.)
You are working in cube-harness, the runtime that executes agents against CUBE
benchmarks and records trajectories. This file is your map; it is deliberately short.
Read the relevant spec in openspec/specs/ before modifying any layer.
cube-harness runs experiments. It consumes the contracts defined by cube-standard
(Task, Benchmark, Tool, Resource) and adds: agents, episode loops,
trajectory storage, OTel tracing, parallel execution (Ray), the XRay viewer, and
MCP server bridges.
It does NOT define the task/benchmark/tool protocol — that's cube-standard. If you're
tempted to change base class signatures (Task.step, Benchmark.setup, etc.), you're
in the wrong repo; go to cube-standard and start with an openspec change proposal.
src/cube_harness/
├── core.py # AgentOutput, Trajectory, TrajectoryStep, ActionSpace
├── agent.py # AgentConfig, Agent (abstract)
├── llm.py # LLM, LLMConfig, Prompt, LLMCall, Usage (LiteLLM wrapper)
├── episode.py # Episode, EpisodeConfig, MAX_STEPS
├── experiment.py # Experiment, ExpResult
├── exp_runner.py # run_sequentially, run_with_ray
├── storage.py # Storage Protocol, FileStorage (V2 + V1 fallback)
├── summary.py # SummaryProcessor, ExperimentSummary
├── episode_logs.py # Per-episode stdout/stderr redirection
├── utils.py # parse_actions, HTML pruning, misc
├── results.py # Higher-level result types
├── agents/
│ ├── react.py # ReAct agent (primary)
│ ├── genny.py # Genny agent (context-aware, rolling summaries)
│ └── legacy_generic_agent.py # Deprecated XML-tag agent — see DEPRECATED.md
├── action_spaces/ # Protocol definitions for action sets
├── benchmarks/ # Legacy in-tree benchmarks (miniwob, workarena) — most now live in cubes/
├── metrics/tracer.py # OpenTelemetry tracer, Ray env-var propagation
├── analyze/
│ ├── investigator/ # Per-trajectory blame; use_cases/{general_blame, profiling, agent_scaffolding, hinter, fix_audit}
│ ├── xray.py # Gradio-based XRay viewer
│ ├── inspect_results.py # CLI-ish inspection helpers
│ └── xray_utils.py
├── auto_cube/ # Auto-CUBE outer-loop methodology; use_cases/<name>/ each with SKILL.md (loaded by /auto-cube-<name>)
└── mcp/ # Serve harness tools AS an MCP server
    ├── server.py
    └── convert.py
cubes/ # External benchmark packages (arithmetic, osworld, swebench-*, terminalbench, webarena-verified, workarena, miniwob)
recipes/ # Example experiment scripts
tests/ # pytest suite
Each spec is the authoritative contract for its layer.
| Layer | Module | Spec |
|---|---|---|
| Core types (Trajectory, AgentOutput) | cube_harness.core | core/spec.md |
| Agent | cube_harness.agent | agent/spec.md |
| LLM wrapper | cube_harness.llm | llm/spec.md |
| Episode | cube_harness.episode | episode/spec.md |
| Experiment + runners | cube_harness.experiment, cube_harness.exp_runner | experiment/spec.md |
| Storage | cube_harness.storage, cube_harness.summary | storage/spec.md |
| Metrics / OTel | cube_harness.metrics | metrics/spec.md |
| XRay viewer | cube_harness.analyze | analyze/spec.md |
| MCP server | cube_harness.mcp | mcp/spec.md |
External contracts (cube-standard): Any field typed as cube.task.Task,
cube.benchmark.Benchmark, cube.tool.Tool, cube.core.*, or cube.resource.*
is governed by cube-standard's specs. Don't subclass those here — consume them.
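As a rough illustration of "consume, don't subclass", harness-side behaviour can wrap an object that satisfies the contract rather than inheriting from it. Every name below (TaskLike, EchoTask, StepRecorder) is hypothetical, and the one-argument step signature is an assumption made for the sketch, not cube-standard's actual API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class TaskLike(Protocol):
    """Hypothetical stand-in for a cube-standard task contract."""

    def step(self, action: str) -> str: ...


@dataclass
class EchoTask:
    """Toy task used only to exercise the wrapper below."""

    def step(self, action: str) -> str:
        return f"observed:{action}"


@dataclass
class StepRecorder:
    """Harness-side helper that wraps a task instead of subclassing it."""

    task: TaskLike
    actions: list[str] = field(default_factory=list)

    def step(self, action: str) -> str:
        # Harness behaviour (recording) is added around the call;
        # the task's own step() signature is left untouched.
        self.actions.append(action)
        return self.task.step(action)


recorder = StepRecorder(task=EchoTask())
observation = recorder.step("click")
```

Because the wrapper only calls the contract, a signature change upstream surfaces as a type error here instead of a silent override.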
- Read the spec first. Before touching any layer, read its spec in openspec/specs/. Specs are the authoritative design intent, but they can be stale or wrong; flag discrepancies rather than silently working around them.
- Fix in the right place. A quick local experiment to understand a problem is fine. But the committed fix must address the root cause in the correct layer, not a workaround scoped to a single call site or context.
- Understand before fixing. Many bad fixes come from acting too fast. Make sure you understand the broader design before proposing a change. A fix that misses the bigger picture is worse than no fix.
- Lean diffs. Make the minimal change that solves the problem. Avoid verbose additions, unnecessary abstractions, and duplicated logic that already exists elsewhere. If existing code can be reused or consolidated, do it. A hard-to-review diff is a liability.
- Think long-term. Every change should age well. Ask whether today's shortcut becomes tomorrow's debt — and whether the design could evolve cleanly if requirements change.
CUBE spans several repos, so a local view rarely tells the whole story. Build the wider picture before planning a change or making a call:
- Trace real usage, not just the definition: Grep call sites, subclasses, and tests across the repo (incl. cubes/*, recipes/*).
- Read the spec and the code together: the spec is intent (can be stale); the code is what runs.
- Mind the repo boundary: cube-harness consumes cube-standard's cube.* contracts, so their signature changes belong upstream; core changes (core.py / agent.py / llm.py) ripple into every cube and recipe.
- Fan out with subagents (Explore, general-purpose) for broad searches, keeping the conclusion without burning context.
Default branch is dev — base all PRs off it, not main.
Sign your commits. Every commit needs a Signed-off-by line (git commit -s). DCO is enforced by CI — unsigned commits will be blocked.
PRs are reviewed with /code-review (plugin docs), which audits changes against these guidelines. Write PRs as if a reviewer will check each principle above against the diff.
Auto-fix provenance. Auto-CUBE-produced fixes carry # auto-fix(N)↓ … # /auto-fix(N)
markers + a one-line machine-readable footnote at module bottom (N = PR number
for L0/L1, design-debt issue number for L2/L3). When a diff touches an auto-fix
region or its footnote, treat it as a possibly-rotten marker (review rule AF-001).
Methodology (Fix Report, L0–L3 tiers, rot lint):
openspec/specs/auto-fix/spec.md. Human
entry point for running the loop:
src/cube_harness/auto_cube/README.md
(use-cases live at src/cube_harness/auto_cube/use_cases/<name>/; the
default is debug, invoked as /auto-cube or /auto-cube-debug).
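To make the marker convention concrete, here is a minimal sketch that finds paired auto-fix markers and reports which fix numbers a set of changed lines touches. It assumes the markers appear literally as `# auto-fix(N)↓` and `# /auto-fix(N)`; the function names are hypothetical, the module-bottom footnote is not modelled, and this is not the rot lint described in the spec.

```python
import re

OPEN_RE = re.compile(r"#\s*auto-fix\((\d+)\)↓")
CLOSE_RE = re.compile(r"#\s*/auto-fix\((\d+)\)")


def auto_fix_regions(source: str) -> list[tuple[int, int, int]]:
    """Return (N, first_line, last_line) for each closed marker pair (1-indexed)."""
    open_at: dict[int, int] = {}
    regions: list[tuple[int, int, int]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Check the opener first so a one-line region (open and close
        # on the same line) is still recorded.
        if (opened := OPEN_RE.search(line)) is not None:
            open_at[int(opened.group(1))] = lineno
        if (closed := CLOSE_RE.search(line)) is not None:
            number = int(closed.group(1))
            if number in open_at:
                regions.append((number, open_at.pop(number), lineno))
    return regions


def touched_fix_numbers(source: str, changed_lines: set[int]) -> list[int]:
    """Fix numbers whose marked region contains at least one changed line."""
    return sorted(
        {
            number
            for number, first, last in auto_fix_regions(source)
            if any(first <= line <= last for line in changed_lines)
        }
    )


SAMPLE = """\
def parse(raw: str) -> int:
    # auto-fix(12)↓
    raw = raw.strip()
    # /auto-fix(12)
    return int(raw)
"""

regions = auto_fix_regions(SAMPLE)
```

A reviewer-side check could feed the changed line numbers of a diff into touched_fix_numbers and treat any non-empty result as a prompt to re-examine the marker.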
- Find the relevant spec — which layer? Start there.
- Check "Invariants" and "Gotchas" — these are the traps.
- Check openspec/changes/: someone may already be proposing your change.
- For breaking or multi-invariant contract changes, open openspec/changes/<name>/ (proposal.md + deltas.md, ADDED / MODIFIED / REMOVED) before coding; additive changes just edit the spec. Keep proposals concise; see openspec/README.md. Archive to openspec/changes/archive/YYYY-MM-DD-<name>/ when done.
- Constitution alignment: every change is reviewed against the constitution and review rules.
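The archive naming in the checklist above can be sketched as a small path helper. The function name and the example change name are hypothetical; only the openspec/changes/archive/YYYY-MM-DD-<name>/ layout comes from this file.

```python
from datetime import date
from pathlib import PurePosixPath

ARCHIVE_ROOT = PurePosixPath("openspec/changes/archive")


def archive_dir(change_name: str, archived_on: date) -> PurePosixPath:
    """Build the archive directory for a finished change proposal."""
    # date.isoformat() yields YYYY-MM-DD, matching the documented prefix.
    return ARCHIVE_ROOT / f"{archived_on.isoformat()}-{change_name}"


example = archive_dir("example-change", date(2025, 1, 31))
```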
- Python is the configuration: no YAML/Hydra. AgentConfig, LLMConfig, Experiment are all Pydantic TypedBaseModel.
- LiteLLM is the only LLM gateway: never import openai, anthropic, etc. directly.
- Module-level imports only: no function-scoped imports (EX-001).
- Type hints required everywhere, including tests (CC-001).
- Serialization boundary: Workers receive TaskConfig + EpisodeConfig (pickled). Live Task, Tool, Benchmark, Agent objects never cross process boundaries.
- Trajectory steps alternate env → agent → env → agent in persistence order.
- Trace-first: every new long-running operation should get a tracer.span().
- CLIs use Typer: new scripts/recipes that need a CLI should use typer.run(main) with typer.Option-annotated args (FastAPI-style: type hints + docstring become --help). scripts/experiments_report.py is the canonical example. Don't add new argparse boilerplate.
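Two of the invariants above lend themselves to a small sketch: only plain configuration data crosses the process boundary, and persisted steps alternate env, agent, env, agent. ToyEpisodeConfig and the helper names are hypothetical stand-ins, not the real TaskConfig / EpisodeConfig or Trajectory types, and the sketch pickles a plain dict so it does not depend on how the module is loaded.

```python
import pickle
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyEpisodeConfig:
    """Hypothetical plain-data config standing in for TaskConfig / EpisodeConfig."""

    task_id: str
    max_steps: int


def to_worker_payload(config: ToyEpisodeConfig) -> bytes:
    """Parent side: only plain configuration data is pickled, never live objects."""
    return pickle.dumps(asdict(config))


def from_worker_payload(payload: bytes) -> ToyEpisodeConfig:
    """Worker side: rebuild the config, then construct live objects locally."""
    return ToyEpisodeConfig(**pickle.loads(payload))


def alternates_env_agent(sources: list[str]) -> bool:
    """True when the step sources read env, agent, env, agent, ... from the start."""
    expected = ("env", "agent")
    return all(source == expected[i % 2] for i, source in enumerate(sources))


config = ToyEpisodeConfig(task_id="demo-task", max_steps=5)
restored = from_worker_payload(to_worker_payload(config))
```

A check like alternates_env_agent is cheap enough to run in a unit test against any code path that appends steps.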
make install # uv sync --all-extras
make test # full pytest
make debug # small end-to-end run
make xray # open the trajectory viewer
make report # markdown table of experiments in ~/cube_harness_results/ (forward args with ARGS="--last 10")
make lint # uvx ruff check --fix && uvx ruff format (auto-fixes in place)
make lint-check # uvx ruff check --diff && uvx ruff format --diff (read-only, what CI runs)
make review PR=<n> # check out a PR and wire up any cross-repo cube-standard dependency
uv run recipes/hello_miniwob.py # example run

| Type | When | Where |
|---|---|---|
| Unit (pytest tests/) | every iteration | tests/: fast, no external deps. What CI runs by default (-m "not slow and not live_api"). |
| slow (pytest -m slow) | when touching the marked area | tests/ with @pytest.mark.slow. Excluded from CI by default: Ray-based timing tests are flaky on shared GitHub Actions runners; reliable locally. |
| integration (pytest -m integration) | when touching the marked area | tests/ with @pytest.mark.integration. Setup details (Playwright install, etc.) live in the marker's docstring in pyproject.toml. |
| live_api (pytest -m live_api) | when touching the marked area | tests/ with @pytest.mark.live_api. Hits a real LLM provider; costs money; auto-skips without ANTHROPIC_API_KEY; never runs in CI. |
| Smoke (scripts/smoke/*.py) | when a PR touches plumbing unit tests can't reach | Standalone scripts a coding agent runs to verify end-to-end behavior. Never CI. May stand up real infrastructure or call external APIs; minutes-long runs are fine. Each prints SMOKE OK/FAIL/SKIP: <name> (exit 0/1/2). Discover with find . -path '*/scripts/smoke/*.py'. |
Smokes are the coding agent's judgment call — for a PR that touches a marked area, pick the relevant smokes, adapt the environment (auth, credentials, profiles), and iterate until green. Reflex: when adding complex new code, drop a smoke alongside it; a green end-to-end run is the strongest signal the change actually works as intended.
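The reporting convention for smokes (one SMOKE OK/FAIL/SKIP: <name> line, exit code 0/1/2) can be sketched as below. SmokeStatus, SmokeSkip, and run_smoke are hypothetical names for illustration; existing smoke scripts may be structured differently.

```python
from collections.abc import Callable
from enum import Enum


class SmokeStatus(Enum):
    """Outcome names and their exit codes, as listed in the table above."""

    OK = 0
    FAIL = 1
    SKIP = 2


class SmokeSkip(Exception):
    """Raised by a check when its prerequisites (credentials, infra) are missing."""


def run_smoke(name: str, check: Callable[[], None]) -> int:
    """Run one check, print the status line, and return the exit code."""
    try:
        check()
        status = SmokeStatus.OK
    except SmokeSkip:
        status = SmokeStatus.SKIP
    except Exception:
        # Any other error counts as a failed smoke.
        status = SmokeStatus.FAIL
    print(f"SMOKE {status.name}: {name}")
    return status.value
```

A script built on this would typically end with `raise SystemExit(run_smoke("<name>", check))` so the process exit code matches the printed status.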
Always run make lint before finishing a task. ruff check and ruff format are
separate passes — running only one is not enough for CI.
Environment vars go in .env (loaded by pyproject). OPENAI_API_KEY
is the only required one for the baseline recipes; see individual cubes for others.
Ray launch: use .venv/bin/python recipe.py or uv run --active recipe.py — never bare
uv run when VIRTUAL_ENV is set. uv will silently create an ephemeral env whose .pth files
can point to deleted paths, causing ImportError on Ray workers. exp_runner.py warns when it
detects this.
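One way such a mismatch could be detected is to compare the running interpreter's prefix with VIRTUAL_ENV. This is a sketch under that assumption, with a hypothetical function name; it is not the check exp_runner.py actually performs.

```python
import os
import sys
from pathlib import Path
from typing import Optional


def interpreter_matches_virtual_env(prefix: str, virtual_env: Optional[str]) -> bool:
    """True when VIRTUAL_ENV is unset, or the interpreter prefix is that environment."""
    if not virtual_env:
        return True
    # Inside an activated venv, sys.prefix points at the venv directory.
    return Path(prefix).resolve() == Path(virtual_env).resolve()


if not interpreter_matches_virtual_env(sys.prefix, os.environ.get("VIRTUAL_ENV")):
    print(
        "warning: interpreter is not the active VIRTUAL_ENV; "
        "Ray workers may fail to import packages",
        file=sys.stderr,
    )
```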
When a PR depends on an unreleased cube-standard branch, do not commit
path = "..." local sources to any pyproject.toml (root or under cubes/).
The pre-commit hook (.githooks/pre-commit) will block this — local paths break
for anyone with a different folder structure.
Authoring a cross-repo PR:
- Keep pyproject.toml pointing at PyPI (or a git ref); do not commit the local path.
- Add a line starting with Depends-on: cube-standard/<branch-name> to the PR description body. The line must start at column 0 (no leading whitespace, not inside a list or quote block).
Reviewing a cross-repo PR:
make review PR=<n>
This checks out the PR branch, reads Depends-on: from the PR description, clones
cube-standard into the repo root (gitignored), checks out the correct branch, and
installs all workspace packages with uv pip install -e cube-standard --all-packages --all-extras.
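The column-0 rule for the Depends-on: line is easiest to see as a pattern. The sketch below is an assumed reading of that rule with a hypothetical helper name, not the parser that make review uses.

```python
import re
from typing import Optional

# ^ with re.MULTILINE anchors at the start of each line, so an indented,
# bulleted ("- "), or quoted ("> ") Depends-on line does not match.
DEPENDS_ON_RE = re.compile(
    r"^Depends-on:[ \t]*cube-standard/(?P<branch>\S+)", re.MULTILINE
)


def parse_depends_on(pr_body: str) -> Optional[str]:
    """Return the cube-standard branch named in the PR body, if any."""
    match = DEPENDS_ON_RE.search(pr_body)
    return match.group("branch") if match else None


body = "Fixes storage fallback.\n\nDepends-on: cube-standard/feature-x\n"
branch = parse_depends_on(body)
```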
- cube-standard: protocol and base classes. Never subclass cube-standard ABCs here without first updating cube-standard's openspec if needed.
- cube-registry: public metadata registry; the cube registry add command submits entries.
- cubes/*: individual benchmark packages. Each has its own debug.py, which the cube test <name> command runs. Changes to a cube are usually local to its directory.
src/cube_harness/analyze/investigator/use_cases/<name>/ is the investigator recipe
catalog. Each subdirectory is one use case:
- general_blame: default. Closed-world blame attribution per episode.
- profiling: narrower taxonomy aimed at scaffold-level inefficiency.
- agent_scaffolding: deep loop-pathology diagnosis.
- hinter: extract task_hints[task_id] candidates from failed episodes.
Each use case has a recipe.py (Pydantic InvestigatorRecipe) and a SKILL.md
(skill description for the Auto-CUBE orchestrator that dispatches it).
scripts/sync_investigator_skills.py symlinks SKILL.md files into
.claude/skills/investigator-<name> so Claude Code picks them up.
Per-batch synthesis (meta_analysis.json + .md) is mirrored into
<journal-dir>/<experiment>/ (default ~/auto_cube/; Auto-CUBE points it at
the per-session ~/auto_cube/<session-id>/journal/) for cross-iteration
narrative — the only artefact the Investigator writes outside the experiment dir.
src/cube_harness/auto_cube/use_cases/<name>/ is the outer-loop
use-case catalog. Each subdirectory holds a SKILL.md (loaded as the
Auto-CUBE agent's system prompt) and an optional investigator_extra.md
(biasing fragment appended to per-episode Investigator prompts via
InvestigationConfig.extra_prompt_fragment / ch-investigate --extra-prompt).
- debug: default. Curious-scientist methodology, sparse coverage across task × infra × tool × model × agent-config, ships Fix Report PRs. Invoked as /auto-cube (alias) or /auto-cube-debug.
- hinter: raises benchmark performance by adding knowledge at the right regularization level (low-reg task-hint cheat → promoted task clarification / benchmark prompt / action description / new action / system prompt). Invoked as /auto-cube-hinter.
scripts/sync_auto_cube_skills.py symlinks each SKILL.md into
.claude/skills/auto-cube-<name>/ and creates the auto-cube → debug
alias. To create a new use-case, invoke /new-auto-cube-use-case.