This directory now has four different eval tiers. They should not be treated as interchangeable.
Files:
evals/tasks/uuid/evals/tasks/httprouter/evals/tasks/semver/evals/harness/
Purpose:
- verify the compare harness
- catch obvious regressions between tags
- confirm Codex treatment actually uses
yoyo
These are not the product-truth benchmark for yoyo. They are too local, too deterministic, and too easy for plain Codex to solve with direct file edits.
Files:
evals/tasks/*.jsonevals/run.pyevals/run_semantic.pyevals/write_run.pyevals/REPORT.md
Purpose:
- validate structural tool correctness
- validate semantic search recall
- validate write-tool safety at the unit-task level
These answer "is the tool correct?" They do not answer "does this make an engineering agent better?"
Files:
evals/tasks/directed_tool_use.jsonevals/tasks/directed_tool_use_first3.jsonevals/tasks/directed_tool_use_write_batch.json
Purpose:
- evaluate
yoyoas an MCP tool suite, not as a stand-alone coding agent - measure whether explicit engineering instructions make
yoyouseful on real codebase work - reward correct use of
boot,index,inspect,search,impact, andchangeunder direction
This should be the primary product benchmark.
yoyo is not itself an autonomous coding agent. It is a collection of structural repo tools exposed over MCP. The main eval question is therefore not:
- "Can treatment autonomously solve the task with no coaching?"
It is:
- "When the model is given realistic engineering instructions, does
yoyohelp it do the work better?"
Directed tool-use tasks should be multi-turn and command-based. Typical phases:
- locate the likely symbols and files
- explain the dependency chain or blast radius
- make the minimal correct edit
- verify with the right test or build command
The evaluator is allowed to give neutral but concrete commands between phases. This is not contamination. It is the product being used as intended.
The directed suite should explicitly cover three task modes:
- read-only
- write-only
- read-then-write
The third should be the primary one, but all three matter.
It should also explicitly include principal-engineer-level questions drawn from the codebase, not just local implementation questions.
directed_tool_use_first3.json is the first mixed-mode pilot set.
directed_tool_use_write_batch.json is the first concrete write-only batch. It currently covers:
ripgrep-global-gitignoreuuidhttproutersemver
Current WIP directed results:
directed-ripgrep-read-only-2026-03-13.mdrecords one clean read-only ripgrep run where Codex stayed onyoyofor22/22tool calls.directed-ripgrep-write-only-2026-03-13.mdrecords the first clean single-task write-only ripgrep result.directed-write-batch-2026-03-13.mdrecords the first clean multi-task write-only batch:ripgrep,uuid, andhttprouter
directed-semver-write-only-2026-03-13.mdkeepssemverseparate:- the patch is correct and passes manual
cargo test - the fixture's exact verify command,
cargo test --test *, is malformed and still needs fixing before strict scoring
- the patch is correct and passes manual
Files:
evals/tasks/realistic_daily_engineering.jsonevals/tasks/realistic_daily_engineering_first5.json
Purpose:
- evaluate
yoyoon work that looks like real codebase engineering - reward repo understanding, ambiguity handling, and safe multi-file change planning
- measure the kinds of tasks where
boot,index,inspect,search,impact, andchangeshould matter
This is still useful, but it should be treated as an integration benchmark, not the primary product benchmark.
realistic_daily_engineering.json defines the task shapes.
realistic_daily_engineering_first5.json is the first concrete batch selected from real merged PRs in public repositories. It is intentionally source-focused first: repo, source PR, pre-fix SHA, reference merge SHA, natural prompt, and verification plan. The remaining work is fixtureizing those tasks so the compare harness can run them reproducibly.
Puncture tasks are useful, but they overweight local bug repair:
- the shortest path is usually "read failing file, patch, rerun tests"
- they do not force structural navigation
- they do not reflect normal engineering ambiguity
- they under-measure impact tracing and safe multi-file edits
- they make native agent tools look stronger than they really are on large codebases
Use them as smoke tests. Do not use them as the main decision-maker for roadmap priority.
Autonomous issue-first runs are still worth keeping, but they are not the right primary test for yoyo.
Why:
yoyois not a planner or agent policy layer- the model may underuse or misuse the tools without explicit instruction
- failures in autonomy can come from prompt policy or tool selection, not tool value
- this over-attributes responsibility to
yoyofor behavior owned by the model/runtime
So the benchmark stack should distinguish:
- tool value under direction
- autonomous integration behavior
The first is the product truth. The second is an ecosystem compatibility check.
Every directed task should satisfy most of these:
- requires at least 3 meaningful repo hops before the correct edit is obvious
- can be split into locate, explain, change, and verify phases
- benefits from structural repo understanding more than raw grep
- includes at least one plausible wrong abstraction or wrong caller path
- has a natural engineering objective, not a synthetic "use this tool now" script
- can be scored on both outcome and intermediate reasoning quality
- forbids access to hidden oracle material through local git history (
git log,git show,git blame, cross-commit diffs, or direct inspection of the merged fix)
If the agent inspects the hidden upstream fix through local history, the run is contaminated and should be discarded.
Purpose:
- test whether
yoyohelps the model find the right symbols, files, and dependency chains - measure structural understanding without giving credit for a lucky patch
Examples:
- "Find the 3 files most likely involved in this bug."
- "Explain what breaks if we change this symbol."
- "Is this helper safe to delete?"
- "Which layer should own this fix and why?"
- "What invariant must remain true if we change this subsystem?"
Primary scoring:
- locate quality
- explanation quality
- blast-radius accuracy
- irrelevant-file mentions
- elapsed time and tool churn
Purpose:
- test whether
yoyohelps land a correct and minimal patch once the target surface is already known - isolate edit quality from exploration quality
Examples:
- "Update this signature and its directly affected callers."
- "Patch this known function with the minimal change."
- "Rename this known symbol safely."
Write-only tasks should force the model to cross into editing once the surface is already known. The command wording should be explicit about using change, keeping the patch minimal, and limiting any extra confirming reads before the first edit.
Write-only tasks should not begin with ownership or blast-radius questions. Those belong in read_only or read_then_write. If the task starts by rediscovering the fix surface, it is no longer isolating write quality.
For puncture-backed write tasks, scope quality should be measured relative to the post-setup baseline. The injected puncture test file is already dirty before the agent starts, so raw final git status overcounts the agent's write surface.
Primary scoring:
- completion
- diff quality
- scope quality
- wrong edits
- hallucinated APIs
Purpose:
- test the full workflow an engineer actually cares about
- require the model to investigate first and then edit based on that understanding
Examples:
- "Find the likely implementation area, explain the dependency chain, then make the minimal patch."
- "Determine whether the failure is in the implementation or the test, then fix the right side."
- "Trace the blast radius, update the right surfaces, and run the narrowest sufficient verification."
- "Decide the correct ownership layer, explain the invariants and regression risks, then patch only that layer."
Primary scoring:
- read quality
- write quality
- transition quality between analysis and patch
- completion
- elapsed time
- retries
- scope and diff quality
Every realistic task should satisfy most of these:
- requires at least 3 meaningful repo hops before the correct edit is obvious
- starts from a natural engineering prompt, not a hidden planted-bug description
- touches real abstractions used by the codebase
- has at least one plausible wrong turn
- rewards structural lookup over raw grep
- can be scored with tests, diff quality, and behavior metrics
The first directed suite should include tasks like these:
- Locate the fix surface Instruction: find the smallest set of files and symbols likely involved in the bug, and justify the choice before editing.
- Explain the dependency chain Instruction: trace how behavior flows from the user-facing surface to the actual implementation and name the likely blast radius.
- Safe multi-file change Instruction: make the change only after identifying callers or downstream effects that must move with it.
- Refactor with justification Instruction: rename or reshape an internal API, but first enumerate the impacted symbols and test surfaces.
- Investigation before edit Instruction: answer "what breaks if we change this?" and only then apply the minimal patch.
- Verify with targeted commands Instruction: choose the most relevant test or build command and explain why it is sufficient.
- Ownership and invariants Instruction: identify which layer should own the change, what invariants must remain true, and what the main regression risks are before editing.
- Migration and rollout judgment Instruction: explain whether the change should be local, coordinated across callers, or split into phases.
And they should be distributed across modes:
- 3 read-only tasks
- 3 write-only tasks
- 5 read-then-write tasks
The first suite should include tasks like these:
- Bug triage from failing integration test Root cause sits 2-4 symbols away from the failing surface and requires updating both implementation and tests.
- Feature flag propagation Add a new option that must travel through config parsing, model wiring, runtime behavior, and docs or tests.
- Safe API rename Rename a public or crate-local API across implementations, call sites, and tests without collateral edits.
- Signature migration Add or remove a parameter on a shared helper and update downstream callers correctly.
- Endpoint behavior change Adjust request validation, serialization, and handler behavior across router, model, and tests.
- Contract mismatch fix Trace an interface or trait behavior bug where the failing assertion appears far from the source.
- Investigation-only impact task Answer "what breaks if we change this?" accurately before any edit is made.
- Dead code or safe-delete task Determine whether a symbol can be removed, defend the answer, and only edit if the impact is safe.
- Test maintenance after intended behavior shift Distinguish stale tests from broken implementation, then update the correct side.
- Small feature addition in a large repo Implement a narrow behavior change that requires placement, impact tracing, and a multi-file patch.
Directed tasks should be scored on both the result and the intermediate work:
- phase 1 quality: did the agent identify the right files and symbols early?
- phase 2 quality: was the dependency chain or blast-radius explanation correct?
- completion: did the repo end in the correct behavior?
- elapsed time
- action count
- retry count
- scope quality
- diff quality
- wrong edits
- hallucinated APIs
Realistic tasks should be scored on more than pass/fail:
- completion: did the repo end in the correct behavior?
- elapsed time: how long to finish?
- action count: how many total steps?
- retry count: how many visible wrong turns?
- scope quality: did the agent touch irrelevant files?
- diff quality: is the patch minimal and coherent?
- structural leverage: did the agent reduce exploratory churn by using the right repo tools?
Directed evals should permit explicit evaluator commands between phases.
Allowed evaluator commands:
- "Find the likely implementation area first."
- "Now explain what else this change touches."
- "Make the minimal patch."
- "Run the most relevant verification command."
- "Stop editing and justify the current scope."
Disallowed evaluator behavior:
- revealing the hidden merged fix
- naming the exact file or symbol unless the task itself includes it
- giving implementation hints that collapse the search problem
The evaluator should be allowed to steer the workflow with commands, because that is the intended operating model for yoyo.
We now have one clean directed treatment result worth keeping:
- task:
ripgrep-global-gitignore-rust-001 - mode:
read_only - repo:
BurntSushi/ripgrep - treatment only: Codex with
yoyo
Questions used in the run:
Find the 3 most likely files or symbols involved in this bug. Do not edit anything.Which layer should own this fix and why? Answer in terms of CLI argument handling, walker construction, and ignore matching internals. Do not edit anything.State the key invariants that must remain true if we fix this bug, and name the main regression risks or blast radius. Do not edit anything.
What the run showed:
- the model localized the bug to the expected CLI -> walker -> ignore seam
- it placed ownership in
crates/ignore, not in CLI path rewriting - it surfaced the important invariants and regression risks without editing code
Metrics:
22total tool calls22yoyoMCP calls0shell calls16retries
Artifact:
Interpretation:
- this is evidence for groundedness under direction
- it is not yet a broad with-vs-without benchmark
- it is not yet a write benchmark
The target workflow is:
- an engineer gives a concrete command
- the model uses
yoyoto carry it out - the engineer gives the next command based on the result
Examples:
- "Find the 2 most likely files for this bug."
- "Show me the dependency chain before changing anything."
- "Which layer should own this behavior?"
- "What invariants do we risk breaking if we patch it here?"
- "Patch only the minimum surface."
- "Run the exact verification command."
This is closer to how an MCP tool suite is actually used in practice than a one-shot autonomous prompt.
Issue-first evals should avoid evaluator coaching during the run.
Allowed inputs:
- the repository at the pre-fix commit
- the visible issue prompt
- local repo signals the agent can obtain itself, such as tests, build output, docs, and source
Disallowed inputs:
- the merged PR diff or patch
- maintainer comments that reveal the fix
- ad hoc human hints during the run
If the agent asks what to do next, reply with exactly:
OK, continue.
Do not elaborate. Do not paraphrase. Do not add technical hints.
This keeps control and treatment runs comparable while still letting the agent continue exploring.
yoyo should not be expected to win by making trivial fixes marginally faster.
It should win by:
- finding the right symbols earlier
- reducing exploratory file hopping
- tracing callers and blast radius correctly
- making safer multi-file changes
- avoiding wrong edits on ambiguous tasks
- Keep puncture tasks for regression smoke tests.
- Build 5 directed tool-use tasks based on real OSS issues or PRs.
- Keep 3 to 5 autonomous issue-first tasks as integration checks.
- Expand both suites across Go, Rust, and TypeScript.
- Use directed tool-use tasks as the primary release benchmark.