Refactor benchmark tests for cleanup_test_output by anushakamran06 · Pull Request #5402 · Aider-AI/aider

anushakamran06 · 2026-07-05T09:18:00Z

Motivation

benchmark/test_benchmark.py the only test file for the benchmark harness has been broken since mid-2023 and currently cannot run at all. This was independently flagged in a note on #5136:

python -m unittest benchmark/test_benchmark.py is currently blocked by the existing benchmark test import/signature mismatch (cleanup_test_output() now requires testdir).

This PR fixes the tests so the suite runs and passes again, and adds coverage for the testdir behavior that was never tested.

Root cause: the tests were written against a 2023 version of cleanup_test_output. The function has since changed in two ways the tests never tracked:

Signature drift. cleanup_test_output(output) became cleanup_test_output(output, testdir) when testdir-path scrubbing was added (replacing the full test directory path with testdir.name in test output). Both existing tests still call it with one argument → immediate TypeError.
Behavior drift. The second test asserts that long ==== / ---- separator lines are truncated to ==== / ----. That truncation was removed from cleanup_test_output after the June 2023 commit that introduced it ("remove long ==--- lines to save tokens") and no longer exists in the function, so the assertion tests phantom behavior.
Stale expected value. The first test expects "Ran 5 tests in 0.003s\nOK" → "\nOK", but the current regex r"\bin \d+\.\d+s\b" only removes the timing substring, yielding "Ran 5 tests \nOK" (leading text and trailing space are preserved).

Reproduction (on main, from inside benchmark/):

$ python -m unittest test_benchmark -v
TypeError: cleanup_test_output() missing 1 required positional argument: 'testdir'

Modifications

All changes are confined to benchmark/test_benchmark.py. No production code is touched — the tests are updated to match the current, long-standing behavior of cleanup_test_output.

Added a setUp providing a realistic testdir Path matching the harness layout (tmp.benchmarks/<date>/<lang>/exercises/practice/<exercise>), and passed it to every call.
Corrected the timing-removal test's expected value to "Ran 5 tests \nOK" per the actual regex behavior.
Removed the ====/---- truncation test — that behavior no longer exists in cleanup_test_output.
Added two new tests for previously untested behavior:
- the testdir path is replaced with testdir.name in output (this is what anonymizes absolute paths before test output is fed back to the LLM)
- timing removal and path replacement compose correctly in one output

Test inputs are built with f-strings off str(self.testdir), so path-separator differences make the tests pass identically on Windows and POSIX.

Tests

$ cd benchmark && python -m unittest test_benchmark -v
test_output_without_timing_is_unchanged ... ok
test_removes_timing_and_path_together ... ok
test_removes_timing_info ... ok
test_replaces_testdir_path_with_its_name ... ok

Note: python -m unittest test_benchmark must be run from inside benchmark/ so that benchmark.py (which imports its sibling modules prompts and plots as top-level modules) resolves correctly. Running via pytest from the repo root resolves benchmark to the package (benchmark/__init__.py) rather than the module and fails on import — a pre-existing quirk, noted under out-of-scope below.

Impact

No production code changes; cleanup_test_output is untouched.
Unblocks the test-suite note left on fix: accept model-settings-file in benchmark #5136.
Restores a working baseline so future changes to the harness can actually be tested.

Out of scope (possible follow-ups)

The benchmark/__init__.py vs benchmark.py name collision that prevents running these tests via pytest with default import mode.
Broader unit coverage for benchmark.py (run_unit_tests command selection, summarize_results), which currently has none.

CLAassistant · 2026-07-05T09:18:07Z

All committers have signed the CLA.

Refactor benchmark tests for cleanup_test_output

61fb6a1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor benchmark tests for cleanup_test_output#5402

Refactor benchmark tests for cleanup_test_output#5402
anushakamran06 wants to merge 1 commit into
Aider-AI:mainfrom
anushakamran06:main

anushakamran06 commented Jul 5, 2026

Uh oh!

CLAassistant commented Jul 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

anushakamran06 commented Jul 5, 2026

Motivation

Modifications

Tests

Impact

Out of scope (possible follow-ups)

Uh oh!

CLAassistant commented Jul 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CLAassistant commented Jul 5, 2026 •

edited

Loading