Skip to content

Refactor benchmark tests for cleanup_test_output#5402

Open
anushakamran06 wants to merge 1 commit into
Aider-AI:mainfrom
anushakamran06:main
Open

Refactor benchmark tests for cleanup_test_output#5402
anushakamran06 wants to merge 1 commit into
Aider-AI:mainfrom
anushakamran06:main

Conversation

@anushakamran06

Copy link
Copy Markdown

Motivation

benchmark/test_benchmark.py the only test file for the benchmark harness has been broken since mid-2023 and currently cannot run at all. This was independently flagged in a note on #5136:

python -m unittest benchmark/test_benchmark.py is currently blocked by the existing benchmark test import/signature mismatch (cleanup_test_output() now requires testdir).

This PR fixes the tests so the suite runs and passes again, and adds coverage for the testdir behavior that was never tested.

Root cause: the tests were written against a 2023 version of cleanup_test_output. The function has since changed in two ways the tests never tracked:

  1. Signature drift. cleanup_test_output(output) became cleanup_test_output(output, testdir) when testdir-path scrubbing was added (replacing the full test directory path with testdir.name in test output). Both existing tests still call it with one argument → immediate TypeError.
  2. Behavior drift. The second test asserts that long ==== / ---- separator lines are truncated to ==== / ----. That truncation was removed from cleanup_test_output after the June 2023 commit that introduced it ("remove long ==--- lines to save tokens") and no longer exists in the function, so the assertion tests phantom behavior.
  3. Stale expected value. The first test expects "Ran 5 tests in 0.003s\nOK""\nOK", but the current regex r"\bin \d+\.\d+s\b" only removes the timing substring, yielding "Ran 5 tests \nOK" (leading text and trailing space are preserved).

Reproduction (on main, from inside benchmark/):

$ python -m unittest test_benchmark -v
TypeError: cleanup_test_output() missing 1 required positional argument: 'testdir'

Modifications

All changes are confined to benchmark/test_benchmark.py. No production code is touched — the tests are updated to match the current, long-standing behavior of cleanup_test_output.

  1. Added a setUp providing a realistic testdir Path matching the harness layout (tmp.benchmarks/<date>/<lang>/exercises/practice/<exercise>), and passed it to every call.
  2. Corrected the timing-removal test's expected value to "Ran 5 tests \nOK" per the actual regex behavior.
  3. Removed the ====/---- truncation test — that behavior no longer exists in cleanup_test_output.
  4. Added two new tests for previously untested behavior:
    • the testdir path is replaced with testdir.name in output (this is what anonymizes absolute paths before test output is fed back to the LLM)
    • timing removal and path replacement compose correctly in one output

Test inputs are built with f-strings off str(self.testdir), so path-separator differences make the tests pass identically on Windows and POSIX.

Tests

$ cd benchmark && python -m unittest test_benchmark -v
test_output_without_timing_is_unchanged ... ok
test_removes_timing_and_path_together ... ok
test_removes_timing_info ... ok
test_replaces_testdir_path_with_its_name ... ok

Note: python -m unittest test_benchmark must be run from inside benchmark/ so that benchmark.py (which imports its sibling modules prompts and plots as top-level modules) resolves correctly. Running via pytest from the repo root resolves benchmark to the package (benchmark/__init__.py) rather than the module and fails on import — a pre-existing quirk, noted under out-of-scope below.

Impact

Out of scope (possible follow-ups)

  • The benchmark/__init__.py vs benchmark.py name collision that prevents running these tests via pytest with default import mode.
  • Broader unit coverage for benchmark.py (run_unit_tests command selection, summarize_results), which currently has none.

@CLAassistant

CLAassistant commented Jul 5, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants