Fix tournament Elo updates by maxim-veksler · Pull Request #17 · nathanaronson/darwin

maxim-veksler · 2026-04-25T16:29:47Z

Summary

- Rate completed tournaments as one Elo rating period.
- Compute expected scores from pre-tournament ratings to avoid async completion-order bias.
- Add focused Elo tests for expected score math, zero-sum behavior, and order-independent updates.
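
The rating-period idea in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in darwin/tournament/elo.py: the function names, the `(player_a, player_b, score_a)` game tuple, and the K-factor of 32 are all assumptions made for the example. It uses the standard logistic Elo expected-score formula and reads every expected score from the pre-tournament snapshot, so the order in which games finish cannot change the result.

```python
from collections import defaultdict

K = 32.0  # assumed K-factor; the PR does not state the real value


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate_tournament(ratings: dict, games: list, k: float = K) -> dict:
    """Rate a whole tournament as one rating period.

    `games` holds hypothetical (player_a, player_b, score_a) tuples where
    score_a is 1.0 for a win, 0.5 for a draw, and 0.0 for a loss.
    """
    deltas = defaultdict(float)
    for a, b, score_a in games:
        # Always read from the untouched pre-tournament ratings.
        e_a = expected_score(ratings[a], ratings[b])
        change = k * (score_a - e_a)
        deltas[a] += change
        deltas[b] -= change  # zero-sum: B loses exactly what A gains
    # Apply the accumulated deltas once, after every game has been counted.
    return {name: r + deltas[name] for name, r in ratings.items()}


if __name__ == "__main__":
    before = {"a": 1500.0, "b": 1500.0, "c": 1600.0}
    results = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "a", 0.0)]
    print(rate_tournament(before, results))
```

Because deltas are accumulated against a fixed snapshot and applied once, shuffling `results` yields the same ratings up to floating-point rounding, which is the property the order-independence test is described as checking.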

Tests

uv run pytest tests\test_elo.py -q
uv run pytest -q
uv run ruff check darwin\tournament\elo.py tests\test_elo.py darwin\orchestration\generation.py

download best + baseline

…-change)

Brings the propose → code → critique → fix chain from Kevin's prompt-change branch. Skips the rest of that branch: UI overhaul, runner.py simplification (would revert per-game error handling and auto-fallback), elo.py revert (would lose Max's PR #17 fix), and the deletion of the new logo files (PR #19).

What's added:
- agents/adversary.py: critique_engine(question, source, name). Reads builder output + originating question, returns a focused critique paragraph. Returns an empty string on any LLM error.
- agents/fixer.py: fix_engine(path, question, critique, ...). Runs a second builder-style call with the critique baked in. Overwrites the engine file in place on success; leaves the original untouched on failure (degrades gracefully).
- agents/prompts/adversary_v1.md, fixer_v1.md: prompts.
- Orchestrator integration: in `_validate_one`, between build and validate, run adversary then fixer when settings.enable_adversary is set.
- New WS events: adversary.completed, fixer.completed (frontend doesn't render these yet, but the bus accepts any dict).
- config.py: Provider type alias, per-role overrides (strategist_provider, player_provider, builder_provider, adversary_provider), adversary_model, enable_adversary, provider_for(role) helper. KEEP max_parallel_games=16 (don't revert to 2 from prompt-change).
- builder.py: passes provider=settings.provider_for("builder") to complete().
- llm.py: complete() and complete_text() take an optional `provider` kwarg and fall back to settings.llm_provider when it is None. Lets a single generation fan out roles across providers (e.g. claude strategist + gemini builder + claude adversary).
- baseline.py: tightens terminal-state handling with explicit insufficient-material + halfmove-clock checks instead of the slow `is_game_over(claim_draw=True)`, plus a stalemate/check distinction in the search base case.
- tests: test_adversary.py + test_fixer.py (both new). Total now 63 passing.
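
The critique-then-fix behavior described above (empty critique on any LLM error, in-place overwrite only on success) might look roughly like the sketch below. It is an illustration under assumptions, not the contents of agents/adversary.py or agents/fixer.py: the injected `llm` callable, the prompt strings, and the boolean return value of `fix_engine` are hypothetical, and the real functions take their LLM client from elsewhere.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for the project's LLM client: prompt in, text out.
LLM = Callable[[str], str]


def critique_engine(question: str, source: str, name: str, llm: LLM) -> str:
    """Return a critique paragraph, or "" if the LLM call fails."""
    prompt = f"Question: {question}\nEngine {name}:\n{source}\nCritique it."
    try:
        return llm(prompt).strip()
    except Exception:
        return ""  # degrade gracefully: no critique means no fix pass


def fix_engine(path: Path, question: str, critique: str, llm: LLM) -> bool:
    """Rewrite the engine file using the critique; True if it was replaced."""
    if not critique:
        return False  # nothing to act on, keep the original
    original = path.read_text()
    prompt = f"Question: {question}\nCritique: {critique}\nCode:\n{original}"
    try:
        revised = llm(prompt)
    except Exception:
        return False  # original file is left untouched
    if not revised.strip():
        return False
    path.write_text(revised)  # overwrite in place only on success
    return True
```

A caller would run `critique_engine` first and pass its result to `fix_engine`, so a failed critique short-circuits the fix step and validation proceeds on the builder's original output.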
What's NOT taken from prompt-change:
- strategist.py (would revert our deterministic version)
- frontend/ (would revert color counter, 2-board cap, bracket aggregate cells, runner-up roster fix)
- tournament/runner.py (would revert per-game error handling + auto-fallback)
- tournament/elo.py + test_elo.py (would revert Max's PR #17)
- logo deletions (would undo PR #19)
- max_parallel_games 16 → 2 regression
- tsconfig.json + tailwind.config.js + index.css

.env updated: ADVERSARY_MODEL=gemini-3-flash-preview so the new adversary uses the same provider/model as the rest of the roles.
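
As a rough illustration of the per-role provider overrides mentioned in this commit message, the sketch below shows one way a `provider_for(role)` helper and a `None` fallback could fit together. The `Settings` dataclass, the two provider literals, the default values, and the `resolve_provider` helper are assumptions for the example; the real config.py and llm.py may be structured differently.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Assumed provider names, taken from the "claude strategist + gemini builder" example.
Provider = Literal["claude", "gemini"]


@dataclass
class Settings:
    llm_provider: Provider = "gemini"  # global default (assumed value)
    strategist_provider: Optional[Provider] = None
    player_provider: Optional[Provider] = None
    builder_provider: Optional[Provider] = None
    adversary_provider: Optional[Provider] = None

    def provider_for(self, role: str) -> Provider:
        """Return the role's override, or the global default if unset."""
        override = getattr(self, f"{role}_provider", None)
        return override or self.llm_provider


def resolve_provider(provider: Optional[Provider], settings: Settings) -> Provider:
    """Mirror the described complete() behavior: None means the default."""
    return provider if provider is not None else settings.llm_provider


if __name__ == "__main__":
    s = Settings(strategist_provider="claude", adversary_provider="claude")
    for role in ("strategist", "builder", "adversary"):
        print(role, s.provider_for(role))
```

Leaving every override unset keeps all roles on the same provider, which matches the intent of pointing ADVERSARY_MODEL at the model the other roles already use.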

kevinxuez and others added 4 commits April 25, 2026 03:36

Merge pull request #15 from nathanaronson/smoke-and-save-button

049a97c

download best + baseline

Merge branch 'main' of https://github.com/nathanaronson/cubist-chess-…

facc546

…engine

Merge branch 'main' of https://github.com/nathanaronson/cubist-chess-…

d149c43

…engine

Fix tournament Elo updates

2d79f19

Fix tournament Elo updates #17

maxim-veksler wants to merge 4 commits into main from elo-fix

2 participants
