Fix tournament Elo updates#17
Open
maxim-veksler wants to merge 4 commits into
download best + baseline
asrinivasan75
added a commit
that referenced
this pull request
Apr 25, 2026
…-change)

Brings the propose → code → critique → fix chain from Kevin's prompt-change branch. Skips the rest of that branch: UI overhaul, runner.py simplification (would revert per-game error handling and auto-fallback), elo.py revert (would lose Max's PR #17 fix), and the deletion of the new logo files (PR #19).

What's added:

- agents/adversary.py: critique_engine(question, source, name). Reads builder output + originating question, returns a focused critique paragraph. Empty string on any LLM error.
- agents/fixer.py: fix_engine(path, question, critique, ...). Runs a second builder-style call with the critique baked in. Overwrites the engine file in place on success; leaves the original untouched on failure (degrades gracefully).
- agents/prompts/adversary_v1.md, fixer_v1.md: prompts.
- Orchestrator integration: in `_validate_one`, between build and validate, run adversary then fixer when settings.enable_adversary. New WS events: adversary.completed, fixer.completed (frontend doesn't render these yet but the bus accepts any dict).
- config.py: Provider type alias, per-role overrides (strategist_provider, player_provider, builder_provider, adversary_provider), adversary_model, enable_adversary, provider_for(role) helper. KEEP max_parallel_games=16 (don't revert to 2 from prompt-change).
- builder.py: passes provider=settings.provider_for("builder") to complete().
- llm.py: complete() and complete_text() take optional `provider` kwarg, falls back to settings.llm_provider when None. Lets a single generation fan out roles across providers (e.g. claude strategist + gemini builder + claude adversary).
- baseline.py: tightens terminal-state handling. Explicit insufficient-material + halfmove-clock checks instead of slow `is_game_over(claim_draw=True)`, plus stalemate/check distinction in the search base case.
- tests: test_adversary.py + test_fixer.py (both new). Total now 63 passing.
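The "overwrite in place on success, leave the original untouched on failure" behaviour described for agents/fixer.py can be sketched roughly as below. This is only an illustration under assumptions, not the code from the commit: `fix_in_place` and the `regenerate` callback are hypothetical names standing in for fix_engine and its LLM call, and the write-to-temp-then-replace step is one possible way to avoid leaving a half-written engine file.

```python
from pathlib import Path
from typing import Callable


def fix_in_place(path: Path, regenerate: Callable[[str], str]) -> bool:
    """Hypothetical stand-in for fix_engine: returns True if the file was rewritten."""
    original = path.read_text(encoding="utf-8")
    try:
        # `regenerate` stands in for the second builder-style LLM call.
        candidate = regenerate(original)
    except Exception:
        # Any failure leaves the original file exactly as it was.
        return False
    if not candidate.strip():
        # Treat an empty result as a failure too (assumption).
        return False
    # Write next to the target, then swap it in, so the engine file is
    # never left half-written.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(candidate, encoding="utf-8")
    tmp.replace(path)
    return True
```

A caller would treat a False return as "keep validating the unfixed engine", which matches the graceful degradation the commit message describes.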
What's NOT taken from prompt-change:

- strategist.py (would revert our deterministic version)
- frontend/ (would revert color counter, 2-board cap, bracket aggregate cells, runner-up roster fix)
- tournament/runner.py (would revert per-game error handling + auto-fallback)
- tournament/elo.py + test_elo.py (would revert Max's PR #17)
- logo deletions (would undo PR #19)
- max_parallel_games 16 → 2 regression
- tsconfig.json + tailwind.config.js + index.css

.env updated: ADVERSARY_MODEL=gemini-3-flash-preview so the new adversary uses the same provider/model as the rest of the roles.
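The per-role provider overrides and the `provider` fallback mentioned in the commit message might look something like the following. It is a minimal sketch under assumptions: the field names come from the message, but the `Settings` dataclass, the "override or default" rule inside `provider_for`, and the `resolve_provider` helper are hypothetical and may differ from the real config.py and llm.py.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Assumed alias; the message only names claude and gemini as examples.
Provider = Literal["claude", "gemini"]


@dataclass
class Settings:
    llm_provider: Provider = "gemini"
    strategist_provider: Optional[Provider] = None
    player_provider: Optional[Provider] = None
    builder_provider: Optional[Provider] = None
    adversary_provider: Optional[Provider] = None

    def provider_for(self, role: str) -> Provider:
        # Use the per-role override when set, else the global default (assumption).
        override = getattr(self, f"{role}_provider", None)
        return override or self.llm_provider


def resolve_provider(settings: Settings, provider: Optional[Provider] = None) -> Provider:
    # Hypothetical helper mirroring the described kwarg behaviour:
    # fall back to settings.llm_provider when provider is None.
    return provider if provider is not None else settings.llm_provider
```

With `Settings(strategist_provider="claude", adversary_provider="claude")`, the builder role would resolve to the default gemini while the strategist and adversary resolve to claude, which is the kind of fan-out across providers the message mentions.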
nathanaronson
pushed a commit
that referenced
this pull request
Apr 25, 2026
Summary
Tests