Fix tournament Elo updates #17

Open
maxim-veksler wants to merge 4 commits into main from elo-fix

@maxim-veksler
Collaborator

Summary

  • rate completed tournaments as one Elo rating period
  • compute expected scores from pre-tournament ratings to avoid async completion-order bias
  • add focused Elo tests for expected score math, zero-sum behavior, and order-independent updates
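
The rating-period idea in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in `darwin/tournament/elo.py`: the names `expected_score` and `rate_tournament`, the K-factor of 32, and the `(player_a, player_b, score_a)` game tuple are all assumptions made for the example. Every expected score is read from the frozen pre-tournament ratings, so the order in which games finish cannot change the result.

```python
from collections import defaultdict

K = 32.0  # assumed K-factor; the PR does not state the value it uses


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rate_tournament(ratings, games, k=K):
    """Return new ratings, treating all games as a single rating period.

    ratings: dict of player name -> pre-tournament rating (never mutated).
    games:   iterable of (player_a, player_b, score_a), where score_a is
             1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    deltas = defaultdict(float)
    for a, b, score_a in games:
        # Always look up the pre-tournament snapshot, never a rating that
        # an earlier game in this loop has already adjusted.
        exp_a = expected_score(ratings[a], ratings[b])
        deltas[a] += k * (score_a - exp_a)
        # B's change is the mirror image of A's, which keeps the update zero-sum.
        deltas[b] += k * ((1.0 - score_a) - (1.0 - exp_a))
    # Apply the accumulated deltas once, after every game has been scored.
    return {player: r + deltas.get(player, 0.0) for player, r in ratings.items()}
```

Calling `rate_tournament` with the same games in a different order should give the same ratings up to floating-point rounding, which is the property the order-independence test in this PR is described as checking.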

Tests

  • uv run pytest tests\test_elo.py -q
  • uv run pytest -q
  • uv run ruff check darwin\tournament\elo.py tests\test_elo.py darwin\orchestration\generation.py

asrinivasan75 added a commit that referenced this pull request Apr 25, 2026
…-change)

Brings the propose → code → critique → fix chain from Kevin's
prompt-change branch. Skips the rest of that branch — UI overhaul,
runner.py simplification (would revert per-game error handling and
auto-fallback), elo.py revert (would lose Max's PR #17 fix), and the
deletion of the new logo files (PR #19).

What's added:

  agents/adversary.py — critique_engine(question, source, name).
    Reads builder output + originating question, returns a focused
    critique paragraph. Empty string on any LLM error.

  agents/fixer.py — fix_engine(path, question, critique, ...).
    Runs a second builder-style call with the critique baked in.
    Overwrites the engine file in place on success; leaves the
    original untouched on failure (degrades gracefully).

  agents/prompts/adversary_v1.md, fixer_v1.md — prompts.

  Orchestrator integration: in `_validate_one`, between build and
  validate, run adversary then fixer when settings.enable_adversary.
  New WS events: adversary.completed, fixer.completed (frontend
  doesn't render these yet but the bus accepts any dict).

  config.py: Provider type alias, per-role overrides
  (strategist_provider, player_provider, builder_provider,
  adversary_provider), adversary_model, enable_adversary,
  provider_for(role) helper. KEEP max_parallel_games=16 (don't
  revert to 2 from prompt-change).

  builder.py: passes provider=settings.provider_for("builder") to
  complete().

  llm.py: complete() and complete_text() take an optional `provider`
  kwarg that falls back to settings.llm_provider when None. This lets a
  single generation fan out roles across providers (e.g. claude
  strategist + gemini builder + claude adversary).

  baseline.py: tightens terminal-state handling with explicit
  insufficient-material and halfmove-clock checks instead of the slow
  `is_game_over(claim_draw=True)`, plus a stalemate/check distinction
  in the search base case.

  tests: test_adversary.py + test_fixer.py (both new). Total now
  63 passing.
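
The "overwrite in place on success, leave the original untouched on failure" behavior described for fix_engine above could look something like the sketch below. It is not the code in agents/fixer.py: the helper name `overwrite_on_success`, the `rewrite` callback standing in for the LLM call, the empty-output check, and the temp-file swap are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def overwrite_on_success(path: Path, rewrite: Callable[[str], str]) -> bool:
    """Replace the file's contents with rewrite(old) only if rewrite succeeds.

    Returns True when the file was replaced and False when the original
    was kept. `rewrite` is a hypothetical stand-in for the fixer's LLM call.
    """
    original = path.read_text(encoding="utf-8")
    try:
        updated = rewrite(original)
    except Exception:
        return False  # any error from the rewrite step: keep the original
    if not updated.strip():
        return False  # assumption: an empty rewrite counts as a failure

    # Write to a temp file in the same directory, then swap it in, so a
    # crash mid-write cannot leave a half-written engine file behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(updated)
        os.replace(tmp_name, path)
    except OSError:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        return False
    return True
```

Returning a boolean rather than raising keeps the caller's happy path simple: validation proceeds with whichever version of the file is on disk.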

What's NOT taken from prompt-change:
  - strategist.py (would revert our deterministic version)
  - frontend/ (would revert color counter, 2-board cap, bracket
    aggregate cells, runner-up roster fix)
  - tournament/runner.py (would revert per-game error handling
    + auto-fallback)
  - tournament/elo.py + test_elo.py (would revert Max's PR #17)
  - logo deletions (would undo PR #19)
  - max_parallel_games 16 → 2 regression
  - tsconfig.json + tailwind.config.js + index.css

.env updated: ADVERSARY_MODEL=gemini-3-flash-preview so the new
adversary uses the same provider/model as the rest of the roles.
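
The per-role provider overrides and the `provider` fallback described above might resolve roughly as in this sketch. The names `provider_for`, `llm_provider`, and the four `*_provider` fields come from the commit message; the dataclass layout, the `Provider` members, the default values, and the helper `resolve_provider` are assumptions for illustration and may not match config.py or llm.py.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Assumed members; the commit only says a Provider type alias exists.
Provider = Literal["claude", "gemini"]


@dataclass
class Settings:
    llm_provider: Provider = "gemini"  # global default (value assumed)
    strategist_provider: Optional[Provider] = None
    player_provider: Optional[Provider] = None
    builder_provider: Optional[Provider] = None
    adversary_provider: Optional[Provider] = None

    def provider_for(self, role: str) -> Provider:
        """Return the role's override if one is set, else the global default."""
        override = getattr(self, f"{role}_provider", None)
        return override or self.llm_provider


def resolve_provider(settings: Settings, provider: Optional[Provider] = None) -> Provider:
    """Hypothetical mirror of the fallback described for complete():
    an explicit provider wins, None falls back to settings.llm_provider."""
    return provider if provider is not None else settings.llm_provider


settings = Settings(strategist_provider="claude")
# A caller such as builder.py would pass settings.provider_for("builder").
builder_provider = resolve_provider(settings, settings.provider_for("builder"))
strategist_provider = resolve_provider(settings, settings.provider_for("strategist"))
```

With no override set for a role, both steps land on the global default, so existing single-provider setups would keep working unchanged.
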
nathanaronson pushed a commit that referenced this pull request Apr 25, 2026
…-change)

Brings the propose → code → critique → fix chain from Kevin's
prompt-change branch. Skips the rest of that branch — UI overhaul,
runner.py simplification (would revert per-game error handling and
auto-fallback), elo.py revert (would lose Max's PR #17 fix), and the
deletion of the new logo files (PR #19).

What's added:

  agents/adversary.py — critique_engine(question, source, name).
    Reads builder output + originating question, returns a focused
    critique paragraph. Empty string on any LLM error.

  agents/fixer.py — fix_engine(path, question, critique, ...).
    Runs a second builder-style call with the critique baked in.
    Overwrites the engine file in place on success; leaves the
    original untouched on failure (degrades gracefully).

  agents/prompts/adversary_v1.md, fixer_v1.md — prompts.

  Orchestrator integration: in `_validate_one`, between build and
  validate, run adversary then fixer when settings.enable_adversary.
  New WS events: adversary.completed, fixer.completed (frontend
  doesn't render these yet but the bus accepts any dict).

  config.py: Provider type alias, per-role overrides
  (strategist_provider, player_provider, builder_provider,
  adversary_provider), adversary_model, enable_adversary,
  provider_for(role) helper. KEEP max_parallel_games=16 (don't
  revert to 2 from prompt-change).

  builder.py: passes provider=settings.provider_for("builder") to
  complete().

  llm.py: complete() and complete_text() take optional `provider`
  kwarg, falls back to settings.llm_provider when None. Lets a
  single generation fan out roles across providers (e.g. claude
  strategist + gemini builder + claude adversary).

  baseline.py: tightens terminal-state handling — explicit
  insufficient-material + halfmove-clock checks instead of slow
  `is_game_over(claim_draw=True)`, plus stalemate/check distinction
  in the search base case.

  tests: test_adversary.py + test_fixer.py (both new). Total now
  63 passing.

What's NOT taken from prompt-change:
  - strategist.py (would revert our deterministic version)
  - frontend/ (would revert color counter, 2-board cap, bracket
    aggregate cells, runner-up roster fix)
  - tournament/runner.py (would revert per-game error handling
    + auto-fallback)
  - tournament/elo.py + test_elo.py (would revert Max's PR #17)
  - logo deletions (would undo PR #19)
  - max_parallel_games 16 → 2 regression
  - tsconfig.json + tailwind.config.js + index.css

.env updated: ADVERSARY_MODEL=gemini-3-flash-preview so the new
adversary uses the same provider/model as the rest of the roles.