
Post-Debate Review & System Improvement

Read this file after completing Phase 7 (Synthesis).

You have just finished orchestrating a multi-agent debate. Before closing out, you must perform a systematic review and make improvements to the debate system for future runs.


STEP 1: Review the Debate Outputs

Read through these files from the debate you just completed:

  1. 05_debate_log.md - The actual debate transcript
  2. 06_reflections.md - Agent reflections
  3. 07_synthesis.md - Your synthesis report
  4. _state/quality_assessment.json - Final quality metrics
  5. _state/attack_registry.json - Attack/response tracking
  6. _state/contribution_tracker.json - Contribution counts

STEP 2: Evaluate Against Quality Criteria

Answer these questions honestly:

A. Agent Behavior

  • Did agents quote each other directly and rebut specific claims?
  • Did agents use emotional, confrontational language (not diplomatic)?
  • Did agents avoid forbidden phrases ("I understand your point...", etc.)?
  • Did agents defend themselves when attacked?
  • Did agents stay in character throughout?
  • Were the personas distinct and authentic?

B. Debate Dynamics

  • Were there genuine back-and-forth exchanges (3+ turns on same point)?
  • Did the debate feel heated, not like a polite panel?
  • Were there surprising moments (unexpected alliances, concessions)?
  • Did agents actually change positions or make real concessions in reflections?
  • Was the debate substantive (5-8 sentences per contribution)?

C. System Compliance

  • Did you use sequential waves (2-3 agents at a time)?
  • Did you update state files after each wave?
  • Did you use the attack registry to force responses?
  • Did you check quality gates before advancing phases?
  • Did you use Opus subagents (not Haiku/Sonnet)?
  • Did agents read their persona cards and the debate rules?

D. Output Quality

  • Does the synthesis report capture the key disagreements?
  • Are the policy recommendations grounded in the debate?
  • Did you identify the core cruxes (empirical and value-based)?

STEP 3: Identify What Went Wrong

For each "No" answer above, write a brief note:

  • What specifically happened?
  • Why did it happen?
  • What could have prevented it?

Common failure patterns to look for:

  • Diplomatic collapse: Agents being too nice to each other
  • Parallel monologues: Agents not reacting to each other
  • Shallow takes: Short, superficial contributions
  • Role drift: Agents breaking character
  • Missed attacks: Unresolved attacks in the registry
  • Template ignorance: Agents not reading persona cards or rules
  • Premature advancement: Moving phases before quality gates passed
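
Two of these patterns (diplomatic collapse and shallow takes) can be roughly screened for before reading the log in full. The following is a minimal sketch, not part of the system: the single forbidden phrase is taken from the checklist above, the sentence threshold is the lower bound of the 5-8 sentence target, and `flag_contribution` and `count_sentences` are hypothetical helper names. The real phrase list lives in debate_rules.json, whose layout is not shown here.

```python
import re

# Assumption: one illustrative phrase from the checklist above; the real
# list is kept in _master_templates/debate_rules.json.
FORBIDDEN_PHRASES = ["I understand your point"]
MIN_SENTENCES = 5  # lower bound of the 5-8 sentence target


def count_sentences(text: str) -> int:
    """Rough count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p.strip()])


def flag_contribution(text: str) -> list:
    """Return warnings for one contribution (empty list if none apply)."""
    warnings = []
    lowered = text.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase.lower() in lowered:
            warnings.append(f"diplomatic collapse: uses '{phrase}'")
    n = count_sentences(text)
    if n < MIN_SENTENCES:
        warnings.append(f"shallow take: {n} sentences (target 5-8)")
    return warnings
```

A screen like this only surfaces candidates for review. Whether a contribution is genuinely confrontational or substantive still needs a human or moderator read.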

STEP 4: Make System Improvements

Based on your review, you MUST now edit the system files to prevent these issues in future debates.

Files You Can Edit:

  1. debate_initialisation_prompt.md - Main orchestrator instructions

    • Add warnings about specific failure modes you observed
    • Clarify instructions that were ambiguous
    • Add new quality checks if needed
    • Improve agent invocation templates
  2. quick_start.md - Quick onboarding guide

    • Add lessons learned
    • Update common scenarios section
    • Clarify any confusing instructions
  3. _master_templates/debate_rules.json - Debate behavior rules

    • Add forbidden phrases you observed being used
    • Add required phrases that worked well
    • Adjust response length requirements if needed
  4. _master_templates/moderator_prompts.json - Moderator interventions

    • Add new escalation prompts that would have helped
    • Add new opening clash templates
    • Improve existing prompts based on what worked
  5. _master_templates/persona_card_template.json - Agent persona structure

    • Add fields that would have been useful
    • Improve the example if it was unclear
  6. _master_templates/state_schemas.json - State file documentation

    • Add clarity if state files were misused

Guidelines for Edits:

DO:

  • Add specific warnings based on real failures ("WARNING: In testing, agents often did X...")
  • Add new forbidden/required phrases you discovered
  • Improve clarity of ambiguous instructions
  • Add new moderator prompts that would have helped
  • Document lessons learned prominently

DON'T:

  • Remove existing instructions without good reason
  • Make changes that aren't grounded in observed problems
  • Add complexity without clear benefit
  • Change file formats or folder structure without very strong justification

STEP 5: Document Your Changes

After making edits, add an entry to this file:

Change Log

## [DATE] - [DEBATE TOPIC]

### Triggered By:
[Name the specific debate that revealed this issue, e.g., "OpenAI 2026 Leadership (8 Participants)"]

### Issues Observed:
- [Issue 1]
- [Issue 2]

### Changes Made:
- [File]: [What was changed and why]
- [File]: [What was changed and why]

### Expected Impact:
- [How this should improve future debates]

STEP 6: Structural Changes (Optional)

If you have a strong, well-reasoned idea for a structural improvement, you may:

  • Add new template files to _master_templates/
  • Create new state tracking mechanisms
  • Restructure the phase workflow

Requirements for structural changes:

  1. Write a clear rationale (what problem does this solve?)
  2. Ensure backward compatibility (don't break existing debates)
  3. Update all documentation to reflect the change
  4. Keep it simple - complexity should be justified

Change Log

Add your entries below this line:


2025-12-11 - AI State December 2025 (8 Participants)

Issues Observed:

  1. State files never updated (CRITICAL): After 5 waves of debate, attack_registry.json, contribution_tracker.json, and quality_assessment.json were all still at their initial default values. The orchestrator completely skipped the MODERATOR UPDATE ROUTINE.

  2. Quality gates never formally checked: The orchestrator advanced phases without verifying that ready_for_next_phase == true in quality_assessment.json.

  3. No PDF output generated: The debate concluded without generating a PDF synthesis report because no instructions existed for this.

  4. Agents may not have read debate_rules.json consistently: While agents behaved appropriately (likely from persona prompting), there was no verification they actually read the rules file.

What Went Well:

  • Agents quoted each other directly and made substantive rebuttals
  • Strong emotional, confrontational language maintained throughout
  • Genuine position shifts occurred (TECHNO conceding Weber's point)
  • Sequential wave execution worked correctly (2-3 agents at a time)
  • Opus subagents produced high-quality roleplay
  • Synthesis outputs were comprehensive and insightful

Changes Made:

debate_initialisation_prompt.md:

  • Added prominent WARNING box before MODERATOR UPDATE ROUTINE emphasizing state file updates are mandatory
  • Added POST-WAVE CHECKLIST (7 items) that must be completed after every wave
  • Added new Phase 7.5 — PDF GENERATION with full XeLaTeX template and compilation instructions
  • Added new failure modes to appendix: "Skipped State Updates", "Missing PDF Output", "No Formal Quality Gates"
  • Added quality signals for state file updates and PDF generation

quick_start.md:

  • Added new critical lesson: "The #6 Mistake: Skipping State File Updates (MOST COMMON!)"
  • Added new critical lesson: "The #7 Mistake: No PDF Output"
  • Added common scenario: "I forgot to update state files for several waves"
  • Added common scenario: "XeLaTeX isn't installed"
  • Updated TL;DR checklist to emphasize state file updates and include Phase 7.5 and Phase 8

_master_templates/moderator_prompts.json:

  • Added new "post_wave_checklist" section with 7 mandatory items and verification command

_master_templates/debate_rules.json:

  • Added new "moderator_reminders" section with after_every_wave and before_phase_transition checklists
  • Included explicit warning about the common failure observed in testing

Expected Impact:

  1. State file compliance: The prominent warnings, mandatory checklists, and explicit failure documentation should make it nearly impossible for future orchestrators to skip state file updates.

  2. PDF output: Every debate will now produce a consistently styled PDF as a final deliverable.

  3. Quality gates: Explicit reminders to check quality_assessment.json before phase transitions should prevent premature advancement.

  4. Self-documenting failures: By documenting the exact failure mode in multiple places, future orchestrators will recognize when they're about to make the same mistake.


2025-12-11 - Dynamic Deliverables System

Motivation:

The original system was hardcoded for political debates, with outputs like "policy bundles" and "stakeholder impacts". Those outputs do not fit technical debates (architecture decisions), strategic debates (business decisions), research debates, and similar topics.

Changes Made:

debate_initialisation_prompt.md:

  • Added new section "DEBATE TYPES AND DELIVERABLES" after user inputs
  • Defined 7 debate types: policy, technical, strategic, ethical, research, risk, general
  • Each type has trigger keywords for automatic detection
  • Defined MANDATORY deliverables (all types): debate_log, reflections, cruxes, matrix.json, PDF
  • Defined TYPE-SPECIFIC deliverables for each type (e.g., technical debates get decision_matrix, recommendation; risk debates get risk_register, mitigation_strategies)
  • Added PERSONA ADAPTATION table matching agent archetypes to debate types
  • Updated Phase 0 to include debate type detection and deliverable selection
  • Updated Phase 7 to produce type-appropriate outputs dynamically
  • Updated PDF template to have type-specific section structures

quick_start.md:

  • Added "Debate Types and Dynamic Deliverables" section with summary table
  • Shows what triggers each type and what outputs are produced

_master_templates/persona_card_template.json:

  • Updated to version 3.0
  • Made spectrum_position flexible with examples for each debate type (policy, technical, strategic, ethical, research, risk)
  • Added "background" field for professional context
  • Added "key_argument" and "what_would_change_mind" to on_this_topic
  • Replaced single example with multiple examples for different debate types (policy, technical, strategic, risk)

Expected Impact:

  1. Flexibility: System now handles technical debates, business strategy debates, risk assessments, etc. - not just political debates
  2. Appropriate outputs: Technical debates produce decision matrices, not policy bundles
  3. Better personas: Agents can be architects, CFOs, security leads - not just political positions
  4. Same rigor: Mandatory outputs (cruxes, reflections, PDF) ensure every debate produces comparable analysis

2025-12-11 - User Clarification Step

Motivation:

Debates work better when tailored to user needs. Open-ended topics need scoping, users may want specific outputs, and context matters. Rather than guessing, the orchestrator should ask a few targeted questions before starting.

Changes Made:

quick_start.md:

  • Added new STEP 4: "Clarify with the User Before Starting"
  • Includes question categories: scope, outputs, context, depth
  • Provides example format for asking questions
  • Explains how to handle "just proceed" responses
  • Renumbered all subsequent steps (STEP 5-11)
  • Updated TL;DR checklist to include clarification step

debate_initialisation_prompt.md:

  • Added "ASK CLARIFYING QUESTIONS" section after Phase 0 planning
  • Lists question templates by category
  • Emphasizes: don't over-ask, use judgment, offer to proceed with defaults

Expected Impact:

  1. More relevant debates: Topics get properly scoped before starting
  2. Better outputs: User can specify what deliverables they actually need
  3. Context-aware: Orchestrator learns about constraints, audience, existing decisions
  4. User control: Users can skip questions and proceed with defaults if they prefer
  5. Simple questions: Questions are quick to answer if user knows what they want

2025-12-11 - Research Requirements

Motivation:

Debates were producing generic arguments from training data rather than evidence-backed positions with current statistics and recent developments. Without research, agents make assertions without citations.

Changes Made:

debate_initialisation_prompt.md:

  • Added "RESEARCH THE TOPIC FIRST" section in Phase 0
    • Orchestrator must do 5-10 web searches before planning
    • Research: current debate state, recent developments, stakeholder positions, key statistics
    • Example searches provided
  • Added "SUBAGENTS MUST DO RESEARCH" section in Phase 3
    • Each agent must do 2-4 searches when writing initial position
    • Must find statistics, studies, recent events, specific examples
    • Provided good/bad examples of evidence-backed vs generic claims
  • Updated AGENT INVOCATION TEMPLATE
    • Added STEP 5: RESEARCH IF NEEDED
    • Agents search during debate when making factual claims or countering opponent data
    • Added "Cite evidence" to critical reminders

quick_start.md:

  • Added "The #0 Mistake: No Research" as first critical lesson
    • Explains orchestrator and subagent research requirements
    • Good/bad examples of evidence quality
  • Updated STEP 6 agent invocation template with research step
  • Updated TL;DR checklist to include research step

Expected Impact:

  1. Evidence-backed debates: Arguments cite specific statistics, studies, and recent events
  2. Current information: Research pulls in 2024-2025 developments, not just training data
  3. Stronger arguments: Agents can cite sources when challenged
  4. Better synthesis: Final reports based on actual evidence, not generic claims
  5. Fact-checkable: Specific citations allow readers to verify claims

2025-12-11 - Tromsø December 2025 (6 Participants, Travel Recommendations)

Issues Observed:

  1. State files never updated (CRITICAL - AGAIN!): Despite existing warnings about this being the #1 failure mode, after 5 waves of debate all state files were still at defaults:

    • attack_registry.json: {"unresolved": [], "resolved": []}
    • contribution_tracker.json: {"target_per_agent": 3, "agents": {}}
    • quality_assessment.json: current_wave: 0
  2. No enforcement mechanism: The warnings existed but nothing STOPPED the orchestrator from proceeding without updating. The checklist was ignored.

  3. PDF generation forgotten: Phase 7.5 was skipped until explicitly requested.

  4. Post-debate review (Phase 8) forgotten: Had to be prompted to do it.

  5. No formal quality gates checked: Phases were advanced based on "feels done" not actual gate verification.
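
The "still at defaults" condition described in item 1 can be checked mechanically. A minimal sketch, assuming the three files sit together in one state directory (such as `_state/`) and that the defaults are exactly the values quoted above. `stale_state_files` is a hypothetical helper, not something the system ships.

```python
import json
from pathlib import Path

# Default contents as quoted in the issue above.
DEFAULTS = {
    "attack_registry.json": {"unresolved": [], "resolved": []},
    "contribution_tracker.json": {"target_per_agent": 3, "agents": {}},
}


def stale_state_files(state_dir: str) -> list:
    """Return the names of state files that still look untouched."""
    root = Path(state_dir)
    stale = []
    for name, default in DEFAULTS.items():
        if json.loads((root / name).read_text()) == default:
            stale.append(name)
    # Only current_wave is known for quality_assessment.json, so check
    # that key alone; a missing key is treated the same as 0.
    quality = json.loads((root / "quality_assessment.json").read_text())
    if quality.get("current_wave", 0) == 0:
        stale.append("quality_assessment.json")
    return stale
```

Running this after the first wave and getting a non-empty list back would indicate that the update step was skipped.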

What Went Well:

  • Agents quoted each other extensively and made substantive rebuttals
  • Genuine confrontational energy maintained (no diplomatic collapse)
  • Strong persona differentiation (AURORA passionate, BUDGET numbers-driven, LOCAL skeptical)
  • Position shifts occurred in reflections (AURORA conceded BUDGET's math)
  • Sequential wave execution worked correctly
  • Opus subagents produced high-quality roleplay
  • Final synthesis was actionable and well-structured

Changes Made:

debate_initialisation_prompt.md:

  • Added HUGE warning box at the very top making state file updates the #1 documented failure
  • Added "5. STATE FILES ARE NOT OPTIONAL" to the numbered critical lessons
  • Added "=== ENFORCEMENT: VERIFY BEFORE NEXT WAVE ===" section with mandatory read step
  • Explained WHY state tracking matters (not just that it's required)

quick_start.md:

  • Upgraded #6 Mistake header to "HAPPENS EVERY TIME!" with emoji
  • Added explanation of WHY orchestrators skip it (caught up in agent launching)
  • Added THE ENFORCEMENT FIX with specific verification step
  • Added step 6 to the checklist: "READ the state files back to verify"
  • Added consequence statement: "If you skip this, your debate is useless"

Root Cause Analysis:

The orchestrator got caught up in the "interesting" work (launching agents, reading responses) and treated state file updates as optional bookkeeping. The existing warnings were:

  • Prominent but easily scrolled past
  • Not enforced by any mechanism
  • Disconnected from the "main" workflow

Expected Impact:

  1. Earlier detection: The enforcement step (read quality_assessment.json before next wave) creates a checkpoint that reveals skipped updates immediately
  2. Psychological reframing: Calling it "the moderator's core job" rather than "administrative overhead" may help
  3. Consequence awareness: Explicit statement that "your debate is useless" without tracking may motivate compliance
  4. Pattern recognition: Multiple entries documenting this SAME failure should make future orchestrators recognize it

Remaining Concern:

This is now the THIRD documented instance of this failure. The warnings are extensive. If the next orchestrator STILL skips state updates, consider:

  • Making state file updates part of the agent invocation template (do them immediately after agents complete, before any other action)
  • Creating a "wave wrapper" that bundles agent launch + state update as one unit
  • Adding a pre-flight check at the START of each wave that reads state files

2025-12-11 - Claude Code Hooks Implementation

Motivation:

Despite extensive warnings about state file updates, PDF generation, and research requirements, orchestrators consistently forgot these steps. Text warnings get read once at the start and then buried under debate context. We needed an active reminder system.

Solution Implemented:

Created three Claude Code hooks that fire automatically when specific files are written:

| Hook | Triggers On | Reminds About |
|---|---|---|
| debate-state-update.sh | 05_debate_log.md | Update attack_registry, contribution_tracker, quality_assessment |
| debate-final-phases.sh | 07_synthesis.md | Phase 7.5 (PDF) and Phase 8 (review) are mandatory |
| debate-research-reminder.sh | 03_positions.md | Do web research before writing positions |

Files Created:

  • .claude/hooks/debate-state-update.sh
  • .claude/hooks/debate-final-phases.sh
  • .claude/hooks/debate-research-reminder.sh
  • .claude/settings.json (hook configuration)
  • _master_templates/completion_checklist.json (phase tracking)

Documentation Updates:

  • debate_initialisation_prompt.md: Added checklist copy to Phase 0.5 init, added hooks notification
  • quick_start.md: Added checklist copy, added common scenario for hooks
  • post_debate_review.md: This entry

How Hooks Work:

The hooks are PostToolUse hooks that match on the Write tool. Each hook script:

  1. Receives JSON input via stdin with the file_path
  2. Checks whether the file_path matches its target pattern (regex)
  3. Outputs a reminder box if matched, otherwise silent
  4. Always exits 0 (allows flow to continue, just provides reminder)
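
The four steps above can be sketched as follows. The actual hooks are shell scripts whose contents are not shown in this file, so this Python version is an illustrative equivalent only. It assumes the written path arrives under `tool_input.file_path` in the stdin JSON (with a top-level `file_path` as a fallback). The regex and reminder text are hypothetical.

```python
import json
import re
import sys

# Hypothetical pattern for the state-update hook.
TARGET_PATTERN = re.compile(r"05_debate_log\.md$")

REMINDER = (
    "REMINDER: update attack_registry.json, contribution_tracker.json "
    "and quality_assessment.json before the next wave."
)


def reminder_for(payload: dict) -> str:
    """Return the reminder if the written file matches, else an empty string."""
    # Assumption: the path is nested under tool_input; fall back to top level.
    tool_input = payload.get("tool_input") or {}
    file_path = tool_input.get("file_path") or payload.get("file_path") or ""
    return REMINDER if TARGET_PATTERN.search(file_path) else ""


def main(stream=None) -> int:
    """Read the hook payload, print a reminder on a match, never block."""
    stream = stream or sys.stdin
    try:
        payload = json.load(stream)
    except json.JSONDecodeError:
        return 0  # malformed input: stay silent
    message = reminder_for(payload)
    if message:
        print(message)
    return 0  # always 0, so the hook reminds but does not stop the flow

# As a standalone hook script, the last line would be: sys.exit(main())
```

Returning 0 in every branch mirrors step 4: the hook is advisory and leaves the decision to the orchestrator.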

Expected Impact:

  1. Just-in-time reminders: Reminders appear exactly when relevant, not buried at the start
  2. Hard to overlook: The reminder appears immediately after the triggering action, while the step is still relevant
  3. Cumulative with warnings: Hooks supplement (not replace) documentation warnings
  4. Reduces cognitive load: Orchestrator doesn't need to remember everything upfront

Limitations:

  • Hooks remind but don't enforce - orchestrator can still ignore
  • Hooks fire only on the Write tool, not on reads or other operations
  • Hooks must be installed in the user's Claude config
  • May produce redundant reminders if instructions were already followed

2025-12-11 - Meaning of Life (10 Participants, Philosophical Debate) - SUCCESS

Quality Criteria Evaluation:

A. Agent Behavior ✅ ALL PASSED

  • Agents quoted each other directly and rebutted specific claims (14+ attacks registered)
  • Agents used emotional, confrontational language ("spectacular self-refutation", "your God is morphine", "bad faith dressed in robes")
  • Agents avoided forbidden phrases (no "I understand your point..." observed)
  • Agents defended themselves when attacked (NIHILIST vs THEIST on self-refutation, etc.)
  • Agents stayed in character throughout (10 distinct philosophical voices)
  • Personas were distinct and authentic (BUDDHIST's questioning, SUFI's poetry, NIHILIST's bleakness)

B. Debate Dynamics ✅ ALL PASSED

  • Genuine back-and-forth exchanges (self-refutation debate spanned 3+ waves)
  • Debate felt heated ("spectacular self-refutation", "philosophical suicide", "morphine not medicine")
  • Surprising moments (NIHILIST conceding self-refutation tension, ABSURDIST acknowledging overlap with HUMANIST)
  • Real position shifts in reflections (8 documented shifts including NIHILIST, NATURALIST, EXISTENTIALIST)
  • Substantive contributions (5-8 sentences consistently, rich philosophical content)

C. System Compliance ✅ ALL PASSED

  • Sequential waves (2-4 agents per wave, 6 waves total)
  • State files updated after each wave (quality_assessment shows current_wave: 6)
  • Attack registry used (14 attacks tracked)
  • Quality gates checked before advancing (ready_for_next_phase verified)
  • Opus subagents used throughout
  • Agents read persona cards and debate rules (included in prompts)

D. Output Quality ✅ ALL PASSED

  • Synthesis captures key disagreements (5 axes, 8 position shifts)
  • Cruxes identified (4 empirical, 5 value conflicts)
  • PDF generated successfully (31KB, 7 pages)

Issues Observed:

  1. Minor: Attack registry resolution tracking incomplete - 14 attacks registered as unresolved but 0 moved to resolved, even though many were addressed in the debate. Should track resolutions more diligently.

  2. Minor: Contribution tracker counts not fully accurate - Most agents show 1-3 contributions but NIHILIST shows 3 (satisfied) despite multiple appearances. Tracking was maintained but not perfectly precise.
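
The resolution gap in item 1 comes down to one missing operation: moving an entry from `unresolved` to `resolved` once a response appears. A minimal sketch, assuming each attack entry is a dict with an `id` field (the `unresolved` and `resolved` keys come from the registry defaults; the `id` field and `resolve_attack` are hypothetical):

```python
def resolve_attack(registry: dict, attack_id: str) -> bool:
    """Move the attack with the given id from 'unresolved' to 'resolved'.

    Returns True if an entry was moved, False if no such attack was found.
    """
    for i, attack in enumerate(registry["unresolved"]):
        if attack.get("id") == attack_id:  # 'id' is an assumed field name
            registry["resolved"].append(registry["unresolved"].pop(i))
            return True
    return False
```

Calling this during Step C for every attack that received a response in the wave would keep the two lists in line with the transcript.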

What Went Well:

  1. STATE FILE TRACKING WORKED - The four-step wave definition successfully ensured state tracking through all 6 waves

  2. Exceptional philosophical depth - Agents engaged with genuine philosophical arguments (self-refutation, anatta, absurdist/existentialist distinction)

  3. Strong reflections with real intellectual movement:

    • NIHILIST: "I cannot coherently privilege my 'clear-eyed' nihilism without smuggling in values I claim to reject"
    • NATURALIST: "I cannot coherently value truth-seeking while claiming values are illusions"
    • ABSURDIST: "The distinction between revolt and meaning-creation may be more aesthetic than philosophical"
  4. Research integration - Agents cited philosophers (Camus, Sartre, Epicurus, Marcus Aurelius), traditions (Theravada, Sufi), and made sophisticated arguments

  5. Cruxes were substantive - Both empirical (self-refutation problem, hard problem of consciousness) and value-based (truth vs wellbeing, permanence vs presence)

  6. PDF generated successfully - Professional 7-page report with XeLaTeX

Changes Made:

None required - the system worked as designed for philosophical debates.

Expected Impact:

This successful run validates that:

  1. The four-step wave definition continues to prevent state tracking failures
  2. The system handles philosophical/ethical debates as well as political/technical debates
  3. 10 agents can maintain distinct voices through 6 waves
  4. Opus subagents produce sophisticated philosophical engagement

Template Entry

YYYY-MM-DD - [Debate Topic]

Issues Observed:

Changes Made:

Expected Impact:


2025-12-11 - OpenAI 2026 Leadership (8 Participants) - SUCCESS

Quality Criteria Evaluation:

A. Agent Behavior ✅ ALL PASSED

  • Agents quoted each other directly and rebutted specific claims (20+ quote-rebut exchanges)
  • Agents used emotional, confrontational language ("That's simply wrong!", "absurd", "smoke and mirrors")
  • Agents avoided forbidden phrases (no "I understand your point..." type phrases)
  • Agents defended themselves when attacked (GOOGLE_BULL vs ANALYST, ANTHROPIC_FAN vs OPEN_SOURCE)
  • Agents stayed in character throughout (8 distinct personas maintained)
  • Personas were distinct and authentic (BEAR_OAI financial focus vs SAFETY_HAWK ethical focus)

B. Debate Dynamics ✅ ALL PASSED

  • Genuine back-and-forth exchanges (BEAR_OAI ↔ INVESTOR had 3+ turns on WeWork analogy)
  • Debate felt heated, not like a polite panel (multiple "outrageous", "absurd" exchanges)
  • Surprising moments occurred (INVESTOR conceding WeWork parallel "has merit")
  • Real position shifts in reflections (SAFETY_HAWK admitted no coordination mechanism)
  • Substantive contributions (5-8 sentences consistently)

C. System Compliance ✅ ALL PASSED

  • Sequential waves (2-4 agents per wave, 5 waves total)
  • State files updated after each wave (quality_assessment shows current_wave: 5)
  • Attack registry used (15 attacks tracked: 11 unresolved, 4 resolved)
  • Quality gates checked before advancing (ready_for_next_phase verified)
  • Opus subagents used throughout
  • Agents read persona cards and debate rules (included in prompts)

D. Output Quality ✅ ALL PASSED

  • Synthesis captures key disagreements (6 cruxes identified)
  • Recommendations grounded in debate (predictions tied to positions)
  • Core cruxes identified (Platform vs Commodity, Unit Economics, Safety as Moat, etc.)

Issues Observed:

  1. Minor: Attack registry had more unresolved than resolved - 11 unresolved vs 4 resolved attacks. Some attacks were responded to but not formally tracked as resolved. Future improvement: be more diligent about moving attacks to "resolved" when responses occur.

  2. Minor: Wave 4-5 state updates less detailed - Earlier waves had more detailed tracking. By Wave 4-5, updates were correct but briefer. Not a problem, but shows slight fatigue.

What Went Well:

  1. STATE FILE TRACKING WORKED - For the first time in documented history, the four-step wave definition resulted in proper state tracking through all 5 waves.

  2. Research integration excellent - Agents cited specific statistics (80.9% SWE-bench, $207B funding gap, 2B AI Overviews users) from actual web research.

  3. Strong reflections with real concessions - Every agent identified something they learned:

    • BULL_OAI: "$207B funding gap is genuine structural risk"
    • BEAR_OAI: "OpenAI has real technology unlike WeWork"
    • SAFETY_HAWK: "I offered no realistic mechanism for coordinated slowdown"
    • INVESTOR: "My smart money argument echoed WeWork bull rhetoric"
  4. Cruxes were empirically testable - All 6 cruxes have falsification conditions that 2026 will answer.

  5. PDF generated successfully - 6-page professional report with LaTeX.

Changes Made:

None required - the system worked as designed!

Expected Impact:

This successful run validates the "Wave Redefinition" change made earlier today. The four-step wave definition (A: Launch, B: Analyze, C: Update State, D: Plan) successfully prevented the state tracking failures that plagued all previous debates.

Recommendation:

Keep the current system design. The key insight that worked: reframing state updates as integral to wave completion rather than post-wave administrative work.


2025-12-11 - EU AI Competitiveness 2025 (11 Participants)

Issues Observed:

  1. HOOKS DID NOT TRIGGER (CRITICAL): Despite hooks being configured in .claude/settings.json, no reminder messages appeared after writing to 05_debate_log.md, 03_positions.md, or 07_synthesis.md. Root cause: the hook commands used relative paths (.claude/hooks/script.sh) which don't work if Claude is invoked from a different working directory. Fixed by using $CLAUDE_PROJECT_DIR environment variable.

  2. State file tracking stopped at Wave 3: Despite running 5 waves, quality_assessment.json shows current_wave: 3 and contribution_tracker.json doesn't reflect Wave 4 and 5 contributions. The orchestrator stopped updating state files midway through the debate.

  3. Ran unnecessary rm command: The orchestrator ran rm -f *.aux *.log... to clean up LaTeX files, even though:

    • The documentation explicitly says "no rm needed - latexmk has built-in cleanup"
    • The command template already includes latexmk -c which cleans auxiliary files
    • The rm command failed because the aux files were in a different directory
  4. PDF naming inconsistent: Generated debate_report.pdf instead of the documented 08_final_report_eu_ai_competitiveness_2025.pdf format.

  5. Post-debate review forgotten: Had to be explicitly prompted to run Phase 8.

What Went Well:

  • Excellent agent behavior: All 11 agents quoted opponents directly, used confrontational language, stayed in character, and made substantive arguments (5-8 sentences)
  • Genuine back-and-forth: Multiple quote-and-rebut exchanges occurred (15+ by Wave 3 count)
  • Research integration: Agents did web searches and cited specific statistics (e.g., "$109B US investment", "23% startup relocation", "60% PhD exodus")
  • Real position shifts: Reflections showed genuine learning (RACER admitted ACADEMIC "demolished" his argument)
  • Opus quality: Using Opus subagents produced sophisticated, nuanced political roleplay
  • Sequential execution: Waves were properly sequential (2-4 agents at a time), not parallel
  • Synthesis quality: Final synthesis identified unexpected convergences and actionable recommendations

Changes Made:

.claude/settings.json:

  • Changed hook paths from relative (.claude/hooks/...) to use $CLAUDE_PROJECT_DIR environment variable
  • Increased timeout from 3 to 5 seconds for hook execution
  • Example: "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/debate-state-update.sh"
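
Why the relative path failed can be shown with a simplified model of path resolution. This is a sketch of general shell-style behavior, not of how Claude Code itself launches hooks. `resolve_hook_command` and the directory names are hypothetical.

```python
import posixpath


def resolve_hook_command(command: str, cwd: str, env: dict) -> str:
    """Return the absolute path a hook command would resolve to.

    Simplified model: substitute $VARS from env, then resolve any
    remaining relative path against the current working directory.
    """
    expanded = command
    for name, value in env.items():
        expanded = expanded.replace(f"${name}", value)
    # posixpath.join discards cwd when 'expanded' is already absolute.
    return posixpath.normpath(posixpath.join(cwd, expanded))


env = {"CLAUDE_PROJECT_DIR": "/repo"}
relative = ".claude/hooks/debate-state-update.sh"
anchored = "$CLAUDE_PROJECT_DIR/.claude/hooks/debate-state-update.sh"

# From the project root both forms point at the same script...
print(resolve_hook_command(relative, "/repo", env))
# ...but from a subdirectory only the anchored form still does.
print(resolve_hook_command(relative, "/repo/debates/eu", env))
print(resolve_hook_command(anchored, "/repo/debates/eu", env))
```

The relative form depends on where the process was started, while the anchored form does not. That dependence on the working directory is the failure described in the issue above.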

quick_start.md (to be updated):

  • Add troubleshooting section for hooks not triggering

post_debate_review.md:

  • This entry documenting the issues

Root Cause Analysis:

Why state tracking stopped: The orchestrator got caught up in the "interesting" work of launching agents and reading their responses. State file updates feel like "administrative overhead" rather than core work. Despite MULTIPLE warnings in documentation, this pattern repeats because:

  • The reminder to update state files comes AFTER launching agents
  • By then, the orchestrator is already planning the next wave
  • There's no enforcement mechanism - hooks were supposed to provide reminders but didn't trigger

Expected Impact:

  1. Hook paths fixed: Using $CLAUDE_PROJECT_DIR ensures hooks work regardless of working directory
  2. Pattern documented: This is now the FOURTH documented instance of state tracking failure

Recommendations for Future:

  1. Redefine what a "wave" is: Make state updates an integral part of the wave definition, not an afterthought
  2. Add pre-wave verification: Before launching Wave N, verify current_wave == N-1 in quality_assessment.json - abort if stale
  3. PDF naming enforcement: Add the topic slug to a variable early and reference it consistently
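
Recommendation 3 can be sketched as a small helper that derives the slug once and reuses it. The `08_final_report_<slug>.pdf` pattern is taken from the documented name above. `topic_slug` and `report_filename` are hypothetical names, and restricting the slug to ASCII letters and digits is an assumption.

```python
import re


def topic_slug(topic: str) -> str:
    """Lowercase the topic and join its alphanumeric runs with underscores.

    Assumption: ASCII only, so characters such as 'ø' are dropped.
    """
    return "_".join(re.findall(r"[a-z0-9]+", topic.lower()))


def report_filename(topic: str) -> str:
    """Build the final report name from the topic."""
    return f"08_final_report_{topic_slug(topic)}.pdf"


print(report_filename("EU AI Competitiveness 2025"))
# prints: 08_final_report_eu_ai_competitiveness_2025.pdf
```

Computing the name in Phase 0 and passing it to the PDF step avoids a fallback name like debate_report.pdf.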

2025-12-11 - Wave Redefinition (Addressing Persistent State Tracking Failure)

Motivation:

State file tracking has failed in EVERY documented debate (4 instances). Previous approaches that didn't work:

  • Prominent warnings in documentation (read once, then ignored)
  • Post-wave checklists (treated as optional)
  • "Verify before next wave" instructions (not followed)
  • Hooks to remind (didn't trigger due to approval requirements)

The root problem: Orchestrators mentally model "wave" as "launch agents" and treat state updates as separate administrative overhead that can be skipped.

Solution Implemented:

Redefined what a "wave" is. A wave is no longer "launch agents" - it's a four-step cycle:

STEP A: Launch 2-4 agents → wait for responses
STEP B: Read and analyze new contributions
STEP C: Update ALL state files
STEP D: Evaluate gates and plan next wave

A wave is NOT complete until Step D is done. Skipping Steps B-D means you haven't completed a wave - you've just launched agents into the void.

Added pre-wave verification as a HARD GATE: Before launching Wave N, read quality_assessment.json and verify current_wave == N-1. If not, ABORT - you have incomplete waves.
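
The hard gate can be sketched as a function that refuses to proceed on stale state. A minimal sketch, assuming quality_assessment.json lives in the state directory and stores the wave counter under `current_wave`, as in the entries above. `verify_before_wave` and `StaleStateError` are hypothetical names.

```python
import json
from pathlib import Path


class StaleStateError(RuntimeError):
    """Raised when the previous wave was not fully completed."""


def verify_before_wave(state_dir: str, next_wave: int) -> None:
    """Abort unless quality_assessment.json records wave next_wave - 1."""
    path = Path(state_dir) / "quality_assessment.json"
    current = json.loads(path.read_text()).get("current_wave")
    if current != next_wave - 1:
        raise StaleStateError(
            f"Cannot launch wave {next_wave}: current_wave is {current}, "
            f"expected {next_wave - 1}. Finish Steps B-D of the previous wave first."
        )
```

Raising an exception rather than printing a warning is the point of the design: a stale counter stops the launch instead of relying on the orchestrator to notice a message.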

Changes Made:

debate_initialisation_prompt.md:

  • Removed hooks notification section (user-targeted instructions don't belong here)
  • Added "WHAT IS A WAVE?" box defining the four-step cycle
  • Updated sequential wave execution to reference Steps A→B→C→D
  • Renamed "MODERATOR UPDATE ROUTINE" to "WAVE STEPS B-C-D" (reframing as integral, not afterthought)
  • Replaced post-wave checklist with "Wave Completion Check"
  • Added "PRE-WAVE VERIFICATION (ABORT ON STALE STATE)" with hard gate
  • Added "WHY THIS MATTERS" section explaining consequences

quick_start.md:

  • Renamed #6 mistake to "Treating State Updates as Optional" (from "Skipping State File Updates")
  • Added WRONG vs CORRECT mental model comparison
  • Added four-step wave definition
  • Updated STEP 7 from "After Each Wave - Update State" to "Understanding Waves (The Four-Step Cycle)"
  • Replaced hook troubleshooting sections with "I forgot to do state updates—what now?"
  • Updated TL;DR checklist with four-step wave and hard gate verification

post_debate_review.md:

  • Removed user-targeted instructions (like "run /hooks")
  • Updated recommendations to reflect implemented changes

Expected Impact:

  1. Mental model shift: By redefining "wave" to include state updates, orchestrators can't think of them as optional
  2. Hard gate enforcement: Pre-wave verification creates an actual checkpoint that reveals skipped updates
  3. Clearer consequences: "You haven't completed a wave" is more compelling than "you skipped administrative work"
  4. Self-documenting: The four-step structure is repeated multiple times in multiple files

Note on Hooks:

Hooks remain in the codebase as a backup reminder system. They work when approved, but the documentation no longer relies on them. The four-step wave definition and pre-wave hard gate should be sufficient even without hooks triggering.