This project implements a groundbreaking automated coding tournament where multiple advanced Large Language Models (LLMs) collaboratively compete and iteratively refine their solutions to complex coding challenges across multiple rounds. It's designed to harness collective intelligence, pushing LLM-generated code toward unprecedented quality and efficiency through continual peer feedback and integration.
The best way to understand this project and where it came from is to read the article that led directly to its creation: LLM Multi-Round Coding Tournament.
Traditional methods for utilizing LLMs in software development typically involve single-query, single-response interactions. This project transforms that paradigm by:
- Iterative Collaboration: Each model critically analyzes and improves upon solutions generated by others, progressively enhancing code quality across rounds.
- Automated Code Synthesis: Merges the strongest features from multiple solutions into optimized, robust, and elegant code.
- Deep Analysis: Tracks comprehensive performance metrics, including complexity estimates, execution efficiency, and solution robustness, enabling fine-grained insight into model capabilities.
- Multi-Round Refinement: Models iteratively build upon and refine each other's solutions.
- Adaptive Prompt Engineering: Automatically constructs sophisticated prompts to guide models toward meaningful integration and optimization of solutions.
- Robust Code Extraction: Utilizes AI-driven extraction to structure raw LLM outputs into well-formed, executable Python classes.
- Concurrent Execution & Error Handling: Manages multiple LLM queries simultaneously, with built-in retry logic and error recovery for reliability (see the sketch after this list).
- Automated Testing & Metrics Collection: Generates automated test suites and comprehensive metrics to objectively evaluate each solution.
- Analytics: Produces detailed markdown reports to clearly represent improvements and performance across tournament rounds.
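For orientation, here is a minimal, hypothetical sketch of what the concurrent round loop could look like. The model names are illustrative, and the provider-specific query function is passed in as a parameter because the script's actual client API isn't reproduced here:

```python
import asyncio
from typing import Awaitable, Callable

MODELS = ["claude", "gpt", "mistral"]  # illustrative model identifiers, not the script's actual list

QueryFn = Callable[[str, str], Awaitable[str]]  # (model, prompt) -> response text

async def query_with_retry(query: QueryFn, model: str, prompt: str, retries: int = 3) -> str:
    """Call one model, retrying with exponential backoff if the request fails."""
    for attempt in range(1, retries + 1):
        try:
            return await query(model, prompt)
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(2 ** attempt)  # back off before the next attempt

async def run_round(query: QueryFn, prompt: str, models: list[str] = MODELS) -> dict[str, str]:
    """Query every model concurrently and map each model name to its response."""
    responses = await asyncio.gather(*(query_with_retry(query, m, prompt) for m in models))
    return dict(zip(models, responses))
```

In later rounds, the prompt passed into such a loop would embed the previous round's solutions, as sketched further below.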
The script employs an innovative approach by harnessing LLM capabilities to convert raw model responses into structured, executable Python code:
- LLM-powered prompts precisely instruct a model to encapsulate each solution's code into a cohesive, self-contained class.
- Resulting code is cached and versioned, ensuring consistency and ease of reuse across rounds.
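As an illustration of this extraction-and-caching step, the sketch below hashes the raw response, reuses a cached extraction when one exists, and otherwise asks an LLM (via a caller-supplied query function) to restructure the response into a single class. The function name, prompt wording, and cache location are assumptions, not the script's actual interface:

```python
import hashlib
from pathlib import Path
from typing import Callable

EXTRACTION_PROMPT = (
    "Rewrite the following response as a single, self-contained Python class named "
    "Solution, with all imports included and no prose outside the code:\n\n{raw}"
)

def extract_solution_class(
    raw_response: str,
    query: Callable[[str], str],           # any function that sends a prompt to an LLM
    cache_dir: Path = Path("code_cache"),  # hypothetical cache location
) -> str:
    """Ask an LLM to wrap a raw answer into one executable class, caching the result."""
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha256(raw_response.encode()).hexdigest()[:16]
    cached = cache_dir / f"{key}.py"
    if cached.exists():  # identical raw responses reuse the earlier extraction
        return cached.read_text()
    code = query(EXTRACTION_PROMPT.format(raw=raw_response))
    cached.write_text(code)
    return code
```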
Each tournament round dynamically synthesizes prompts by carefully analyzing solutions from the previous round, ensuring each iteration meaningfully integrates new insights and improvements. This method maximizes model performance by systematically leveraging collective strengths.
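A hypothetical prompt builder along these lines might look as follows; the wording is illustrative, not the project's actual template:

```python
def build_refinement_prompt(challenge: str, prior_solutions: dict[str, str]) -> str:
    """Assemble a round prompt that asks a model to merge the best ideas so far.

    `prior_solutions` maps a model name to its previous-round code.
    """
    sections = "\n\n".join(
        f"### Solution from {model}\n{code}" for model, code in prior_solutions.items()
    )
    return (
        f"{challenge}\n\n"
        "Below are the solutions produced in the previous round. Analyze their "
        "strengths and weaknesses, then produce a single improved solution that "
        "integrates the best ideas while fixing any remaining bugs:\n\n"
        f"{sections}"
    )
```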
The project meticulously tracks and computes metrics such as:
- Code complexity (functions, classes, decision points)
- Efficiency metrics (execution time, output size)
- Robustness indicators (error handling, edge-case coverage)
These metrics are critical for evaluating not just correctness, but the overall quality and maintainability of generated solutions.
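As a concrete illustration, complexity-style metrics like these can be computed directly from a solution's source with Python's `ast` module. This is a sketch of the idea, not necessarily the script's implementation:

```python
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)

def complexity_metrics(source: str) -> dict[str, int]:
    """Count functions, classes, decision points, and error handlers in a solution."""
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "functions": sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes),
        "classes": sum(isinstance(n, ast.ClassDef) for n in nodes),
        "decision_points": sum(isinstance(n, DECISION_NODES) for n in nodes),
        "error_handlers": sum(isinstance(n, ast.ExceptHandler) for n in nodes),
    }
```

Running such a function over each round's solutions yields counts that can be compared across rounds and models.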
- Python 3.13
- API keys for LLM providers (Anthropic, OpenAI, Mistral)
- Clone the repository:

  ```bash
  git clone https://github.com/Dicklesworthstone/llm-tournament
  cd llm-tournament
  ```

- Install dependencies using `uv`:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  uv venv --python 3.13
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```

- Configure API keys in the `.env` file:

  ```bash
  ANTHROPIC_API_KEY="your_anthropic_key"
  OPENAI_API_KEY="your_openai_key"
  MISTRAL_API_KEY="your_mistral_key"
  ```
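If you want to confirm the keys are picked up before launching a tournament, a quick check like the following works, assuming the widely used `python-dotenv` package (the script's own loading code may differ):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file in the current directory

missing = [k for k in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "MISTRAL_API_KEY") if not os.getenv(k)]
if missing:
    raise SystemExit(f"Missing API keys in .env: {', '.join(missing)}")
```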
- `llm_tournament.py`: Main automation script orchestrating the tournament.
- `challenge_prompt.md`: Initial coding challenge.
- `messy_csv_sample.csv`: Test data for evaluating solution accuracy.
- `README.md`: Comprehensive project documentation.
Run a full tournament cycle with default parameters:
```bash
python llm_tournament.py --prompt challenge_prompt.md --test-file messy_csv_sample.csv
```
If previous responses exist, the script intelligently skips redundant calls, ensuring efficiency.
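The reuse check presumably works along these lines; the file layout shown here (per-round directories under the output directory) is an assumption for illustration only:

```python
from pathlib import Path
from typing import Callable

def get_or_query(model: str, round_num: int, prompt: str,
                 query: Callable[[str, str], str],
                 out_dir: Path = Path("tournament_results")) -> str:
    """Reuse a saved response for this model and round if present; otherwise query and save."""
    response_file = out_dir / f"round_{round_num}" / f"{model}_response.md"  # hypothetical layout
    if response_file.exists():
        return response_file.read_text()
    response = query(model, prompt)
    response_file.parent.mkdir(parents=True, exist_ok=True)
    response_file.write_text(response)
    return response
```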
Customize your tournament with detailed control:
```bash
python llm_tournament.py --prompt challenge_prompt.md --test-file messy_csv_sample.csv --rounds 3 --temperature 0.8 --concurrent-requests 4 --verbose
```
- `--prompt`: Initial challenge prompt (required).
- `--rounds`: Number of iterative refinement rounds (default: 5).
- `--output-dir`: Directory for storing tournament artifacts (default: `tournament_results`).
- `--test-file`: Test file for validating solutions.
- `--temperature`: Controls creativity/randomness of LLM responses (default: 0.7).
- `--concurrent-requests`: Limits concurrent API calls (default: 4).
- `--skip-tests`: Skips solution validation tests.
- `--verbose`: Enables detailed logging.
Running the script generates a comprehensive suite of outputs:
- Individual model solutions organized by rounds.
- Hybrid synthesized solutions representing collective model intelligence.
- Detailed performance metrics, visual analytics, and insightful markdown reports.
- A test harness facilitating straightforward evaluation and comparison of solutions.
A typical execution:
```bash
python llm_tournament.py --prompt challenge_prompt.md --test-file messy_csv_sample.csv --rounds 3
```
This workflow will:
- Prompt multiple LLMs with the challenge.
- Collect, analyze, and refine solutions iteratively.
- Automatically test each refined solution.
- Generate detailed performance reports and visualization of improvement.
MIT License