# LLMBenchTUI

A terminal interface for benchmarking Large Language Models on coding tasks. Run single or parallel benchmark sessions with real-time monitoring and detailed scoring.
## Quick Start

```shell
# Install dependencies
cd LLMBenchTUI
julia --project=. -e 'using Pkg; Pkg.instantiate()'

# Set your API key
export ANTHROPIC_API_KEY=your_api_key_here

# Run a benchmark
julia --project=. run_tui.jl --problem julia/csv_processing
```

## Single Session

Watch an LLM solve a problem in real-time:
```shell
julia --project=. run_tui.jl \
    --socket /tmp/llmbench.sock \
    --problem julia/csv_processing \
    --model claude-3-5-sonnet-20241022
```

## Parallel Mode

Run multiple sessions simultaneously to gather statistics:
```shell
julia --project=. run_parallel.jl \
    --socket /tmp/llmbench.sock \
    --problem julia/csv_processing \
    --count 10 \
    --model claude-3-5-sonnet-20241022
```

## Options

`run_tui.jl`:

- `-s, --socket PATH` - Path to the MCP server socket (default: `/tmp/llmbench.sock`)
- `-p, --problem ID` - Problem to benchmark (e.g., `julia/csv_processing`)
- `-m, --model NAME` - Model to use (default: `claude-3-5-sonnet-20241022`)
- `--max-iterations N` - Optional iteration limit
- `--max-tokens N` - Max tokens per response (default: 8192)

`run_parallel.jl`:

- `-c, --count N` - Number of parallel sessions (default: 5)
- All options from `run_tui.jl`
## Available Problems

Check the `problems/` directory in LLMBenchMCPServer for available benchmarks:

- `julia/csv_processing` - CSV data manipulation
- `julia/package_creation` - Create a Julia package
- `python/data_analysis` - Python data analysis tasks
- And more...
## Display

The screen shows four panels:
**Top Left - Status**
- Current state (connecting → running → completed)
- Problem being solved
- Iteration counter
**Top Right - Results**
- Grade percentage with color coding:
- 🟢 Green (90-100%): Excellent
- 🟡 Yellow (70-89%): Good
- 🟠 Orange (50-69%): Passing
- 🔴 Red (0-49%): Needs improvement
- Individual test scores
- Total time taken
**Middle - Conversation**
- Live chat between LLM and system
- Shows last 5 messages
**Bottom - Tool Calls**
- Commands being executed
- ✓ Success / ✗ Error / ⟳ Running
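The grade color bands above map cleanly onto threshold checks. As a rough illustration in Python (the TUI itself is written in Julia, so this is a sketch of the banding logic, not the actual implementation):

```python
def grade_band(pct: float) -> str:
    """Map a grade percentage to the color band described above."""
    if pct >= 90:
        return "green"   # Excellent
    if pct >= 70:
        return "yellow"  # Good
    if pct >= 50:
        return "orange"  # Passing
    return "red"         # Needs improvement

print(grade_band(85))  # yellow
```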
In parallel mode, the display shows:

**Header**
- Session count and completion progress
- Weighted pass rate (average of all grades)
- Overall progress bar
**Session Grid**
- Each session shows:
- Status icon (🔄 Running, ✅ Done, ❌ Failed, ⏱️ Hit limit)
- Grade percentage
- Iteration count
- Progress bar
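The weighted pass rate shown in the header is described above as the average of all session grades. Ignoring any per-session weighting the tool may apply internally, a minimal Python sketch of that aggregation:

```python
def weighted_pass_rate(grades: list[float]) -> float:
    """Average the grade percentages across sessions; 0.0 if none completed yet."""
    return sum(grades) / len(grades) if grades else 0.0

# Three sessions scoring 100%, 80%, and 60% average to 80%
print(weighted_pass_rate([100.0, 80.0, 60.0]))  # 80.0
```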
## Output Files

Results are saved automatically:

- `transcript_<problem>_<timestamp>.json` - Single session details
- `parallel_results_<problem>_<timestamp>.json` - Parallel run summary
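Because the output files follow fixed naming patterns, transcripts can be picked out of a directory listing by pattern matching. A Python sketch (the example filenames are hypothetical, and how the tool sanitizes the `julia/` prefix of a problem ID in filenames is an assumption):

```python
from fnmatch import fnmatch

def is_transcript(filename: str, problem: str) -> bool:
    """Match the transcript_<problem>_<timestamp>.json naming convention."""
    return fnmatch(filename, f"transcript_{problem}_*.json")

files = [
    "transcript_csv_processing_20240101T120000.json",
    "parallel_results_csv_processing_20240101T120000.json",
]
print([f for f in files if is_transcript(f, "csv_processing")])
# ['transcript_csv_processing_20240101T120000.json']
```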
## Server Setup

First, start the benchmark server:

```shell
cd /path/to/LLMBenchMCPServer
julia --project=. -e 'using LLMBenchMCPServer; LLMBenchMCPServer.run_socket_server("/tmp/llmbench.sock")'
```

Then run the TUI client as shown above.
## Tips

- Use parallel mode to test consistency across multiple runs
- No iteration limit by default - sessions run until the problem is solved or fails
- Add `--max-iterations` to prevent runaway sessions
- Transcripts contain full conversation history for debugging
## Troubleshooting

**"Connection refused" error**

- Make sure the MCP server is running first
- Check that the socket path matches between server and client

**"API key not found" error**

- Set the `ANTHROPIC_API_KEY` environment variable
- Or pass the `--api-key` flag
## License

MIT