# LLMBenchTUI

A terminal interface for benchmarking Large Language Models on coding tasks. Run single or parallel benchmark sessions with real-time monitoring and detailed scoring.
## Quick Start

```shell
# Install dependencies
cd LLMBenchTUI
julia --project=. -e 'using Pkg; Pkg.instantiate()'

# Set your API key
export ANTHROPIC_API_KEY=your_api_key_here

# Run a benchmark
julia --project=. run_tui.jl --problem julia/csv_processing
```

## Single Session

Watch an LLM solve a problem in real-time:
```shell
julia --project=. run_tui.jl \
    --socket /tmp/llmbench.sock \
    --problem julia/csv_processing \
    --model claude-3-5-sonnet-20241022
```

## Parallel Mode

Run multiple sessions simultaneously to gather statistics:
```shell
julia --project=. run_parallel.jl \
    --socket /tmp/llmbench.sock \
    --problem julia/csv_processing \
    --count 10 \
    --model claude-3-5-sonnet-20241022
```

## Options

`run_tui.jl`:

- `-s, --socket PATH` - Path to the MCP server socket (default: `/tmp/llmbench.sock`)
- `-p, --problem ID` - Problem to benchmark (e.g., `julia/csv_processing`)
- `-m, --model NAME` - Model to use (default: `claude-3-5-sonnet-20241022`)
- `--max-iterations N` - Optional iteration limit
- `--max-tokens N` - Max tokens per response (default: 8192)

`run_parallel.jl`:

- `-c, --count N` - Number of parallel sessions (default: 5)
- All options from `run_tui.jl`
## Available Problems

Check the `problems/` directory in LLMBenchMCPServer for available benchmarks:

- `julia/csv_processing` - CSV data manipulation
- `julia/package_creation` - Create a Julia package
- `python/data_analysis` - Python data analysis tasks
- And more...
## Display

The screen shows four panels:
**Top Left - Status**
- Current state (connecting → running → completed)
- Problem being solved
- Iteration counter
**Top Right - Results**
- Grade percentage with color coding:
- 🟢 Green (90-100%): Excellent
- 🟡 Yellow (70-89%): Good
- 🟠 Orange (50-69%): Passing
- 🔴 Red (0-49%): Needs improvement
- Individual test scores
- Total time taken
**Middle - Conversation**
- Live chat between LLM and system
- Shows last 5 messages
**Bottom - Tool Calls**
- Commands being executed
- ✓ Success / ✗ Error / ⟳ Running
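The grade color bands above map cleanly onto threshold checks. As a rough illustration in Python (the TUI itself is written in Julia, so this is a sketch of the banding logic, not the actual implementation):

```python
def grade_band(pct: float) -> str:
    """Map a grade percentage to the color band described above."""
    if pct >= 90:
        return "green"   # Excellent
    if pct >= 70:
        return "yellow"  # Good
    if pct >= 50:
        return "orange"  # Passing
    return "red"         # Needs improvement

print(grade_band(85))  # yellow
```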
In parallel mode, the display shows:

**Header**
- Session count and completion progress
- Weighted pass rate (average of all grades)
- Overall progress bar
**Session Grid**
- Each session shows:
- Status icon (🔄 Running, ✅ Done, ❌ Failed, ⏱️ Hit limit)
- Grade percentage
- Iteration count
- Progress bar
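The weighted pass rate shown in the header is described above as the average of all session grades. Ignoring any per-session weighting the tool may apply internally, a minimal Python sketch of that aggregation:

```python
def weighted_pass_rate(grades: list[float]) -> float:
    """Average the grade percentages across sessions; 0.0 if none completed yet."""
    return sum(grades) / len(grades) if grades else 0.0

# Three sessions scoring 100%, 80%, and 60% average to 80%
print(weighted_pass_rate([100.0, 80.0, 60.0]))  # 80.0
```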
## Output Files

Results are saved automatically:

- `transcript_<problem>_<timestamp>.json` - Single session details
- `parallel_results_<problem>_<timestamp>.json` - Parallel run summary
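Because the output files follow fixed naming patterns, transcripts can be picked out of a directory listing by pattern matching. A Python sketch (the example filenames are hypothetical, and how the tool sanitizes the `julia/` prefix of a problem ID in filenames is an assumption):

```python
from fnmatch import fnmatch

def is_transcript(filename: str, problem: str) -> bool:
    """Match the transcript_<problem>_<timestamp>.json naming convention."""
    return fnmatch(filename, f"transcript_{problem}_*.json")

files = [
    "transcript_csv_processing_20240101T120000.json",
    "parallel_results_csv_processing_20240101T120000.json",
]
print([f for f in files if is_transcript(f, "csv_processing")])
# ['transcript_csv_processing_20240101T120000.json']
```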
## Server Setup

First, start the benchmark server:

```shell
cd /path/to/LLMBenchMCPServer
julia --project=. -e 'using LLMBenchMCPServer; LLMBenchMCPServer.run_socket_server("/tmp/llmbench.sock")'
```

Then run the TUI client as shown above.
## Tips

- Use parallel mode to test consistency across multiple runs
- No iteration limit by default - sessions run until the problem is solved or fails
- Add `--max-iterations` to prevent runaway sessions
- Transcripts contain full conversation history for debugging
## Troubleshooting

**"Connection refused" error**

- Make sure the MCP server is running first
- Check that the socket path matches between server and client

**"API key not found" error**

- Set the `ANTHROPIC_API_KEY` environment variable
- Or pass the `--api-key` flag
## License

MIT