
LLMBenchTUI

A terminal interface for benchmarking Large Language Models on coding tasks. Run single or parallel benchmark sessions with real-time monitoring and detailed scoring.

Quick Start

```shell
# Install dependencies
cd LLMBenchTUI
julia --project=. -e 'using Pkg; Pkg.instantiate()'

# Set your API key
export ANTHROPIC_API_KEY=your_api_key_here

# Run a benchmark
julia --project=. run_tui.jl --problem julia/csv_processing
```

Running Benchmarks

Single Session

Watch an LLM solve a problem in real-time:

```shell
julia --project=. run_tui.jl \
  --socket /tmp/llmbench.sock \
  --problem julia/csv_processing \
  --model claude-3-5-sonnet-20241022
```

Parallel Sessions

Run multiple sessions simultaneously to gather statistics:

```shell
julia --project=. run_parallel.jl \
  --socket /tmp/llmbench.sock \
  --problem julia/csv_processing \
  --count 10 \
  --model claude-3-5-sonnet-20241022
```

Command Options

run_tui.jl

  • -s, --socket PATH - Path to MCP server socket (default: /tmp/llmbench.sock)
  • -p, --problem ID - Problem to benchmark (e.g., julia/csv_processing)
  • -m, --model NAME - Model to use (default: claude-3-5-sonnet-20241022)
  • --max-iterations N - Optional iteration limit
  • --max-tokens N - Max tokens per response (default: 8192)

run_parallel.jl

  • -c, --count N - Number of parallel sessions (default: 5)
  • All options from run_tui.jl are also accepted

Available Problems

Check the problems/ directory in LLMBenchMCPServer for available benchmarks:

  • julia/csv_processing - CSV data manipulation
  • julia/package_creation - Create a Julia package
  • python/data_analysis - Python data analysis tasks
  • And more...

Understanding the Display

Single Session View

The screen shows four panels:

Top Left - Status

  • Current state (connecting → running → completed)
  • Problem being solved
  • Iteration counter

Top Right - Results

  • Grade percentage with color coding:
    • 🟢 Green (90-100%): Excellent
    • 🟡 Yellow (70-89%): Good
    • 🟠 Orange (50-69%): Passing
    • 🔴 Red (0-49%): Needs improvement
  • Individual test scores
  • Total time taken
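The color bands above follow simple percentage thresholds. As a sketch, they could be expressed as a small shell helper (the `grade_band` function name is illustrative, not part of the TUI):

```shell
# Map a grade percentage to its display color band.
# Thresholds mirror the list above: >=90 green, >=70 yellow,
# >=50 orange, otherwise red.
grade_band() {
  g=$1
  if [ "$g" -ge 90 ]; then echo green
  elif [ "$g" -ge 70 ]; then echo yellow
  elif [ "$g" -ge 50 ]; then echo orange
  else echo red
  fi
}

grade_band 85   # → yellow
```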

Middle - Conversation

  • Live chat between the LLM and the system
  • Shows the last 5 messages

Bottom - Tool Calls

  • Commands being executed
  • ✓ Success / ✗ Error / ⟳ Running

Parallel Dashboard

Header

  • Session count and completion progress
  • Weighted pass rate (mean grade across all sessions)
  • Overall progress bar
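Taking the pass rate as the mean of per-session grades, the computation can be sketched with `awk` (the grade values below are made-up samples, not real benchmark output):

```shell
# Sketch: mean of per-session grade percentages, one grade per line.
printf '92\n78\n100\n64\n' | awk '{ sum += $1; n++ } END { printf "%.1f\n", sum / n }'
# → 83.5
```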

Session Grid

  • Each session shows:
    • Status icon (🔄 Running, ✅ Done, ❌ Failed, ⏱️ Hit limit)
    • Grade percentage
    • Iteration count
    • Progress bar

Output Files

Results are saved automatically:

  • transcript_<problem>_<timestamp>.json - Single session details
  • parallel_results_<problem>_<timestamp>.json - Parallel run summary
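Because the timestamp in the filename sorts lexicographically, the newest transcript for a problem can be found with a plain reverse sort. A minimal sketch (the directory and timestamp format below are made up for illustration):

```shell
# Sketch: locate the newest transcript by its timestamp suffix.
# These example files stand in for real benchmark output.
dir=$(mktemp -d)
touch "$dir/transcript_csv_processing_20240101T120000.json"
touch "$dir/transcript_csv_processing_20240102T090000.json"
ls "$dir" | sort -r | head -n 1
# → transcript_csv_processing_20240102T090000.json
```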

Server Setup

First, start the benchmark server:

```shell
cd /path/to/LLMBenchMCPServer
julia --project=. -e 'using LLMBenchMCPServer; LLMBenchMCPServer.run_socket_server("/tmp/llmbench.sock")'
```

Then run the TUI client as shown above.
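If the client is launched before the server finishes binding the socket, it will fail with "Connection refused". One way to avoid that is to poll for the socket first; this is a sketch, and the retry count and delay are arbitrary:

```shell
# Sketch: wait briefly for the server's Unix socket to appear
# before starting the TUI client.
SOCK=/tmp/llmbench.sock
ready=no
for _ in 1 2 3 4 5; do
  if [ -S "$SOCK" ]; then ready=yes; break; fi
  sleep 0.2
done
echo "socket ready: $ready"
```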

Tips

  • Use parallel mode to test consistency across multiple runs
  • No iteration limit by default; sessions run until the problem is solved or the attempt fails
  • Add --max-iterations to prevent runaway sessions
  • Transcripts contain full conversation history for debugging

Troubleshooting

"Connection refused" error

  • Make sure the MCP server is running first
  • Check the socket path matches between server and client

"API key not found" error

  • Set ANTHROPIC_API_KEY environment variable
  • Or use --api-key flag

License

MIT
