A set of benchmarks to measure the impact of Routerly's intelligent routing on quality, cost, and latency compared to calling the underlying LLMs directly.
An AI router intercepts every request and assigns it to the most suitable model based on the estimated complexity of the task. The expected outcome is accuracy close to premium models at a significantly lower average cost, because simple requests are served by cheaper models.
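As a rough illustration of the idea, the sketch below routes by a crude complexity score. This heuristic is entirely hypothetical and is not Routerly's actual logic; the model names are the ones benchmarked here.

```python
# Hypothetical sketch of complexity-based routing -- NOT Routerly's real logic.
def estimate_complexity(prompt: str) -> float:
    """Crude proxy: longer prompts and code/SQL/proof markers score higher."""
    score = min(len(prompt) / 2000, 1.0)
    if any(kw in prompt.lower() for kw in ("select ", "def ", "class ", "prove")):
        score += 0.5
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Pick a model tier from the estimated task complexity."""
    c = estimate_complexity(prompt)
    if c < 0.3:
        return "gpt-4.1-nano"        # cheap tier for simple requests
    if c < 0.7:
        return "claude-sonnet-4-6"   # mid tier
    return "claude-opus-4-6"         # premium tier for hard tasks
```

The cost saving comes precisely from the first branch: if most traffic is simple, most traffic never reaches the premium tier.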
The three benchmarks cover different dimensions:
| Benchmark | Task | Dataset | Main metric |
|---|---|---|---|
| MMLU | Multi-subject reasoning | MMLU (57 subjects, multiple choice) | Accuracy (%) |
| HumanEval | Python code generation | OpenAI HumanEval (164 problems) | pass@1 (%) |
| BIRD | SQL generation on real databases | BIRD text-to-SQL (95 databases) | Execution accuracy (%) |
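For HumanEval, pass@1 here means the fraction of problems solved on the first attempt. When n samples are drawn per problem, the standard unbiased pass@k estimator (from the original HumanEval work) can be computed as below; this is the conventional formula, not code taken from these benchmark scripts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # fewer failures than k draws: at least one success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 1 (one completion per problem, as in pass@1), this reduces to the plain fraction of problems solved.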
Latest comparative run results on the reference models:

**MMLU**

| Model | Accuracy |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 93.3% |
| gpt-4.1-nano | 70.0% |

**HumanEval**

| Model | pass@1 |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 86.7% |
| gpt-4.1-nano | 76.7% |

**BIRD**

| Model | Execution accuracy |
|---|---|
| claude-opus-4-6 | 66.7% |
| claude-sonnet-4-6 | 56.7% |
| gpt-4.1-nano | 36.7% |
```
demo/
├── README.md              # this file
├── MMLU Benchmark/
│   ├── benchmark.py
│   ├── README
│   ├── requirements.txt
│   ├── .env_*             # credentials per target
│   └── results/           # JSON run outputs
├── HumanEval/
│   ├── benchmark.py
│   ├── README
│   ├── requirements.txt
│   ├── .env_*
│   └── results/
└── BIRD/
    ├── benchmark.py
    ├── README
    ├── requirements.txt
    ├── .env_*
    ├── bird/              # BIRD dataset (manual download)
    └── results/
```
Every benchmark is target-agnostic: it reads BASE_URL, API_KEY, and MODEL from a .env file and runs identically against any OpenAI-compatible endpoint.
| Target | .env file |
|---|---|
| Routerly | .env_routerly |
| Anthropic Claude Opus | .env_anthropic_opus |
| Anthropic Claude Sonnet | .env_anthropic_sonnet |
| OpenAI GPT-4.1-nano | .env_openai_41-nano |
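Conceptually, each benchmark turns those three variables into one OpenAI-compatible request. The minimal stdlib sketch below shows the mechanics (the actual benchmark.py may use a client library instead; `load_env` and `chat_request` are illustrative helpers, not functions from these scripts):

```python
def load_env(path: str) -> dict:
    """Minimal .env parser: KEY=VALUE lines, '#' comments ignored."""
    cfg = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if "=" in line:
                key, _, value = line.partition("=")
                cfg[key.strip()] = value.strip()
    return cfg

def chat_request(cfg: dict, prompt: str) -> tuple[str, dict, dict]:
    """Build the URL, headers, and JSON body of an OpenAI-compatible call."""
    url = cfg["BASE_URL"].rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {cfg['API_KEY']}",
        "Content-Type": "application/json",
    }
    body = {
        "model": cfg["MODEL"],  # "auto" for Routerly, a concrete model id otherwise
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body
```

Because only the URL, key, and model name change between targets, swapping the .env file is all it takes to point the same benchmark at Routerly or at a provider directly.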
Each benchmark has its own virtualenv. Example for MMLU:

```
cd "MMLU Benchmark"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Using the same --seed across all runs guarantees the same question subset, which is essential for valid comparisons.
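The seed guarantee follows the standard seeded-RNG pattern sketched below (illustrative; the sampling code in benchmark.py may differ in detail):

```python
import random

def sample_questions(pool: list, k: int, seed: int) -> list:
    """Deterministically pick the same k-question subset for a given seed."""
    rng = random.Random(seed)  # local RNG, independent of global random state
    return rng.sample(pool, k)
```

Two runs with the same seed draw from identical RNG state, so every target answers exactly the same questions.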
```
# Terminal 1
python benchmark.py --env .env_routerly --seed 42

# Terminal 2
python benchmark.py --env .env_anthropic_opus --seed 42

# Terminal 3
python benchmark.py --env .env_anthropic_sonnet --seed 42

# Terminal 4
python benchmark.py --env .env_openai_41-nano --seed 42
```

Results are saved to results/<timestamp>_<env_label>.json.
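Once all four runs finish, the per-target JSON files can be collected into one comparison table. The sketch below assumes the results/<timestamp>_<env_label>.json naming and a top-level "accuracy" field in each file; the actual schema written by benchmark.py may differ.

```python
import glob
import json
import os

def summarize(results_dir: str = "results") -> dict:
    """Map each run's env label to its main metric.

    Assumes filenames like <timestamp>_<env_label>.json and a top-level
    "accuracy" key in each JSON file (hypothetical schema).
    """
    summary = {}
    for path in sorted(glob.glob(os.path.join(results_dir, "*.json"))):
        with open(path) as f:
            data = json.load(f)
        name = os.path.basename(path)
        label = name.rsplit("_", 1)[-1].removesuffix(".json")
        summary[label] = data.get("accuracy")
    return summary
```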
Each subdirectory contains a ready-to-use .env.example. Copy it once per target you want to run:

```
cp .env.example .env_routerly           # Routerly router
cp .env.example .env_anthropic_opus     # Claude Opus direct
cp .env.example .env_anthropic_sonnet   # Claude Sonnet direct
cp .env.example .env_openai_41-nano     # GPT-4.1-nano direct
```

Then edit each copy with the correct BASE_URL, API_KEY, and MODEL:
| Target | BASE_URL | MODEL |
|---|---|---|
| Routerly | https://api.routerly.ai/v1 | auto |
| Anthropic Claude Opus | https://api.anthropic.com/v1 | claude-opus-4-6 |
| Anthropic Claude Sonnet | https://api.anthropic.com/v1 | claude-sonnet-4-6 |
| OpenAI GPT-4.1-nano | https://api.openai.com/v1 | gpt-4.1-nano |
Optional variables:

```
SHOW_COST=true
PRICE_INPUT=15.0       # $/M input tokens
PRICE_OUTPUT=75.0      # $/M output tokens
REASONING_EFFORT=low   # low / medium / high (models that support it)
```

.env_* files must not be committed; they are already listed in .gitignore.
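With SHOW_COST enabled, the run cost follows directly from the token counts and the two per-million prices. A sketch of the arithmetic (the function name is illustrative, not from benchmark.py):

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_input: float, price_output: float) -> float:
    """Cost in dollars; prices are given per million tokens ($/M)."""
    return (input_tokens * price_input + output_tokens * price_output) / 1_000_000
```

For example, at the prices above (15.0 in, 75.0 out), a run using 200k input and 50k output tokens costs 200000 * 15 / 1e6 + 50000 * 75 / 1e6 = 3.00 + 3.75 = 6.75 dollars.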