A benchmarking framework that evaluates local LLMs for Home Assistant voice control using Inspect AI.
Given a voice command and a smart home entity inventory, does the model call the right HA intent tool with the right arguments? The framework sends HA-formatted prompts to a local llama.cpp server and scores the model's tool-call responses against expected results — without going through the full HA pipeline.
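For a feel of what gets evaluated, here is a sketch of a single test case pairing a voice command with the tool call the model is expected to make. The field names are hypothetical; the real NDJSON schema is defined in docs/test-data-format.md.

```python
# Hypothetical test-case record -- field names are illustrative only;
# see docs/test-data-format.md for the actual schema.
example_case = {
    "id": "light_on_kitchen_01",
    "command": "Turn on the kitchen lights",          # the voice command sent to the model
    "expected": {
        "tool": "HassTurnOn",                          # the HA intent tool the model should call
        "arguments": {"name": "Kitchen Lights"},       # the expected tool arguments
    },
}
```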
```bash
# Install dependencies (--extra dev includes pytest, ruff, etc.)
uv sync --extra dev

# Run tests
pytest

# Lint
ruff check .

# Run a benchmark eval (see docs/user-setup.md for server setup)
cp docs/env.example .env
# Edit .env with your server address
uv run inspect eval src/ha_voice_bench/task.py --model openai/local --max-connections 1 --display plain
```

See docs/architecture.md for the full design.
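The same eval can also be launched from Python via Inspect's programmatic entry point. A minimal sketch, assuming the same `.env`/server setup as above:

```python
# Programmatic equivalent of the CLI invocation above.
from inspect_ai import eval

logs = eval(
    "src/ha_voice_bench/task.py",  # same task file as the CLI command
    model="openai/local",          # OpenAI-compatible endpoint (llama.cpp server)
    max_connections=1,             # serialize requests to the local server
)
```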
```
Test Cases (NDJSON) + Inventory (YAML)
                ↓
Dataset Loader → Inspect Samples
                ↓
Solver → HA system prompt + tool defs + generate()
                ↓
Tier 1 Scorer → multi-dimensional C/I/N scores
                ↓
inspect view → browse results in browser
```
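A sketch of how the stages above might be wired into an Inspect `Task`. The project-level imports (`load_dataset`, `ha_solver`, `tool_call_scorer`) are hypothetical names standing in for the real helpers under `src/ha_voice_bench/`:

```python
# Sketch only: assembling the pipeline stages into an Inspect Task.
from inspect_ai import Task, task

from ha_voice_bench.dataset import load_dataset                 # hypothetical name
from ha_voice_bench.solver import ha_solver                     # hypothetical name
from ha_voice_bench.scorers.tool_call import tool_call_scorer   # hypothetical name

@task
def ha_voice_bench():
    return Task(
        dataset=load_dataset("sample_test_data/small_test_cases.ndjson"),
        solver=ha_solver(),          # HA system prompt + tool defs + generate()
        scorer=tool_call_scorer(),   # Tier 1 multi-dimensional C/I/N scoring
    )
```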
```
src/ha_voice_bench/
├── dataset.py          # NDJSON loader → Inspect Samples
├── tools.py            # HA intent ToolDef objects (11 MVP / 31 full)
├── prompt.py           # System prompt assembly with entity inventory
├── solver.py           # Inspect Solver: wires prompt + tools + generate()
├── task.py             # Inspect Task entry point
└── scorers/
    └── tool_call.py    # Tier 1: tool-call correctness (multi-dimensional)

sample_test_data/
├── small.yaml               # 10-entity, 6-area inventory
├── small_test_cases.ndjson  # 25 test cases
├── sample_inventory.yaml    # 2-entity fixture for unit tests
└── sample_test_cases.ndjson # 5-case fixture for unit tests

tests/
├── conftest.py         # Shared fixtures (sample data paths)
├── test_dataset.py     # Dataset loader tests
├── test_tools.py       # Tool definition tests
├── test_prompt.py      # Prompt assembly tests
└── test_scorers.py     # Tier 1 scorer tests

docs/       # Architecture, specs, implementation plan
logs/       # Inspect eval logs (gitignored)
configs/    # Per-host server configs
```
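As a rough idea of what `dataset.py` does, each NDJSON record becomes an Inspect `Sample`. The record field names below are hypothetical (see docs/test-data-format.md for the real schema); the `json_dataset` helper is Inspect's loader for JSON Lines files:

```python
# Sketch of NDJSON -> Inspect Samples conversion (record fields are hypothetical).
from inspect_ai.dataset import Sample, json_dataset

def record_to_sample(record: dict) -> Sample:
    return Sample(
        id=record["id"],
        input=record["command"],         # the voice command sent to the model
        target=str(record["expected"]),  # expected tool call, consumed by the scorer
    )

dataset = json_dataset(
    "sample_test_data/sample_test_cases.ndjson",
    sample_fields=record_to_sample,
)
```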
- LLM Server: llama.cpp (`llama-server`), OpenAI-compatible API (default: `http://localhost:8080/v1`)
- Models: GGUF format; Qwen 2.5 family recommended for tool-calling capability
- Inspect AI: `>=0.3.184,<0.4`
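Before running the benchmark, it can help to confirm the llama.cpp server actually speaks the OpenAI tools API. A quick sanity check using the OpenAI Python client; the base URL and model name are examples, so adjust them to your own server config:

```python
# Sanity-check the local llama-server's OpenAI-compatible tool-calling endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # llama.cpp ignores the key

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",  # whatever GGUF model llama-server is serving
    messages=[{"role": "user", "content": "Turn on the kitchen lights"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "HassTurnOn",
            "description": "Turn on a device or entity",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
            },
        },
    }],
)
print(resp.choices[0].message.tool_calls)
```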
| File | Contents |
|---|---|
| `docs/architecture.md` | System design and key decisions |
| `docs/ha-prompt-reference.md` | HA prompt and tool schema format |
| `docs/test-data-format.md` | NDJSON and YAML schemas |
| `docs/scoring-design.md` | Multi-dimensional scoring explained |
| `docs/implementation-plan-m1.md` | M1 step-by-step implementation plan |
| `docs/gotchas_learnings.md` | Inspect AI gotchas and implementation learnings |
| `docs/user-setup.md` | Server setup, env config, and running your first eval |
| `docs/failure-patterns.md` | Taxonomy of model failure modes observed across runs |
Apache-2.0