
HA Voice LLM Benchmark

A benchmarking framework that evaluates local LLMs for Home Assistant voice control using Inspect AI.

What It Does

Given a voice command and a smart home entity inventory, does the model call the right HA intent tool with the right arguments? The framework sends HA-formatted prompts to a local llama.cpp server and scores the model's tool-call responses against expected results — without going through the full HA pipeline.
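
For illustration, a single test case pairs a spoken command with the tool call the model is expected to make. The snippet below is a hypothetical example only; the field names are illustrative and the real schema is documented in docs/test-data-format.md.

import json

# Hypothetical NDJSON test case (illustrative field names; see docs/test-data-format.md)
case = json.loads(
    '{"id": "kitchen_light_on",'
    ' "input": "turn on the kitchen light",'
    ' "expected_tool": "HassTurnOn",'
    ' "expected_args": {"name": "kitchen light"}}'
)
print(case["expected_tool"])  # HassTurnOn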

Quick Start

# Install dependencies (--extra dev includes pytest, ruff, etc.)
uv sync --extra dev

# Run tests
pytest

# Lint
ruff check .

# Run a benchmark eval (see docs/user-setup.md for server setup)
cp docs/env.example .env
# Edit .env with your server address
uv run inspect eval src/ha_voice_bench/task.py --model openai/local --max-connections 1 --display plain
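
# Browse the results in a browser once the run finishes
# (assumes logs land in the default logs/ directory)
uv run inspect view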

Architecture

See docs/architecture.md for the full design.

Test Cases (NDJSON) + Inventory (YAML)
        ↓
  Dataset Loader       →  Inspect Samples
  Solver               →  HA system prompt + tool defs + generate()
  Tier 1 Scorer        →  multi-dimensional C/I/N scores
        ↓
  inspect view         →  browse results in browser
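
To make the Tier 1 scoring concrete, here is a minimal, single-dimension sketch written against the Inspect AI scorer API. It is not the repository's implementation (that lives in scorers/tool_call.py and scores multiple dimensions), and the "expected" metadata key is a hypothetical stand-in for wherever the expected call is stored.

from inspect_ai.scorer import CORRECT, INCORRECT, NOANSWER, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def tool_call_match():
    async def score(state: TaskState, target: Target) -> Score:
        # Tool calls emitted on the model's final assistant message
        calls = state.output.message.tool_calls or []
        if not calls:
            # NOANSWER is Inspect's "N" value
            return Score(value=NOANSWER, explanation="no tool call produced")

        # Hypothetical metadata layout: {"tool": "HassTurnOn", "arguments": {...}}
        expected = state.metadata.get("expected", {})
        call = calls[0]
        correct = (
            call.function == expected.get("tool")
            and call.arguments == expected.get("arguments")
        )
        # CORRECT/INCORRECT are Inspect's "C"/"I" values
        return Score(
            value=CORRECT if correct else INCORRECT,
            answer=f"{call.function}({call.arguments})",
        )

    return score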

Project Layout

src/ha_voice_bench/
├── dataset.py          # NDJSON loader → Inspect Samples
├── tools.py            # HA intent ToolDef objects (11 MVP / 31 full)
├── prompt.py           # System prompt assembly with entity inventory
├── solver.py           # Inspect Solver: wires prompt + tools + generate()
├── task.py             # Inspect Task entry point
└── scorers/
    └── tool_call.py    # Tier 1: tool-call correctness (multi-dimensional)

sample_test_data/
├── small.yaml               # 10-entity, 6-area inventory
├── small_test_cases.ndjson  # 25 test cases
├── sample_inventory.yaml    # 2-entity fixture for unit tests
└── sample_test_cases.ndjson # 5-case fixture for unit tests

tests/
├── conftest.py              # Shared fixtures (sample data paths)
├── test_dataset.py          # Dataset loader tests
├── test_tools.py            # Tool definition tests
├── test_prompt.py           # Prompt assembly tests
└── test_scorers.py          # Tier 1 scorer tests

docs/                        # Architecture, specs, implementation plan
logs/                        # Inspect eval logs (gitignored)
configs/                     # Per-host server configs

Infrastructure

  • LLM Server: llama.cpp (llama-server), OpenAI-compatible API (default: http://localhost:8080/v1)
  • Models: GGUF format, Qwen 2.5 family recommended for tool-calling capability
  • Inspect AI: >=0.3.184,<0.4
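
As a rough sketch (the model filename is illustrative, flags vary across llama.cpp builds, and docs/user-setup.md is the authoritative guide), a matching server can be started with something like:

# Serve a GGUF model via llama.cpp's OpenAI-compatible API on the default port
# (--jinja enables the chat-template handling that tool calling relies on)
llama-server -m Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --jinja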

Docs

  • docs/architecture.md: System design and key decisions
  • docs/ha-prompt-reference.md: HA prompt and tool schema format
  • docs/test-data-format.md: NDJSON and YAML schemas
  • docs/scoring-design.md: Multi-dimensional scoring explained
  • docs/implementation-plan-m1.md: M1 step-by-step implementation plan
  • docs/gotchas_learnings.md: Inspect AI gotchas and implementation learnings
  • docs/user-setup.md: Server setup, env config, and running your first eval
  • docs/failure-patterns.md: Taxonomy of model failure modes observed across runs

License

Apache-2.0

About

A Home Assistant voice-agent LLM benchmark setup for testing the performance and response quality of LLMs running on a remote llama.cpp server.
