Skip to content

Add benchmark script facade#268

Draft
victorigualada wants to merge 11 commits intoallenporter:mainfrom
victorigualada:add-benchmark-script-facade
Draft

Add benchmark script facade#268
victorigualada wants to merge 11 commits intoallenporter:mainfrom
victorigualada:add-benchmark-script-facade

Conversation

@victorigualada
Copy link
Contributor

Description

This PR adds a benchmark script to act as a single entrypoint for collecting outputs, evaluating them and building the leaderboards.

@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 0.36364% with 274 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.35%. Comparing base (7ab8fc2) to head (a88664a).

Files with missing lines Patch % Lines
home_assistant_datasets/tool/benchmark/_common.py 0.00% 180 Missing ⚠️
home_assistant_datasets/tool/benchmark/collect.py 0.00% 27 Missing ⚠️
home_assistant_datasets/tool/benchmark/eval.py 0.00% 26 Missing ⚠️
home_assistant_datasets/tool/benchmark/all.py 0.00% 24 Missing ⚠️
...e_assistant_datasets/tool/benchmark/leaderboard.py 0.00% 13 Missing ⚠️
home_assistant_datasets/tool/benchmark/__init__.py 0.00% 3 Missing ⚠️
home_assistant_datasets/tool/__main__.py 0.00% 1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (7ab8fc2) and HEAD (a88664a). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (7ab8fc2) HEAD (a88664a)
2 1
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
- Coverage   56.88%   50.35%   -6.53%     
==========================================
  Files          51       57       +6     
  Lines        2099     2373     +274     
==========================================
+ Hits         1194     1195       +1     
- Misses        905     1178     +273     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Owner

@allenporter allenporter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good idea to make these scripts simpler!

script/benchmark Outdated
REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent

# Add repo root to path so we can import project modules
sys.path.insert(0, str(REPO_ROOT))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this. Seems like just running in a virtual environment should be able to solve this. I can include a script/setup if needed.

script/benchmark Outdated
Comment on lines +69 to +72
_ASSIST_FAMILY_SET: set[str] = set(ASSIST_FAMILY_DATASETS)
for _ds in ASSIST_FAMILY_DATASETS:
for _lang in LANGUAGES:
_ASSIST_FAMILY_SET.add(f"{_ds}-{_lang}")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create this once without updating? e.g. something like this

Suggested change
_ASSIST_FAMILY_SET: set[str] = set(ASSIST_FAMILY_DATASETS)
for _ds in ASSIST_FAMILY_DATASETS:
for _lang in LANGUAGES:
_ASSIST_FAMILY_SET.add(f"{_ds}-{_lang}")
_ASSIST_FAMILY_SET: set[str] = set({ASSIST_FAMILY_DATASETS}) | set({
f"{_ds}-{_lang}"
for _ds in ASSIST_FAMILY_DATASETS:
for _lang in LANGUAGES
})

script/benchmark Outdated
def get_ha_version() -> str:
"""Get the installed Home Assistant version."""
try:
from importlib.metadata import version
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be at the top of the file?

script/benchmark Outdated
raise SystemExit(1) # unreachable, for type checker


def build_env(synthetic_home_dir: pathlib.Path) -> dict[str, str]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I might rather this setup be some other step/script to prepare the environment? It feels weird to me to much with PYTHONPATH in the middle of this script

script/benchmark Outdated

tasks = _build_collect_tasks(datasets, model, dry_run=dry_run)

if parallel and len(tasks) > 1:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if this didnt fork two separate paths that run all the same setup code (here and below). Can we have one path with a maximum number that run in parallel at once so we can have the same code do the base case of 1 at a time and the parallel case?

script/benchmark Outdated
)


def main() -> int:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised to see this create a new CLI instead of improving the one that exists?

It is setup already with different subcommands and modules so you can put each subcommand in a separate package rather than all in one really large file.

Have you considered just adding a collect and eval subcommand in the existing tool?

(fine to keep openrouter script separate)

Move benchmark logic from standalone script/benchmark into
home_assistant_datasets.tool.benchmark package, following the
existing subcommand pattern (create_arguments + run).

The script/benchmark file becomes a thin bash wrapper that
delegates to: python -m home_assistant_datasets.tool benchmark
Instead of modifying PYTHONPATH at runtime to make the synthetic home
custom component importable, have script/bootstrap create a .pth file
in the venv's site-packages pointing to the synthetic home checkout.

This removes build_env(), find_synthetic_home(), and the
--synthetic-home-dir flag from the benchmark CLI.
The .pth file linking the synthetic home custom component belongs in
.devcontainer/setup (environment creation) rather than script/bootstrap
(dependency installation).
Add run_tasks() that handles both parallel and sequential modes,
removing duplicated if/else branches from collect.py and eval.py.
@victorigualada victorigualada force-pushed the add-benchmark-script-facade branch from 7a33e8b to 7be8ed1 Compare March 9, 2026 10:59
Add missing NoReturn import in script/openrouter_fetch_model.py and
align pre-commit codespell excludes with CI config (skip datasets/
and .ipynb files).
Copy link
Owner

@allenporter allenporter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice work, thank you so much for making this huge usability improvement. One more round of nits/style things.

for candidate in "${SYNTHETIC_HOME_CANDIDATES[@]}"; do
if [ -d "$candidate/custom_components/synthetic_home" ]; then
SITE_PACKAGES=$(python3 -c "import sysconfig; print(sysconfig.get_path('purelib'))")
PTH_FILE="${SITE_PACKAGES}/synthetic-home-component.pth"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never seen this before. Is this legit?

Our current instructions say to set PYTHONPATH which seems more the norm.

@@ -0,0 +1,23 @@
"""Benchmark subcommand.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to explain what it is here? compared to the other tools

def get_ha_version() -> str:
"""Get the installed Home Assistant version."""
try:
from importlib.metadata import version
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports should be at the top of the file

def _read_progress(log_file: pathlib.Path) -> str:
"""Read the last pytest progress percentage from a log file."""
try:
with open(log_file, "r") as f:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you have a Path here this can use log_file.open(...) here and below.

Comment on lines +87 to +88
matches = re.findall(r"\[\s*(\d+)%\]", tail)
if matches:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
matches = re.findall(r"\[\s*(\d+)%\]", tail)
if matches:
if matches := re.findall(r"\[\s*(\d+)%\]", tail)

"""Check that a dataset directory exists."""
ds_dir = DATASETS_DIR / dataset_name
if not ds_dir.is_dir():
die(f"Dataset directory not found: {ds_dir}")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal but i'd rather leverage exceptions. While it's a CLI, it's still a library.

if accept_rc is None:
accept_rc = {0}

if parallel and len(tasks) > 1:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have two different ways to start the tests, parallel and non-parallel. I can provide more specific suggestions if needed.

"""
if accept_rc is None:
accept_rc = {0}
if dry_run:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say share more code between the two paths for dry run or not. I can give specific suggestions if that would be helpful.

time.sleep(1)

# Final summary (non-TTY or to ensure it's visible)
if not is_tty:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: you could have some status writer like thing to abstract away the difference betwen tty and not tty e.g.

with two different implementations.

"""Home Assistant Datasets command line tools.

```
usage: home-assistant-datasets [-h] [--debug] {leaderboard,convert} ...
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refresh this too if you don't mind with running with --help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants