Add multilingual benchmarking by victorigualada · Pull Request #267 · allenporter/home-assistant-datasets

victorigualada · 2026-03-02T10:59:05Z

Description

This PR adds multilingual benchmark support.
A few of the decisions here might be questionable, but I think it's a simple first approach.
It adds a first multilingual benchmark for devstral-2512.

I decided to create assist-<language> directories for each dataset language so we don't change the existing structure. We can have a follow-up PR that will create a tree structure:

|- assist
  |- en
  |- es
  |- fr
|- assist-mini
  |- en
  |- es
  |- fr
...
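A minimal sketch of how the flat `assist-<language>` directory names from this PR could map onto the proposed tree layout. The helper name `tree_path`, the `DATASETS` tuple, the `<dataset>-<language>` naming for datasets other than assist, and treating an unsuffixed directory as English are all assumptions for illustration, not code from this PR.

```python
from pathlib import PurePosixPath

# Dataset names come from the PR description; the tuple itself is illustrative.
DATASETS = ("assist", "assist-mini", "questions", "automations")


def tree_path(flat_name: str, default_language: str = "en") -> PurePosixPath:
    """Map a flat directory name such as 'assist-de' to 'assist/de'.

    A name without a language suffix is assumed to be the existing
    English dataset and is mapped to '<dataset>/<default_language>'.
    """
    # Try longer dataset names first so 'assist-mini' is not
    # misread as dataset 'assist' with language 'mini'.
    for dataset in sorted(DATASETS, key=len, reverse=True):
        if flat_name == dataset:
            return PurePosixPath(dataset) / default_language
        prefix = dataset + "-"
        if flat_name.startswith(prefix):
            return PurePosixPath(dataset) / flat_name[len(prefix):]
    raise ValueError(f"unknown dataset directory: {flat_name}")
```

Checking the longest dataset name first is the one design choice that matters here, since `assist-mini` would otherwise be split at the wrong hyphen.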

The current leaderboard stays as-is, with English-only results. Then, for each dataset (assist, assist-mini, questions and automations), a new section is created with a table that has models as rows and languages as columns. Again, we can think about how to repurpose this, but as groundwork I think it's good enough.
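The per-dataset table described above (models as rows, languages as columns) could be built roughly as follows. This is a sketch under assumptions: the function names `pivot_scores` and `render_markdown`, the `(model, language, score)` input shape, and the `-` placeholder for missing runs are hypothetical and are not taken from the leaderboard scripts in this PR.

```python
from collections import defaultdict


def pivot_scores(rows):
    """Group (model, language, score) tuples into {model: {language: score}}.

    Returns the sorted list of languages seen and the nested table.
    """
    table = defaultdict(dict)
    languages = set()
    for model, language, score in rows:
        table[model][language] = score
        languages.add(language)
    return sorted(languages), dict(table)


def render_markdown(languages, table):
    """Render the pivoted table as Markdown, one row per model."""
    header = "| Model | " + " | ".join(languages) + " |"
    divider = "|" + "---|" * (len(languages) + 1)
    lines = [header, divider]
    for model in sorted(table):
        # A model that was not benchmarked in a language gets a '-' cell.
        cells = [
            f"{table[model][lang]:.1f}" if lang in table[model] else "-"
            for lang in languages
        ]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Calling `render_markdown(*pivot_scores(rows))` once per dataset would yield one section per dataset, leaving the existing English leaderboard untouched.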

Let me know what you think and whether we can improve it.

allenporter

Generally seems fine, though I saw that there are two CI failures:
(1) there is a codespell error
(2) I think some snapshots are failing

allenporter · 2026-03-03T06:12:44Z

datasets/assist-de/dom1-pl/lights.yaml

+    expect_changes:
+      light.kitchen_light:
+        state: "on"
+    # Helligkeitszustand ignorieren, Integration stellt ihn wieder her


Not sure if it's super useful to translate the comments too :)

.github/ISSUE_TEMPLATE/benchmark-run.yml

allenporter · 2026-03-06T03:16:30Z

There are still some test failures. I think there may be something wrong with the linting configuration in the pre-commit that caused this? May need to manually fix.

victorigualada · 2026-03-09T08:08:33Z

> There are still some test failures. I think there may be something wrong with the linting configuration in the pre-commit that caused this? May need to manually fix.

Yes, that's totally the issue, and it drove me mad last week. The different linting approaches in pre-commit and CI are causing this. I'm wondering whether they should be aligned so this doesn't happen again in the future.

victorigualada added 12 commits January 30, 2026 13:40

openrouter benchmark action

4abf721

lint fix

d108821

fix

a550b32

merge upstream

bef415b

add multilingual logic

0eae050

add required openrouter python package

36f6d18

add translations

ce57535

add translated datasets

3a355ba

translate automation datasets

a44811d

remove bad workflow

e9cb398

update scripts for automation datasets

77416c7

add benchmark for fr, de and nl

b005f3f

allenporter reviewed Mar 3, 2026


victorigualada added 6 commits March 3, 2026 21:32

fixes

5bfed7a

some more fixes

d9d5a98

fix CI

cf046ef

untranslate comments

d87fe97

update snapshot script so it doesn't blow on snapshot update

7b5fc56

undo

715170d

allenporter approved these changes Mar 6, 2026


victorigualada wants to merge 18 commits into allenporter:main from victorigualada:add-multilingual-benchmarking


2 participants
