Add multilingual benchmarking #267
allenporter left a comment
Generally seems fine, though I saw that there are two CI failures:
(1) there is a codespell error
(2) I think some snapshots are failing
expect_changes:
  light.kitchen_light:
    state: "on"
    # Helligkeitszustand ignorieren, Integration stellt ihn wieder her
Not sure if it's super useful to translate the comments too :)
There are still some test failures. I think there may be something wrong with the linting configuration in pre-commit that caused this. You may need to fix it manually.
Yes, that's totally the issue, and it drove me mad last week. The different linting approaches in pre-commit and CI are causing this. I'm wondering whether they should be aligned so this doesn't happen again in the future.
Description
This PR adds multilingual benchmark support.
There are a few decisions that might be questionable but I think it's a simple first approach.
It adds a first multilingual benchmark for devstral-2512.
assist-<language> directories for each dataset language so we don't change the existing structure. We can have a follow-up PR that will create a tree structure:
assist, assist-mini, questions and automations) a new section is created with a table having models as rows and languages as columns. Again, we can think about how to repurpose this, but as groundwork I think it's good enough.
Let me know what you think and if we can improve it.
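To make the report layout concrete, here is a minimal sketch of how a models-by-languages table could be assembled from per-language directory names. It is not the code in this PR: the function names, the shape of the `scores` mapping, and the `n/a` placeholder are all assumptions for illustration. Because the existing assist-mini directory also starts with assist-, the sketch matches suffixes against an explicit set of languages rather than relying on the prefix alone.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


def language_of(dataset_dir: str, languages: Set[str], base: str = "assist") -> Optional[str]:
    """Return the language of an '<base>-<language>' directory, or None.

    Hypothetical helper: only suffixes listed in `languages` count, so a
    directory such as 'assist-mini' is not mistaken for a language.
    """
    prefix = f"{base}-"
    if not dataset_dir.startswith(prefix):
        return None
    suffix = dataset_dir[len(prefix):]
    return suffix if suffix in languages else None


def render_language_table(
    scores: Dict[Tuple[str, str], float],
    languages: Set[str],
    base: str = "assist",
) -> str:
    """Render a markdown table with models as rows and languages as columns.

    `scores` maps (model, dataset_dir) to a percentage; this input shape is
    an assumption, not the format the benchmark actually produces.
    """
    columns = sorted(languages)
    by_model: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (model, dataset_dir), score in scores.items():
        lang = language_of(dataset_dir, languages, base)
        if lang is not None:
            by_model[model][lang] = score

    lines = [
        "| Model | " + " | ".join(columns) + " |",
        "|" + "---|" * (len(columns) + 1),
    ]
    for model in sorted(by_model):
        # Missing model/language combinations are shown as 'n/a'.
        cells = [
            f"{by_model[model][c]:.1f}%" if c in by_model[model] else "n/a"
            for c in columns
        ]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping the language list explicit also means a follow-up tree structure would only need a different `language_of`, while the table rendering stays the same.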