Benchmark: Improved stats, now also printing stats for each individual test #3780

Michal-Mikolas · 2025-04-11T13:43:05Z

To give more detailed information about a benchmark run and to make benchmark debugging easier, I've added a feature that prints more detailed stats after the benchmark finishes (or when running benchmark.py --stats).

The change is backward compatible with existing results, so there is no need to re-run old benchmarks to see the new details.

What it looks like

...

That's it. This change gives us a better understanding of a model's strengths and weaknesses across different languages, and it provides data for investigating why one user got a different score than another user for the same model.


CLAassistant · 2025-04-11T13:43:13Z

All committers have signed the CLA.

Mushoz · 2025-04-11T14:19:56Z

Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

Michal-Mikolas · 2025-04-11T14:39:25Z

> Really useful change! In my humble opinion, it would be awesome if it could also show stats on a per-language basis. In my experience, the programming ability can vary wildly from language to language. If I am looking for a strong model for coding in Python, then looking at the Go or Java tests is pretty useless.

I didn't test it, but there is this parameter in the benchmark source code:

        "--stats-languages",
        help="Only include stats for specific languages (comma separated)",
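
For illustration, here is a minimal sketch of how a comma-separated language filter like this could work. It is not taken from benchmark.py: the helper names `parse_languages` and `filter_by_language`, and the `language` key on each result, are assumptions made up for this example.

```python
def parse_languages(value):
    """Split a comma-separated option value into a set of lowercase names.

    Returns None when no filter was requested (hypothetical convention).
    """
    if not value:
        return None
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def filter_by_language(results, languages):
    """Keep only results whose 'language' field is in the requested set."""
    if languages is None:
        return list(results)
    return [r for r in results if r["language"].lower() in languages]


# Hypothetical result records; the real benchmark results may look different.
results = [
    {"testcase": "two-fer", "language": "python", "passed": True},
    {"testcase": "bowling", "language": "go", "passed": False},
    {"testcase": "anagram", "language": "rust", "passed": True},
]

selected = filter_by_language(results, parse_languages("python, Go"))
for r in selected:
    print(r["language"], r["testcase"], r["passed"])
```

Normalizing to lowercase and stripping whitespace means `--stats-languages "python, Go"` and `--stats-languages python,go` would select the same tests under this sketch.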

ziemkowski · 2025-04-11T22:43:27Z

@Michal-Mikolas I think the ask, which would be neat to see, is a per-language breakdown at the end of that list of individual tests, perhaps like this:

---- breakdown ----  pass/fail  timeouts  syn_err  user_asks  malformed  exhausted  error  lazy  ind_err
     cpp/...         6/14       0         0        21         2          0          3      0     0
     go/...          3/20       0         0        15         3          0          2      0     0
     ...
     rust/...        18/3       0         0        12         5          0          2      0     0

Maybe with pass % too?
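
A rough sketch of how such a per-language breakdown with a pass percentage could be computed. This is only an illustration of the idea, not the code in this PR: `breakdown_by_language`, `format_breakdown`, and the `language`/`passed` fields are hypothetical names, and only the pass/fail column is covered.

```python
from collections import defaultdict


def breakdown_by_language(results):
    """Aggregate pass/fail counts per language and compute a pass percentage."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for r in results:
        key = "passed" if r["passed"] else "failed"
        totals[r["language"]][key] += 1

    rows = []
    for lang in sorted(totals):
        passed = totals[lang]["passed"]
        failed = totals[lang]["failed"]
        # Every language in totals has at least one result, so no division by zero.
        pct = 100.0 * passed / (passed + failed)
        rows.append((lang, passed, failed, pct))
    return rows


def format_breakdown(rows):
    """Render the rows as a fixed-width text table."""
    lines = [f"{'language':<12} {'pass/fail':>9} {'pass %':>7}"]
    for lang, passed, failed, pct in rows:
        ratio = f"{passed}/{failed}"
        lines.append(f"{lang + '/...':<12} {ratio:>9} {pct:>6.1f}%")
    return "\n".join(lines)


# Hypothetical result records, one per test case.
results = [
    {"language": "python", "passed": True},
    {"language": "python", "passed": True},
    {"language": "cpp", "passed": True},
    {"language": "cpp", "passed": False},
]

print(format_breakdown(breakdown_by_language(results)))
```

The other columns in the table above (timeouts, syn_err, and so on) could be summed per language in the same way, assuming each result carries those counters.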


Michal-Mikolas · 2025-04-18T22:35:31Z

@ziemkowski @Mushoz Ok, done.

Michal-Mikolas · 2025-05-14T08:36:15Z

Any feedback @paul-gauthier ?

Benchmark: Improved stats, now also printing stats for each individual test above the benchmark summary.

6fe4e04

Benchmark: Improved stats, now also printing stats for each language above the benchmark summary.

7928820
