Commit e05175b

chore: update eval
1 parent 441654a commit e05175b

1 file changed: +11 -11 lines changed

README.md

Lines changed: 11 additions & 11 deletions
@@ -298,14 +298,14 @@ Plain `gemini-2.0-flash` can be run by setting `tokenBudget` to zero, skipping t
 
 It should not be surprised that plain `gemini-2.0-flash` has a 0% pass rate, as I intentionally filtered out the questions that LLMs can answer.
 
-| Metric | gemini-2.0-flash | #5e80ed4 | #3deee87 (latest) |
-|--------|------------------|-------------------------------------------------|--------|
-| Pass Rate | 0% | 60% | 75% |
-| Average Steps | 1 | 5 |5 |
-| Maximum Steps | 1 | 13 |13 |
-| Minimum Steps | 1 | 2 |1 |
-| Median Steps | 1 | 3 |3 |
-| Average Tokens | 428 | 59,408 |32,392 |
-| Median Tokens | 434 | 16,001 |9,172 |
-| Maximum Tokens | 463 | 347,222 |202,055 |
-| Minimum Tokens | 374 | 5,594 |3,236 |
+| Metric | gemini-2.0-flash | #18f0312 |
+|--------|------------------|-----------|
+| Pass Rate | 0% | 75% |
+| Average Steps | 1 | 4 |
+| Maximum Steps | 1 | 14 |
+| Minimum Steps | 1 | 0 |
+| Median Steps | 1 | 3 |
+| Average Tokens | 428 | 71,285 |
+| Median Tokens | 434 | 22,771 |
+| Maximum Tokens | 463 | 536,148 |
+| Minimum Tokens | 374 | 0 |
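
The rows in the table above are ordinary summary statistics over per-question evaluation runs. As a rough illustration only, here is a minimal sketch of how such a table could be derived. The record fields (`passed`, `steps`, `tokens`), the `summarize` helper, and the sample data are all hypothetical and not taken from the repository; the project's actual eval script may compute or round these values differently.

```python
from statistics import mean, median


def summarize(runs):
    """Summarize eval runs.

    `runs` is a list of dicts with hypothetical keys:
    'passed' (bool), 'steps' (int), 'tokens' (int).
    """
    if not runs:
        raise ValueError("no runs to summarize")

    steps = [r["steps"] for r in runs]
    tokens = [r["tokens"] for r in runs]

    return {
        # Fraction of questions answered correctly.
        "pass_rate": sum(1 for r in runs if r["passed"]) / len(runs),
        "avg_steps": mean(steps),
        "max_steps": max(steps),
        "min_steps": min(steps),
        "median_steps": median(steps),
        "avg_tokens": mean(tokens),
        "median_tokens": median(tokens),
        "max_tokens": max(tokens),
        "min_tokens": min(tokens),
    }


# Made-up sample data, for illustration only.
sample_runs = [
    {"passed": True, "steps": 2, "tokens": 5000},
    {"passed": False, "steps": 4, "tokens": 12000},
    {"passed": True, "steps": 3, "tokens": 7000},
    {"passed": True, "steps": 7, "tokens": 40000},
]

summary = summarize(sample_runs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Reporting the median alongside the average is useful here because token usage tends to be skewed by a few long runs, which is consistent with the large gap between the average and median token counts in the table.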
