Commit 2cd29d6

Update Readme
1 parent 6f0ac7b commit 2cd29d6

1 file changed

Lines changed: 19 additions & 19 deletions

README.md

@@ -8,26 +8,26 @@ Because ground-truth code is provided, KramaBench can evaluate both the **qualit
 ### Systems leaderboard
 Overall evaluation results by domain for KramaBench:
 
-| Variant | Model | Archaeology | Astronomy | Biomedical | Environment | Legal | Wildfire | **Overall** |
+| Variant | Model | Archaeology | Astronomy | Biomedical | Environment | Legal | Wildfire | **Overall** |
 | ----------------------------- | ---------------- | ----------: | --------: | ---------: | ----------: | ------: | -------: | ----------: |
-| **Naive** | GPT-o3 | 25 % | 1.73 % | 3.50 % | 1.35 % | 3.35 % | 24.87 % | **9.64 %** |
-| | GPT-4o | 0.00 % | 1.41 % | 1.98 % | 0.45 % | 1.46 % | 1.45 % | **1.62 %** |
-| | Claude-3.5 | 16.67 % | 1.62 % | 2.87 % | 1.17 % | 7.33 % | 13.63 % | **7.45 %** |
-| | Llama-3 Instruct | 0.00 % | 1.43 % | 1.70 % | 0.98 % | 1.37 % | 1.44 % | **1.19 %** |
-| | DeepSeek-R1 | 0.00 % | 1.50 % | 2.49 % | 2.60 % | 1.61 % | 6.46 % | **3.14 %** |
-| | Qwen-2 Coder | 0.00 % | 1.37 % | 2.02 % | 1.07 % | 1.44 % | 13.68 % | **3.72 %** |
-| **DS-GURU (simple)** | GPT-o3 | 25 % | 3.00 % | 8.63 % | 7.66 % | 19.15 % | 45.95 % | **20.80 %** |
-| | GPT-4o | 8.33 % | 1.40 % | 9.38 % | 2.60 % | 2.74 % | 19.39 % | **7.61 %** |
-| | Claude-3.5 | 0.00 % | 4.15 % | 2.15 % | 6.21 % | 6.68 % | 34.99 % | **10.85 %** |
-| | Llama-3 Instruct | 0.00 % | 1.42 % | 10.38 % | 0.98 % | 5.48 % | 9.81 % | **4.81 %** |
-| | DeepSeek-R1 | 0.00 % | 1.57 % | 3.39 % | 2.60 % | 8.30 % | 14.81 % | **6.35 %** |
-| | Qwen-2 Coder | 0.00 % | 1.36 % | 2.22 % | 12.59 % | 1.15 % | 16.48 % | **6.43 %** |
-| **DS-GURU (self-correcting)** | GPT-o3 | 25 % | 3.53 % | 8.95 % | 19.60 % | 13.89 % | 50.73 % | **22.08 %** |
-| | GPT-4o | 16.67 % | 2.76 % | 8.97 % | 2.60 % | 2.80 % | 17.18 % | **8.28 %** |
-| | Claude-3.5 | 16.67 % | 1.52 % | 1.96 % | 11.21 % | 7.01 % | 39.16 % | **14.35 %** |
-| | Llama-3 Instruct | 0.00 % | 1.35 % | 6.98 % | 0.93 % | 2.15 % | 14.49 % | **4.48 %** |
-| | DeepSeek-R1 | 8.33 % | 2.64 % | 2.87 % | 19.08 % | 8.39 % | 30.29 % | **6.34 %** |
-| | Qwen-2 Coder | 8.33 % | 2.40 % | 4.35 % | 12.64 % | 9.06 % | 16.48 % | **9.98 %** |
+| **Naive** | GPT-o3 | 25% | 1.73% | 3.50% | 1.35% | 3.35 % | 24.87 % | **9.64 %** |
+| | GPT-4o | 0.00% | 1.41% | 1.98% | 0.45% | 1.46 % | 1.45 % | **1.62 %** |
+| | Claude-3.5 | 16.67% | 1.62% | 2.87% | 1.17% | 7.33 % | 13.63 % | **7.45 %** |
+| | Llama-3 Instruct | 0.00% | 1.43% | 1.70% | 0.98% | 1.37 % | 1.44 % | **1.19 %** |
+| | DeepSeek-R1 | 0.00% | 1.50% | 2.49% | 2.60% | 1.61 % | 6.46 % | **3.14 %** |
+| | Qwen-2 Coder | 0.00% | 1.37% | 2.02% | 1.07% | 1.44 % | 13.68 % | **3.72 %** |
+| **DS-GURU (simple)** | GPT-o3 | 25% | 3.00% | 8.63% | 7.66% | 19.15 % | 45.95 % | **20.80 %** |
+| | GPT-4o | 8.33% | 1.40% | 9.38% | 2.60% | 2.74 % | 19.39 % | **7.61 %** |
+| | Claude-3.5 | 0.00% | 4.15% | 2.15% | 6.21% | 6.68 % | 34.99 % | **10.85 %** |
+| | Llama-3 Instruct | 0.00% | 1.42% | 10.38% | 0.98% | 5.48 % | 9.81 % | **4.81 %** |
+| | DeepSeek-R1 | 0.00% | 1.57% | 3.39% | 2.60% | 8.30 % | 14.81 % | **6.35 %** |
+| | Qwen-2 Coder | 0.00% | 1.36% | 2.22% | 12.59% | 1.15 % | 16.48 % | **6.43 %** |
+| **DS-GURU (self-correcting)** | GPT-o3 | 25% | 3.53% | 8.95% | 19.60% | 13.89 % | 50.73 % | **22.08 %** |
+| | GPT-4o | 16.67% | 2.76% | 8.97% | 2.60% | 2.80 % | 17.18 % | **8.28 %** |
+| | Claude-3.5 | 16.67% | 1.52% | 1.96% | 11.21% | 7.01 % | 39.16 % | **14.35 %** |
+| | Llama-3 Instruct | 0.00% | 1.35% | 6.98% | 0.93% | 2.15 % | 14.49 % | **4.48 %** |
+| | DeepSeek-R1 | 8.33% | 2.64% | 2.87% | 19.08% | 8.39 % | 30.29 % | **6.34 %** |
+| | Qwen-2 Coder | 8.33% | 2.40% | 4.35% | 12.64% | 9.06 % | 16.48 % | **9.98 %** |
 
 
 ## Breakdown of tasks per domain
