Commit 1ef287a

Updated website to CRC scores and added pipeline logic
1 parent 6c64cfb commit 1ef287a

10 files changed

Lines changed: 1280 additions & 52 deletions

.github/copilot-instructions.md

Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# COPILOT EDITS OPERATIONAL GUIDELINES
2+
3+
### GENERAL INSTRUCTIONS
4+
- Use snake_case for variable and function names.
5+
- Use CamelCase for class names. Follow PEP 8 style guidelines.
6+
- Include type hints for function parameters and return types.
7+
- Write docstrings for all public modules, classes, functions, and methods.
8+
- Prefer using NumPy for numerical computations. Use vectorized operations instead of loops where possible.
9+
- Import NumPy using the alias 'np'. Include comments explaining complex mathematical operations.
10+
- Do **NOT** generate needless code or boilerplate.
11+
- Do **NOT** generate functions that are used only once. Instead, inline the code if it is not reused.
12+
- Do **NOT**, for ANY REASON, generate nested/inner functions, that means functions defined inside other functions. Always define functions either at the module level or as methods of a class.
13+
14+
### MANDATORY PLANNING PHASE
15+
When working with large files (>300 lines) or complex changes:
16+
1. ALWAYS start by creating a detailed plan BEFORE making any edits
17+
2. Your plan MUST include:
18+
- All functions/sections that need modification
19+
- The order in which changes should be applied
20+
- Dependencies between changes
21+
- Estimated number of separate edits required
22+
23+
3. Format your plan as:
24+
25+
## PROPOSED EDIT PLAN
26+
Working with: [filename]
27+
Total planned edits: [number]
28+
29+
### MAKING EDITS
30+
- Focus on one conceptual change at a time
31+
- Show clear "before" and "after" snippets when proposing changes
32+
- Include concise explanations of what changed and why
33+
- Always check if the edit maintains the project's coding style
34+
35+
### Edit sequence:
36+
1. [First specific change] - Purpose: [why]
37+
2. [Second specific change] - Purpose: [why]
38+
3. Do you approve this plan? I'll proceed with Edit [number] after your confirmation.
39+
4. WAIT for explicit user confirmation before making ANY edits when user ok edit [number]
40+
41+
### EXECUTION PHASE
42+
- After each individual edit, clearly indicate progress:
43+
"✅ Completed edit [#] of [total]. Ready for next edit?"
44+
- If you discover additional needed changes during editing:
45+
- STOP and update the plan
46+
- Get approval before continuing
47+
48+
### REFACTORING GUIDANCE
49+
When refactoring large files:
50+
- Break work into logical, independently functional chunks
51+
- Ensure each intermediate state maintains functionality
52+
- Consider temporary duplication as a valid interim step
53+
- Always indicate the refactoring pattern being applied
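As an illustration of the general instructions in this file, here is a minimal Python sketch of code written in that style: snake_case functions, a CamelCase class, type hints, docstrings, and only module-level functions that are each used more than once. The names `DurationEstimate`, `parse_clock_seconds`, and `parse_duration_estimate` are hypothetical and do not come from the repository. The sample cell is copied from the `Overall Benchmark Time` column added later in this commit. NumPy is left out only to keep the sketch dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationEstimate:
    """Mean and spread of a benchmark run time, both in seconds."""

    mean_seconds: int
    spread_seconds: int


def parse_clock_seconds(clock_text: str) -> int:
    """Convert an ``HH:MM:SS`` string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in clock_text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_duration_estimate(cell_text: str) -> DurationEstimate:
    """Parse a cell such as ``01:39:18 ± 00:05:32`` into seconds."""
    mean_text, spread_text = cell_text.split("±")
    return DurationEstimate(
        mean_seconds=parse_clock_seconds(mean_text),
        spread_seconds=parse_clock_seconds(spread_text),
    )


# Sample value taken from the first added row of data/benchmark_oracle.csv.
estimate = parse_duration_estimate("01:39:18 ± 00:05:32")
print(estimate.mean_seconds, estimate.spread_seconds)
```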

data/benchmark_oracle.csv

Lines changed: 22 additions & 23 deletions
@@ -1,23 +1,22 @@
1-
System,Models,Archaeology,Astronomy,Biomedical,Environment,Legal,Wildfire,Overall
2-
DS-Guru no-context,GPT-o3,17.83%,12.93%,19.48%,19.17%,9.94%,16.13%,14.93%
3-
DS-Guru no-context,GPT-4o,15.09%,9.15%,12.16%,11.26%,8.88%,7.15%,10.05%
4-
DS-Guru no-context,Claude-3.5,16.52%,10.63%,9.87%,12.51%,9.80%,0.00%,11.63%
5-
DS-Guru no-context,Llama3-3Instruct,14.44%,12.17%,10.24%,10.35%,8.20%,8.06%,9.93%
6-
DS-Guru no-context,DeepSeek-R1,18.79%,8.53%,8.25%,12.71%,11.39%,8.90%,11.56%
7-
DS-Guru no-context,Qwen2-5Coder,10.24%,6.74%,7.71%,7.14%,1.52%,4.53%,6.62%
8-
DS-Guru one-shot,GPT-o3,23.90%,21.14%,18.29%,28.48%,18.49%,25.08%,22.85%
9-
DS-Guru one-shot,GPT-4o,14.26%,10.58%,9.38%,20.37%,10.96%,19.21%,14.86%
10-
DS-Guru one-shot,Claude-3.5,17.07%,10.24%,9.44%,22.27%,11.47%,17.93%,15.48%
11-
DS-Guru one-shot,Llama3-3Instruct,8.92%,10.44%,4.45%,12.44%,8.64%,12.90%,10.23%
12-
DS-Guru one-shot,DeepSeek-R1,16.78%,15.23%,8.06%,14.23%,11.89%,9.65%,12.64%
13-
DS-Guru one-shot,Qwen2-5Coder,9.72%,11.57%,5.37%,15.13%,8.96%,13.22%,11.26%
14-
DS-Guru few-shot,GPT-o3,27.78%,23.22%,19.56%,33.67%,35.14%,32.53%,31.92%
15-
DS-Guru few-shot,GPT-4o,18.97%,19.29%,12.51%,27.14%,25.23%,26.07%,23.60%
16-
DS-Guru few-shot,Claude-3.5,16.24%,14.02%,14.80%,33.83%,26.36%,25.02%,24.22%
17-
DS-Guru few-shot,Llama3-3Instruct,15.57%,13.85%,11.63%,19.37%,15.57%,21.56%,17.11%
18-
DS-Guru few-shot,DeepSeek-R1,22.29%,10.79%,9.65%,15.45%,11.75%,10.76%,13.37%
19-
DS-Guru few-shot,Qwen2-5Coder,11.83%,14.91%,7.51%,18.39%,13.70%,18.51%,15.15%
20-
smolagents DR,GPT-o3,41.67%,25%,44.44%,45%,44.83%,47.62%,44.45%
21-
smolagents DR,GPT-4o,25%,25%,22.22%,20%,56.67%,38.1%,39%
22-
smolagents DR,Claude-3.5,16.67%,25%,33.33%,25%,66.66%,66.66%,47%
23-
smolagents DR,Claude-3-7,41.67%,33.33%,77.78%,80%,63.33%,71.43%,59%
1+
System,Models,Archaeology,Astronomy,Biomedical,Environment,Legal,Wildfire,Overall,Overall Benchmark Time
2+
DS-Guru no-context,GPT-o3,11.11% ± 3.93%,10.00% ± 8.16%,11.11% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,10.17% ± 0.92%,5.36% ± 1.20%,01:39:18 ± 00:05:32
3+
DS-Guru no-context,GPT-4o,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,01:40:38 ± 00:11:55
4+
DS-Guru no-context,Llama3-3Instruct,5.56% ± 3.93%,0.00% ± 0.00%,11.11% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,1.59% ± 2.24%,1.96% ± 0.80%,01:43:49 ± 00:04:19
5+
DS-Guru no-context,DeepSeek-R1,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,01:22:53 ± 00:12:52
6+
DS-Guru no-context,Claude-3-7,4.17% ± 4.17%,10.00% ± 0.00%,0.00% ± 0.00%,1.58% ± 1.58%,1.67% ± 1.67%,8.96% ± 0.57%,4.11% ± 0.79%,01:08:14 ± 00:03:03
7+
DS-Guru one-shot,GPT-o3,8.33% ± 6.80%,16.67% ± 4.71%,7.51% ± 5.31%,23.89% ± 6.71%,15.56% ± 1.57%,42.80% ± 3.48%,21.35% ± 0.84%,01:53:26 ± 00:04:26
8+
DS-Guru one-shot,GPT-4o,11.11% ± 7.86%,13.33% ± 4.71%,5.19% ± 7.33%,12.33% ± 3.30%,6.67% ± 0.00%,19.85% ± 3.61%,11.54% ± 1.47%,02:54:42 ± 00:35:58
9+
DS-Guru one-shot,Llama3-3Instruct,0.00% ± 0.00%,6.67% ± 4.71%,7.41% ± 5.24%,0.00% ± 0.00%,0.00% ± 0.00%,26.37% ± 5.02%,6.74% ± 1.49%,03:07:22 ± 00:18:50
10+
DS-Guru one-shot,DeepSeek-R1,2.78% ± 3.93%,6.67% ± 4.71%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.98% ± 0.80%,01:38:48 ± 00:18:35
11+
DS-Guru one-shot,Claude-3-7,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,6.67% ± 1.67%,3.33% ± 3.33%,4.76% ± 0.00%,3.27% ± 1.31%,01:09:43 ± 00:00:11
12+
DS-Guru few-shot,GPT-o3,16.67% ± 0.00%,26.67% ± 4.71%,7.41% ± 5.24%,50.11% ± 4.09%,61.03% ± 5.78%,52.02% ± 3.79%,43.71% ± 1.94%,06:54:04 ± 00:44:05
13+
DS-Guru few-shot,GPT-4o,13.89% ± 3.93%,20.00% ± 0.00%,3.70% ± 5.24%,18.44% ± 6.00%,37.78% ± 4.16%,36.67% ± 3.53%,26.20% ± 2.34%,05:43:33 ± 00:43:47
14+
DS-Guru few-shot,Llama3-3Instruct,13.89% ± 3.93%,10.00% ± 0.00%,11.11% ± 0.00%,0.00% ± 0.00%,6.67% ± 0.00%,29.72% ± 1.27%,11.67% ± 0.65%,03:32:53 ± 00:08:50
15+
DS-Guru few-shot,DeepSeek-R1,5.56% ± 3.93%,6.67% ± 4.71%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,1.31% ± 0.46%,01:49:34 ± 00:08:24
16+
DS-Guru few-shot,Claude-3-7,20.83% ± 4.17%,25.00% ± 15.00%,0.00% ± 0.00%,9.17% ± 9.17%,31.67% ± 1.67%,18.17% ± 5.20%,19.75% ± 3.36%,02:01:40 ± 00:20:03
17+
smolagents DR,GPT-o3,27.78% ± 7.86%,26.67% ± 4.71%,37.04% ± 5.24%,24.00% ± 0.00%,17.78% ± 5.67%,32.89% ± 10.31%,25.86% ± 4.57%,06:22:02 ± 01:32:03
18+
smolagents DR,Claude-3-7,45.83% ± 12.50%,65.00% ± 5.00%,58.33% ± 2.78%,46.83% ± 1.50%,65.00% ± 1.67%,75.08% ± 3.78%,60.67% ± 1.23%,06:54:31 ± 00:21:03
19+
smolagents Reflexion,GPT-o3,30.56% ± 3.93%,40.00% ± 0.00%,44.44% ± 0.00%,31.44% ± 2.20%,22.22% ± 8.31%,39.78% ± 6.14%,32.33% ± 4.44%,08:11:30 ± 03:32:21
20+
smolagents Reflexion,Claude-3-7,37.50% ± 4.17%,50.00% ± 0.00%,83.33% ± 5.56%,62.67% ± 4.33%,70.00% ± 10.00%,64.45% ± 1.39%,62.81% ± 3.10%,12:05:14 ± 00:29:43
21+
smolagents PDT,GPT-o3,8.33% ± 6.80%,0.00% ± 0.00%,0.00% ± 0.00%,3.26% ± 1.30%,8.62% ± 3.88%,15.86% ± 5.53%,7.69% ± 2.69%,12:07:07 ± 01:49:02
22+
smolagents PDT,Claude-3-7,16.67% ± 0.00%,10.00% ± 0.00%,11.11% ± 0.00%,11.50% ± 3.50%,9.24% ± 0.35%,25.50% ± 5.46%,14.38% ± 2.15%,12:06:38 ± 00:17:29
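The new rows store each score as a `mean ± spread` string instead of a bare percentage, so any consumer of this file has to split the cell before it can compare numbers. The following Python sketch shows one way to do that with the standard `csv` module. The helper name `parse_score_cell` is hypothetical, and the commit does not show how the website itself reads these cells. The header and data row are copied from the added side of the diff above.

```python
import csv
import io

# Header and first added row of data/benchmark_oracle.csv, copied from the diff.
CSV_TEXT = (
    "System,Models,Archaeology,Astronomy,Biomedical,Environment,Legal,"
    "Wildfire,Overall,Overall Benchmark Time\n"
    "DS-Guru no-context,GPT-o3,11.11% ± 3.93%,10.00% ± 8.16%,11.11% ± 0.00%,"
    "0.00% ± 0.00%,0.00% ± 0.00%,10.17% ± 0.92%,5.36% ± 1.20%,"
    "01:39:18 ± 00:05:32\n"
)


def parse_score_cell(cell_text: str) -> tuple[float, float]:
    """Split a cell like ``11.11% ± 3.93%`` into (mean, spread) percentages."""
    mean_text, spread_text = cell_text.split("±")
    return (
        float(mean_text.strip().rstrip("%")),
        float(spread_text.strip().rstrip("%")),
    )


parsed_rows = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    overall_mean, overall_spread = parse_score_cell(row["Overall"])
    archaeology_mean, archaeology_spread = parse_score_cell(row["Archaeology"])
    parsed_rows.append((row["System"], row["Models"], overall_mean, overall_spread))
    print(row["System"], row["Models"], overall_mean, overall_spread, archaeology_mean)
```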

data/benchmark_results.csv

Lines changed: 22 additions & 25 deletions
@@ -1,25 +1,22 @@
1-
System,Models,Archaeology,Astronomy,Biomedical,Environment,Legal,Wildfire,Overall
2-
DS-Guru no-context,GPT-o3,25%,1.73%,3.50%,1.35%,3.35%,24.87%,9.64%
3-
DS-Guru no-context,GPT-4o,0.00%,1.41%,1.98%,0.45%,1.46%,1.45%,1.62%
4-
DS-Guru no-context,Claude-3.5,16.67%,1.62%,2.87%,1.17%,7.33%,13.63%,7.45%
5-
DS-Guru no-context,Llama3-3Instruct,0.00%,1.43%,1.70%,0.98%,1.37%,1.44%,1.19%
6-
DS-Guru no-context,DeepSeek-R1,0.00%,1.50%,2.49%,2.60%,1.61%,6.46%,3.14%
7-
DS-Guru no-context,Qwen2-5Coder,0.00%,1.37%,2.02%,1.07%,1.44%,13.68%,3.72%
8-
DS-Guru one-shot,GPT-o3,25%,3.00%,8.63%,7.66%,19.15%,45.95%,20.80%
9-
DS-Guru one-shot,GPT-4o,8.33%,1.40%,9.38%,2.60%,2.74%,19.39%,7.61%
10-
DS-Guru one-shot,Claude-3.5,0.00%,4.15%,2.15%,6.21%,6.68%,34.99%,10.85%
11-
DS-Guru one-shot,Llama3-3Instruct,0.00%,1.42%,10.38%,0.98%,5.48%,9.81%,4.81%
12-
DS-Guru one-shot,DeepSeek-R1,0.00%,1.57%,3.39%,2.60%,8.30%,14.81%,6.35%
13-
DS-Guru one-shot,Qwen2-5Coder,0.00%,1.36%,2.22%,12.59%,1.15%,16.48%,6.43%
14-
DS-Guru few-shot,GPT-o3,25%,3.53%,8.95%,19.6%,13.89%,50.73%,22.08%
15-
DS-Guru few-shot,GPT-4o,16.67%,2.76%,8.97%,2.60%,2.80%,17.18%,8.28%
16-
DS-Guru few-shot,Claude-3.5,16.67%,1.52%,1.96%,11.21%,7.01%,39.16%,14.35%
17-
DS-Guru few-shot,Llama3-3Instruct,0.00%,1.35%,6.98%,0.93%,2.15%,14.49%,4.48%
18-
DS-Guru few-shot,DeepSeek-R1,8.33%,2.64%,2.87%,19.08%,8.39%,30.29%,6.34%
19-
DS-Guru few-shot,Qwen2-5Coder,8.33%,2.40%,4.35%,12.64%,9.06%,16.48%,9.98%
20-
smolagents DR,GPT-o3,41.67%,16.67%,33.33%,50%,50%,38.1%,41.36%
21-
smolagents DR,GPT-4o,33.33%,0.00%,11.11%,35%,40%,38.1%,30.77%
22-
smolagents DR,Claude-3-5,33.33%,0.00%,22.22%,60%,46.67%,52.38%,41.35%
23-
smolagents DR,Claude-3-7,33.33%,16.67%,44.44%,60%,63.33%,52.38%,50%
24-
openAI, Deep Research*, 40.00%, 33.33%, 44.45%, 61.67%, 50.00%, 67.28%, 52.18%
25-
Google, Gemini 2.5 Pro*, 25.00%, 16.67%, 33.33%, 25.00%, 13.33%, 24.87%, 18.48%
1+
System,Models,Archaeology,Astronomy,Biomedical,Environment,Legal,Wildfire,Overall,Overall Benchmark Time
2+
DS-Guru no-context,GPT-o3,16.67% ± 0.00%,0.00% ± 0.00%,0.10% ± 0.14%,5.00% ± 4.08%,0.00% ± 0.00%,14.18% ± 4.28%,5.87% ± 0.71%,00:59:19 ± 00:06:11
3+
DS-Guru no-context,GPT-4o,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,00:37:13 ± 00:02:17
4+
DS-Guru no-context,Llama3-3Instruct,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,1.59% ± 2.24%,0.33% ± 0.46%,00:33:48 ± 00:03:44
5+
DS-Guru no-context,DeepSeek-R1,2.78% ± 3.93%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.33% ± 0.46%,01:32:46 ± 00:06:56
6+
DS-Guru no-context,Claude-3-7,2.78% ± 3.93%,3.33% ± 4.71%,0.00% ± 0.00%,0.00% ± 0.00%,1.11% ± 1.57%,4.38% ± 3.92%,1.88% ± 0.81%,01:27:25 ± 00:05:33
7+
DS-Guru one-shot,GPT-o3,19.44% ± 3.93%,0.00% ± 0.00%,0.53% ± 0.75%,16.11% ± 9.26%,10.00% ± 0.00%,40.01% ± 4.22%,16.67% ± 2.93%,01:08:41 ± 00:07:54
8+
DS-Guru one-shot,GPT-4o,8.33% ± 6.80%,6.67% ± 4.71%,3.70% ± 0.00%,0.00% ± 0.00%,4.41% ± 2.17%,13.69% ± 3.06%,6.08% ± 1.25%,00:37:44 ± 00:02:06
9+
DS-Guru one-shot,Llama3-3Instruct,16.67% ± 0.00%,0.00% ± 0.00%,0.10% ± 0.14%,0.00% ± 0.00%,0.74% ± 0.52%,5.97% ± 2.55%,3.42% ± 0.37%,01:35:03 ± 00:10:40
10+
DS-Guru one-shot,DeepSeek-R1,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,01:42:04 ± 00:04:31
11+
DS-Guru one-shot,Claude-3-7,2.78% ± 3.93%,0.00% ± 0.00%,0.00% ± 0.00%,3.33% ± 2.36%,0.00% ± 0.00%,1.59% ± 2.24%,1.31% ± 1.22%,02:46:08 ± 00:35:45
12+
DS-Guru few-shot,GPT-o3,13.89% ± 3.93%,0.00% ± 0.00%,0.10% ± 0.14%,49.56% ± 0.87%,9.26% ± 2.92%,52.92% ± 0.72%,24.98% ± 1.25%,02:24:55 ± 00:40:56
13+
DS-Guru few-shot,GPT-4o,13.89% ± 3.93%,3.33% ± 4.71%,0.00% ± 0.00%,0.00% ± 0.00%,4.84% ± 3.16%,25.69% ± 1.15%,8.67% ± 1.48%,01:10:24 ± 00:17:30
14+
DS-Guru few-shot,Llama3-3Instruct,22.22% ± 3.93%,0.00% ± 0.00%,0.20% ± 0.14%,0.00% ± 0.00%,3.33% ± 0.00%,23.25% ± 3.06%,8.40% ± 1.01%,02:05:10 ± 00:20:17
15+
DS-Guru few-shot,DeepSeek-R1,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,0.00% ± 0.00%,1.59% ± 2.24%,0.33% ± 0.46%,01:51:50 ± 00:11:38
16+
DS-Guru few-shot,Claude-3-7,5.56% ± 3.93%,0.00% ± 0.00%,0.00% ± 0.00%,15.00% ± 4.08%,8.89% ± 6.29%,10.09% ± 9.67%,8.29% ± 1.12%,04:06:39 ± 00:43:44
17+
smolagents DR,GPT-o3,33.33% ± 11.79%,23.33% ± 9.43%,37.04% ± 10.48%,31.33% ± 8.96%,17.78% ± 8.31%,35.23% ± 11.46%,28.07% ± 8.80%,11:25:19 ± 00:41:34
18+
smolagents DR,Claude-3-7,44.44% ± 3.93%,50.00% ± 8.16%,38.89% ± 7.86%,60.56% ± 7.74%,61.23% ± 12.70%,60.16% ± 7.43%,55.83% ± 3.41%,08:20:56 ± 00:51:00
19+
smolagents Reflexion,GPT-o3,30.56% ± 10.39%,46.67% ± 4.71%,29.63% ± 5.24%,23.89% ± 14.55%,26.67% ± 14.40%,35.04% ± 8.36%,30.53% ± 10.79%,15:16:55 ± 07:22:44
20+
smolagents Reflexion,Claude-3-7,27.78% ± 19.64%,30.00% ± 21.60%,33.33% ± 32.71%,37.50% ± 27.46%,38.89% ± 27.53%,43.58% ± 30.83%,36.91% ± 26.25%,07:56:16 ± 05:12:38
21+
smolagents PDT,GPT-o3,8.33% ± 6.80%,0.00% ± 0.00%,0.00% ± 0.00%,1.42% ± 1.09%,4.51% ± 0.84%,22.01% ± 1.39%,7.57% ± 1.28%,13:10:22 ± 00:55:26
22+
smolagents PDT,Claude-3-7,19.44% ± 7.86%,6.67% ± 4.71%,5.56% ± 5.56%,9.84% ± 2.86%,6.25% ± 4.95%,22.58% ± 4.05%,12.01% ± 1.03%,14:42:23 ± 00:25:45

data/pipeline_design_scores.csv

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
System,Models,Overall
2+
DS-Guru one-shot,GPT-o3,41.71% ± 1.00%
3+
DS-Guru no-context,GPT-o3,39.75% ± 0.73%
4+
DS-Guru few-shot,GPT-o3,37.99% ± 2.64%
5+
DS-Guru no-context,GPT-4o,29.73% ± 1.16%
6+
smolagents Reflexion,GPT-o3,24.43% ± 7.09%
7+
DS-Guru no-context,Llama3-3Instruct,21.80% ± 1.30%
8+
smolagents DR,GPT-o3,21.39% ± 1.16%
9+
DS-Guru one-shot,GPT-4o,21.26% ± 0.64%
10+
DS-Guru few-shot,GPT-4o,20.31% ± 0.07%
11+
DS-Guru one-shot,Llama3-3Instruct,15.53% ± 0.73%
12+
DS-Guru few-shot,Llama3-3Instruct,12.81% ± 1.04%
13+
smolagents PDT,GPT-o3,5.46% ± 1.17%
14+
DS-Guru one-shot,DeepSeek-R1,1.59% ± 0.77%
15+
DS-Guru few-shot,DeepSeek-R1,1.11% ± 0.72%
16+
DS-Guru no-context,DeepSeek-R1,0.63% ± 0.47%
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
System,Models,Overall
2+
smolagents Reflexion,GPT-o3,24.94% ± 3.28%
3+
smolagents DR,GPT-o3,22.55% ± 2.52%
4+
DS-Guru few-shot,GPT-o3,22.05% ± 0.71%
5+
DS-Guru one-shot,GPT-o3,17.33% ± 0.48%
6+
DS-Guru no-context,GPT-o3,10.69% ± 0.35%
7+
smolagents PDT,GPT-o3,9.52% ± 0.49%
8+
DS-Guru few-shot,GPT-4o,6.60% ± 0.29%
9+
DS-Guru few-shot,DeepSeek-R1,5.79% ± 0.72%
10+
DS-Guru few-shot,Llama3-3Instruct,5.53% ± 0.49%
11+
DS-Guru one-shot,GPT-4o,5.08% ± 0.57%
12+
DS-Guru one-shot,DeepSeek-R1,3.91% ± 0.44%
13+
DS-Guru no-context,Llama3-3Instruct,3.88% ± 0.25%
14+
DS-Guru no-context,GPT-4o,3.32% ± 0.54%
15+
DS-Guru one-shot,Llama3-3Instruct,2.80% ± 0.08%
16+
DS-Guru no-context,DeepSeek-R1,1.48% ± 0.08%
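Both pipeline score files are listed in descending order of the `Overall` mean, and the tables added to `index.html` in this commit have a `Rank` column. The Python sketch below shows how a rank could be derived from such rows by sorting on the mean part of the cell. It is an assumption about the intended ordering, not the site's actual loading code, which this diff does not include. The three rows are copied from `data/pipeline_design_scores.csv` and shuffled on purpose.

```python
import csv
import io

# Three rows from data/pipeline_design_scores.csv, deliberately out of order.
CSV_TEXT = (
    "System,Models,Overall\n"
    "smolagents DR,GPT-o3,21.39% ± 1.16%\n"
    "DS-Guru one-shot,GPT-o3,41.71% ± 1.00%\n"
    "DS-Guru no-context,GPT-4o,29.73% ± 1.16%\n"
)

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# The mean is the text before the first "%" sign; sort on it, highest first.
rows.sort(key=lambda row: float(row["Overall"].split("%")[0]), reverse=True)

# Attach a 1-based rank matching the Rank / System / Model / Overall Score columns.
ranked_rows = [
    (rank, row["System"], row["Models"], row["Overall"])
    for rank, row in enumerate(rows, start=1)
]

for ranked_row in ranked_rows:
    print(*ranked_row, sep=" | ")
```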

index.html

Lines changed: 39 additions & 0 deletions
@@ -15,6 +15,7 @@ <h3>KramaBench</h3>
 <ul>
 <li><a href="#about">About</a></li>
 <li><a href="#leaderboard">Leaderboard</a></li>
+<li><a href="#pipeline-scores">Pipeline Scores</a></li>
 <li><a href="#name-origin">The name KramaBench</a></li>
 <li><a href="#contribute">Submit your results</a></li>
 <li><a href="#resources">Resources</a></li>
@@ -103,13 +104,51 @@ <h2>Current Rankings</h2>
 <th>System</th>
 <th>Model</th>
 <th>Score (%)</th>
+<th>Overall Benchmark Time</th>
 </tr>
 </thead>
 <tbody></tbody>
 </table>
 </div>
 </section>
 
+<!-- Pipeline Scores Section -->
+<section id="pipeline-scores" class="leaderboard-section pipeline-scores-section">
+<h2>Pipeline Scores</h2>
+<div class="pipeline-table-group">
+<h3>Pipeline Design</h3>
+<div class="table-container">
+<table id="pipeline-design-table" class="scores-table">
+<thead>
+<tr>
+<th>Rank</th>
+<th>System</th>
+<th>Model</th>
+<th>Overall Score (%)</th>
+</tr>
+</thead>
+<tbody></tbody>
+</table>
+</div>
+</div>
+<div class="pipeline-table-group">
+<h3>Pipeline Implementation</h3>
+<div class="table-container">
+<table id="pipeline-implementation-table" class="scores-table">
+<thead>
+<tr>
+<th>Rank</th>
+<th>System</th>
+<th>Model</th>
+<th>Overall Score (%)</th>
+</tr>
+</thead>
+<tbody></tbody>
+</table>
+</div>
+</div>
+</section>
+
 
 <!-- Name Origin Section -->
 <section id="name-origin" class="name-origin-section">
