Commit 7780943

docs: highlight benchmark dashboard and links
1 parent c584dfe commit 7780943

3 files changed

Lines changed: 125 additions & 37 deletions

README.md

Lines changed: 33 additions & 4 deletions
@@ -4,10 +4,41 @@
 [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
 [![BixBench-Verified-50](https://img.shields.io/badge/benchmark-BixBench--Verified--50-green.svg)](https://huggingface.co/datasets/phylobio/BixBench-Verified-50)
 [![BioAgents](https://img.shields.io/badge/backend-BioAgents-purple.svg)](https://github.com/bio-xyz/BioAgents)
-[![Get API Key](https://img.shields.io/badge/API%20Key-chat.bio.xyz-teal.svg)](https://chat.bio.xyz/)
+[![Get API Key](https://img.shields.io/badge/API%20Key-chat.bio.xyz-teal.svg)](https://chat.bio.xyz/chat?settings=account&section=api-keys)
 
 BixBench evaluation harness for [BioAgents](https://github.com/bio-xyz/BioAgents) and its closed-source literature and data-analysis agents.
 
+### Overall highlights (from dashboard)
+
+| Mode | Accuracy | 95% Wilson CI | Correct / Total |
+| ----------------------- | -------: | ------------: | --------------: |
+| Direct | 71.3% | 67.4% - 74.9% | 392 / 550 |
+| MCQ with refusal | 85.1% | 81.9% - 87.8% | 468 / 550 |
+| **MCQ without refusal** | 90.0% | 87.2% - 92.2% | 495 / 550 |
+
+| Headline | Value |
+| --------------------------------------------------------- | ------------------: |
+| MCQ lift (Direct -> **MCQ without refusal**) | +18.7pp |
+| Refusal gap (**MCQ without refusal** -> MCQ with refusal) | -4.9pp |
+| Best repeat (**MCQ without refusal**) | 96.0% (`085809-r3`) |
+| Task groups at 100% | 22 / 32 (68.8%) |
+
+### Showcase: Benchmark Dashboard
+
+Interactive page from `docs/`:
+
+- Live (GitHub Pages): [bio-xyz.github.io/bio-benchmark](https://bio-xyz.github.io/bio-benchmark/)
+- Source: [`docs/index.html`](docs/index.html)
+
+[![Benchmark dashboard preview](docs/assets/performance_analysis.png)](https://bio-xyz.github.io/bio-benchmark/)
+
+Local preview:
+
+```bash
+python3 -m http.server 8080 --directory docs
+# open http://localhost:8080
+```
+
 ### Minimal config
 
 Only these fields are required:
@@ -76,7 +107,5 @@ python3 scripts/build_pages_data.py
 This script updates:
 
 - `docs/data/benchmark_summary.json`
-- `docs/data/results/*.csv` (only the 3 scoped CSV files)
+- `docs/data/results/*.csv`
 - `docs/assets/performance_analysis.png`
-
-GitHub Pages deploys automatically via `.github/workflows/pages.yml` on pushes to `main` that touch `docs/` or relevant source result files. In repository settings, set **Pages -> Build and deployment -> Source** to **GitHub Actions**.
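
Review note: the highlight tables added to README.md can be cross-checked from the Correct / Total counts alone. The sketch below assumes the dashboard uses the plain Wilson score interval with z = 1.96 and no continuity correction. The diff does not show how the dashboard computes it, and `wilson_interval` is a hypothetical helper name, not a function from this repository.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (low, high) Wilson score interval for correct/total.

    Assumption: no continuity correction; z=1.96 approximates a 95% level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts copied from the README table in this commit.
modes = {
    "Direct": (392, 550),
    "MCQ with refusal": (468, 550),
    "MCQ without refusal": (495, 550),
}

for name, (correct, total) in modes.items():
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct / total:.1%} ({low:.1%} - {high:.1%})")

# Headline deltas in percentage points, derived from the same counts.
lift = (495 - 392) / 550 * 100  # Direct -> MCQ without refusal
gap = (468 - 495) / 550 * 100   # MCQ without refusal -> MCQ with refusal
print(f"MCQ lift: {lift:+.1f}pp, refusal gap: {gap:+.1f}pp")
```

Deriving the deltas from raw counts rather than from the rounded percentages avoids compounding rounding error in the headline figures.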

docs/index.html

Lines changed: 71 additions & 32 deletions
@@ -3,24 +3,21 @@
 <head>
 <meta charset="UTF-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-<title>PhyloBioBixBench-Verified-50 | Bio Benchmark Results</title>
+<title>BIOS Benchmark Results</title>
 <meta
 name="description"
-content="Interactive dashboard for PhyloBioBixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
+content="Interactive dashboard for BixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
 />
 
 <!-- Favicon -->
 <link rel="icon" type="image/svg+xml" href="assets/favicon.svg" />
 
 <!-- Open Graph -->
 <meta property="og:type" content="website" />
-<meta
-property="og:title"
-content="PhyloBioBixBench-Verified-50 | Bio Benchmark"
-/>
+<meta property="og:title" content="BIOS Benchmark Results" />
 <meta
 property="og:description"
-content="Interactive dashboard for PhyloBioBixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
+content="Interactive dashboard for BixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
 />
 <meta
 property="og:url"
@@ -30,13 +27,10 @@
 
 <!-- Twitter Card -->
 <meta name="twitter:card" content="summary" />
-<meta
-name="twitter:title"
-content="PhyloBioBixBench-Verified-50 | Bio Benchmark"
-/>
+<meta name="twitter:title" content="BIOS Benchmark Results" />
 <meta
 name="twitter:description"
-content="Interactive dashboard for PhyloBioBixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
+content="Interactive dashboard for BixBench-Verified-50 benchmark results — direct, MCQ, and refusal-aware grading across 50 biology questions."
 />
 
 <link rel="preconnect" href="https://fonts.googleapis.com" />
@@ -55,17 +49,62 @@
 <!-- 1. Hero -->
 <header class="hero">
 <p class="eyebrow">
-<a href="https://ai.bio.xyz/" class="eyebrow-link">ai.bio.xyz</a>
+<a
+href="https://ai.bio.xyz/"
+class="eyebrow-link"
+target="_blank"
+rel="noopener noreferrer"
+>ai.bio.xyz</a
+>
 </p>
-<h1>PhyloBioBixBench-Verified-50</h1>
+<h1>BixBench-Verified-50</h1>
 <div class="hero-meta">
 <span id="generated-at">Loading data...</span>
 <span id="totals-badge"></span>
 </div>
 <nav class="hero-links">
-<a href="https://ai.bio.xyz/" class="hero-link">ai.bio.xyz</a>
-<a href="https://github.com/bio-xyz/bio-benchmark" class="hero-link"
->GitHub</a
+<a
+href="https://ai.bio.xyz/"
+class="hero-link"
+target="_blank"
+rel="noopener noreferrer"
+><span class="hero-link-icon" aria-hidden="true">
+<svg viewBox="0 0 24 24" focusable="false">
+<circle cx="12" cy="12" r="9"></circle>
+<path d="M3 12h18"></path>
+<path d="M12 3c2.8 2.4 4.5 5.6 4.5 9s-1.7 6.6-4.5 9"></path>
+<path d="M12 3c-2.8 2.4-4.5 5.6-4.5 9s1.7 6.6 4.5 9"></path>
+</svg>
+</span>
+<span class="hero-link-label">ai.bio.xyz</span></a
+>
+<a
+href="https://github.com/bio-xyz/bio-benchmark"
+class="hero-link"
+target="_blank"
+rel="noopener noreferrer"
+><span class="hero-link-icon" aria-hidden="true">
+<svg viewBox="0 0 24 24" focusable="false">
+<path
+d="M9 19c-5 1.5-5-2.5-7-3m14 6v-3.87a3.36 3.36 0 0 0-.94-2.61c3.14-.35 6.44-1.54 6.44-7A5.44 5.44 0 0 0 20 4.77 5.07 5.07 0 0 0 19.91 1S18.73.65 16 2.48a13.38 13.38 0 0 0-7 0C6.27.65 5.09 1 5.09 1A5.07 5.07 0 0 0 5 4.77a5.44 5.44 0 0 0-1.5 3.78c0 5.42 3.3 6.61 6.44 7A3.36 3.36 0 0 0 9 18.13V22"
+></path>
+</svg>
+</span>
+<span class="hero-link-label">GitHub</span></a
+>
+<a
+href="https://huggingface.co/datasets/phylobio/BixBench-Verified-50"
+class="hero-link"
+target="_blank"
+rel="noopener noreferrer"
+><span class="hero-link-icon" aria-hidden="true">
+<svg viewBox="0 0 24 24" focusable="false">
+<path d="M4 19h16"></path>
+<path d="M6 15.5 10.5 11l3.5 3.5L19 8"></path>
+<path d="m16 8 3-.2-.2 3"></path>
+</svg>
+</span>
+<span class="hero-link-label">Hugging Face</span></a
 >
 </nav>
 </header>
@@ -83,13 +122,19 @@ <h2>Grading Modes</h2>
 aria-label="Headline metrics"
 ></section>
 
-<!-- 4. Key Takeaways -->
+<!-- 4. Scientific Reporting -->
+<section class="insight-panel" id="reporting-section">
+<h2>Scientific Reporting</h2>
+<div id="reporting-content" class="reporting-grid"></div>
+</section>
+
+<!-- 5. Key Takeaways -->
 <section class="insight-panel" id="takeaways-section">
 <h2>Key Takeaways</h2>
 <ol id="takeaways-list" class="takeaway-list"></ol>
 </section>
 
-<!-- 5. Source Runs + Downloads -->
+<!-- 6. Source Runs + Downloads -->
 <section class="panel source-panel">
 <div>
 <h2>Scoped Source Runs</h2>
@@ -111,7 +156,7 @@ <h2>Downloads</h2>
 </div>
 </section>
 
-<!-- 6. Performance Image -->
+<!-- 7. Performance Image -->
 <section class="panel">
 <h2>Existing Performance Figure</h2>
 <p class="panel-intro">
@@ -125,7 +170,7 @@ <h2>Existing Performance Figure</h2>
 />
 </section>
 
-<!-- 7. Charts Grid (6 existing + 1 new task group chart) -->
+<!-- 8. Charts Grid (6 existing + 1 new task group chart) -->
 <section class="charts">
 <article class="panel chart-panel">
 <h2>Overall Accuracy</h2>
@@ -179,7 +224,7 @@ <h2>Task Group Accuracy (32 groups)</h2>
 </article>
 </section>
 
-<!-- 8. Strengths & Weaknesses -->
+<!-- 9. Strengths & Weaknesses -->
 <section class="insight-panel" id="sw-section">
 <h2>Strengths &amp; Weaknesses</h2>
 <div class="sw-grid">
@@ -194,7 +239,7 @@ <h3>Weaknesses</h3>
 </div>
 </section>
 
-<!-- 9. All 50 Questions Table -->
+<!-- 10. All 50 Questions Table -->
 <section class="panel">
 <h2>All 50 Questions</h2>
 <div class="table-scroll">
@@ -215,7 +260,7 @@ <h2>All 50 Questions</h2>
 </div>
 </section>
 
-<!-- 10. All 32 Task Groups Table -->
+<!-- 11. All 32 Task Groups Table -->
 <section class="panel">
 <h2>All 32 Task Groups</h2>
 <div class="table-scroll">
@@ -235,13 +280,13 @@ <h2>All 32 Task Groups</h2>
 </div>
 </section>
 
-<!-- 11. MCQ Rescue Detail -->
+<!-- 12. MCQ Rescue Detail -->
 <section class="panel" id="rescue-detail-section">
 <h2>MCQ Rescue Detail</h2>
 <div id="rescue-detail-content"></div>
 </section>
 
-<!-- 12. Cross-Run Variability -->
+<!-- 13. Cross-Run Variability -->
 <section class="panel" id="variability-section">
 <h2>Cross-Run Variability</h2>
 <p class="panel-intro">
@@ -256,12 +301,6 @@ <h2>Cross-Run Variability</h2>
 </div>
 </section>
 
-<!-- 13. Scientific Reporting -->
-<section class="insight-panel" id="reporting-section">
-<h2>Scientific Reporting</h2>
-<div id="reporting-content" class="reporting-grid"></div>
-</section>
-
 <!-- 14. Best Repeats Table (all 11) -->
 <section class="panel">
 <h2>Best Repeats</h2>
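
Review note: every link this commit opens in a new tab pairs `target="_blank"` with `rel="noopener noreferrer"`. A minimal sketch of how that convention could be checked for the whole page, using only Python's standard `html.parser`. The class and function names are hypothetical and nothing like this exists in the repository as far as the diff shows.

```python
from html.parser import HTMLParser


class BlankTargetChecker(HTMLParser):
    """Collect hrefs of <a target="_blank"> tags missing noopener/noreferrer."""

    def __init__(self):
        super().__init__()
        self.offenders = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if attr.get("target") != "_blank":
            return
        # rel is a space-separated token list; a missing rel yields an empty set.
        rel_tokens = set((attr.get("rel") or "").split())
        if not {"noopener", "noreferrer"} <= rel_tokens:
            self.offenders.append(attr.get("href"))


def find_unsafe_links(html):
    """Return hrefs of new-tab links that lack rel="noopener noreferrer"."""
    checker = BlankTargetChecker()
    checker.feed(html)
    checker.close()
    return checker.offenders
```

For example, `find_unsafe_links(open("docs/index.html", encoding="utf-8").read())` would list any new-tab link that does not follow the pattern used in the hero section.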

docs/styles.css

Lines changed: 21 additions & 1 deletion
@@ -123,13 +123,14 @@ h1 {
 .hero-links {
 margin-top: 16px;
 display: flex;
+flex-wrap: wrap;
 gap: 12px;
 }
 
 .hero-link {
 display: inline-flex;
 align-items: center;
-gap: 6px;
+gap: 8px;
 padding: 6px 14px;
 border-radius: 8px;
 border: 1px solid var(--border);
@@ -145,6 +146,25 @@ h1 {
 color: var(--accent-green);
 }
 
+.hero-link-icon {
+display: inline-flex;
+align-items: center;
+justify-content: center;
+width: 14px;
+height: 14px;
+flex-shrink: 0;
+}
+
+.hero-link-icon svg {
+width: 100%;
+height: 100%;
+fill: none;
+stroke: currentColor;
+stroke-width: 1.8;
+stroke-linecap: round;
+stroke-linejoin: round;
+}
+
 code {
 padding: 1px 6px;
 border-radius: 6px;
