Skip to content

Commit 96c540e

Browse files
authored
Merge pull request #39 from YangChenyuan/leaderboard
feat(leaderboard): Add the webpage for Verus-Bench and VeruSAGE-Bench
2 parents 27e416c + abe4f30 commit 96c540e

File tree

12 files changed

+3270
-0
lines changed

12 files changed

+3270
-0
lines changed
Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
# GitHub Actions workflow to deploy the leaderboard to GitHub Pages
2+
name: Deploy Leaderboard to GitHub Pages
3+
4+
on:
5+
# Runs on pushes targeting the default branch
6+
push:
7+
branches: ["main", "leaderboard"]
8+
paths:
9+
- 'leaderboard/**'
10+
11+
# Allows you to run this workflow manually from the Actions tab
12+
workflow_dispatch:
13+
14+
# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
15+
permissions:
16+
contents: read
17+
pages: write
18+
id-token: write
19+
20+
# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
21+
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
22+
concurrency:
23+
group: "pages"
24+
cancel-in-progress: false
25+
26+
jobs:
27+
# Build job
28+
build:
29+
runs-on: ubuntu-latest
30+
steps:
31+
- name: Checkout
32+
uses: actions/checkout@v4
33+
34+
- name: Setup Pages
35+
uses: actions/configure-pages@v5
36+
37+
- name: Upload artifact
38+
uses: actions/upload-pages-artifact@v3
39+
with:
40+
# Upload the leaderboard directory
41+
path: './leaderboard'
42+
43+
# Deployment job
44+
deploy:
45+
environment:
46+
name: github-pages
47+
url: ${{ steps.deployment.outputs.page_url }}
48+
runs-on: ubuntu-latest
49+
needs: build
50+
steps:
51+
- name: Deploy to GitHub Pages
52+
id: deployment
53+
uses: actions/deploy-pages@v4

README.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,10 +4,13 @@
44
<a href="https://arxiv.org/abs/2409.13082"><img src="https://img.shields.io/badge/AutoVerus-arXiv%202409.13082-b31b1b.svg?style=for-the-badge"></a>
55
<a href="https://arxiv.org/abs/2512.18436"><img src="https://img.shields.io/badge/VeruSAGE-arXiv%202512.18436-b31b1b.svg?style=for-the-badge"></a>
66
<a href="https://www.microsoft.com/en-us/research/project/practical-system-verification/"><img src="https://img.shields.io/badge/Website-blue.svg?style=for-the-badge"></a>
7+
<a href="https://microsoft.github.io/verus-proof-synthesis/"><img src="https://img.shields.io/badge/🏆_Leaderboard-View_Results-6366f1.svg?style=for-the-badge"></a>
78
</p>
89

910
This repository contains code and artifacts for automated [Verus](https://github.com/verus-lang/verus) proof synthesis using LLM-based approaches. It includes two proof synthesis systems and two benchmark suites.
1011

12+
> 🏆 **[View the Leaderboard](https://microsoft.github.io/verus-proof-synthesis/)** — Compare proof synthesis systems on our benchmarks!
13+
1114
---
1215

1316
## � Repository Contents

leaderboard/.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
tmp-raw-results-*/

leaderboard/README.md

Lines changed: 188 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,188 @@
1+
# Verus Proof Synthesis Leaderboard
2+
3+
A web-based leaderboard for tracking progress in LLM-based formal verification for Rust using Verus.
4+
5+
## 🌐 Live Website
6+
7+
Visit the leaderboard at: **[https://microsoft.github.io/verus-proof-synthesis/](https://microsoft.github.io/verus-proof-synthesis/)**
8+
9+
## 📁 Directory Structure
10+
11+
```
12+
leaderboard/
13+
├── index.html # Main leaderboard page
14+
├── submit.html # Submission guidelines
15+
├── about.html # About page
16+
├── css/
17+
│ └── main.css # Stylesheet
18+
├── js/
19+
│ └── leaderboard.js # Interactive functionality
20+
├── data/
21+
│ ├── schema.json # JSON schema for submissions
22+
│ ├── verus-bench-results.json # Verus-Bench leaderboard data
23+
│ └── verusage-bench-results.json # VeruSAGE-Bench leaderboard data
24+
└── README.md # This file
25+
```
26+
27+
## 🚀 Running Locally
28+
29+
The leaderboard is a static website that can be served by any web server:
30+
31+
```bash
32+
# Using Python
33+
cd leaderboard
34+
python -m http.server 8000
35+
# Visit http://localhost:8000
36+
37+
# Using Node.js (if you have serve installed)
38+
npx serve .
39+
40+
# Using PHP
41+
php -S localhost:8000
42+
```
43+
44+
## 📤 Submitting Results
45+
46+
### Quick Start
47+
48+
1. **Run evaluation** on Verus-Bench or VeruSAGE-Bench
49+
2. **Format results** according to our [JSON schema](data/schema.json)
50+
3. **Open a PR** adding your entry to the appropriate data file
51+
52+
### Submission Format
53+
54+
```json
55+
{
56+
"submission_id": "your-system-model-version",
57+
"system_name": "Your System Name",
58+
"model": "LLM Model Used",
59+
"date": "YYYY-MM-DD",
60+
"results": {
61+
"solved": 135,
62+
"total": 150,
63+
"percent_solved": 90.0,
64+
"avg_time_seconds": 28.5,
65+
"avg_cost_usd": 0.25
66+
},
67+
"breakdown": [
68+
{"category": "CloverBench", "solved": 11, "total": 11}
69+
],
70+
"paper_url": "https://arxiv.org/abs/...",
71+
"code_url": "https://github.com/...",
72+
"verified": false,
73+
"notes": "Brief description"
74+
}
75+
```
76+
77+
### Required Fields
78+
79+
| Field | Description |
80+
|-------|-------------|
81+
| `submission_id` | Unique identifier (e.g., "mysystem-gpt4-v1.0") |
82+
| `system_name` | Name of your proof synthesis system |
83+
| `model` | LLM model used |
84+
| `date` | Submission date (YYYY-MM-DD) |
85+
| `results.solved` | Number of tasks solved |
86+
| `results.total` | Total tasks attempted |
87+
| `results.percent_solved` | Percentage solved |
88+
89+
### Optional Fields
90+
91+
| Field | Description |
92+
|-------|-------------|
93+
| `results.avg_time_seconds` | Average time per task (Verus-Bench) |
94+
| `results.avg_time_minutes` | Average time per task (VeruSAGE-Bench) |
95+
| `results.avg_cost_usd` | Average cost per task in USD |
96+
| `breakdown` | Per-source/project breakdown |
97+
| `paper_url` | Link to paper |
98+
| `code_url` | Link to code repository |
99+
| `notes` | Additional information |
100+
101+
## ✅ Verification Status
102+
103+
Submissions are labeled with verification status:
104+
105+
- **Verified** — Results independently reproduced by maintainers
106+
107+
To expedite verification, please:
108+
- Provide detailed reproduction instructions
109+
- Make evaluation scripts publicly available
110+
- Include exact Verus version used
111+
112+
## ⚠️ Rules
113+
114+
1. **No cheating**: Submissions using `assume false`, `#[verifier::external_body]`, or other trivial solutions will be rejected
115+
2. **Use standard Verus**: Use the recommended Verus version in benchmark README
116+
3. **Report honestly**: Results should be accurate and reproducible
117+
4. **One entry per configuration**: Submit separate entries for different model/system combinations
118+
119+
## 🛠️ Development
120+
121+
### Modifying the Leaderboard
122+
123+
1. Edit HTML/CSS/JS files as needed
124+
2. Test locally with a web server
125+
3. Submit a PR with your changes
126+
127+
### Adding New Data
128+
129+
To add a new submission:
130+
131+
1. Edit `data/verus-bench-results.json` or `data/verusage-bench-results.json`
132+
2. Add your entry to the `submissions` array
133+
3. Ensure your entry follows the schema in `data/schema.json`
134+
4. Submit a PR
135+
136+
## 📊 Benchmarks
137+
138+
### Verus-Bench (150 tasks)
139+
140+
Algorithm-level verification tasks from classic CS problems.
141+
142+
| Source | Tasks | Description |
143+
|--------|-------|-------------|
144+
| CloverBench | 11 | Classic CS examples |
145+
| MBPP | 78 | Formal specification problems |
146+
| Diffy | 38 | Array/loop programs |
147+
| Misc | 23 | Verus tutorial examples |
148+
149+
### VeruSAGE-Bench (849 tasks)
150+
151+
Repository-level verification tasks from real-world systems.
152+
153+
| Project | Code | Tasks | Domain |
154+
|---------|------|-------|--------|
155+
| Anvil | AL | 104 | Distributed Systems |
156+
| Anvil Advanced | AC | 63 | Distributed Systems |
157+
| IronKV | IR | 118 | Key-Value Store |
158+
| Memory Allocator | MA | 89 | Systems |
159+
| Node Replication | NO | 29 | Distributed Systems |
160+
| NRKernel | NR | 204 | OS Kernel |
161+
| ATMO | OS | 157 | Microkernel |
162+
| Storage | ST | 63 | Storage Systems |
163+
| Vest | VE | 22 | Serialization |
164+
165+
## 📚 Citation
166+
167+
```bibtex
168+
@article{autoverus,
169+
title={AutoVerus: Automated Proof Generation for Rust Code},
170+
author={Yang, Chenyuan and Li, Xuheng and Misu, Md Rakib Hossain and others},
171+
journal={PACMPL},
172+
volume={9},
173+
number={OOPSLA2},
174+
year={2025}
175+
}
176+
177+
@misc{verusage,
178+
title={VeruSAGE: A Study of Agent-Based Verification for Rust Systems},
179+
author={Yang, Chenyuan and Neamtu, Natalie and Hawblitzel, Chris and others},
180+
year={2025},
181+
eprint={2512.18436},
182+
archivePrefix={arXiv}
183+
}
184+
```
185+
186+
## 📧 Contact
187+
188+
For questions or issues, please open an issue on the [main repository](https://github.com/microsoft/verus-proof-synthesis/issues).

0 commit comments

Comments
 (0)