
Commit faa1ff3

Merge pull request #10 from systemintelligence/sysmobench
Integrate SysMoBench benchmark via Git Subtree
2 parents 1577aca + 1a7f1b1 commit faa1ff3

513 files changed: +214,738 −1 lines changed


README.md

Lines changed: 15 additions & 1 deletion
```diff
@@ -11,10 +11,12 @@ A benchmark is a standard or point of reference against which things may be compared
 ### Benchmarks
 
 System Intelligence Benchmark currently includes the following example benchmarks. Some examples are still under development — we're actively updating them. Stay tuned!
+
 - **Course Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
 - **Course Project Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
 - **Cache Benchmark** ([benchmarks/cache_bench/](benchmarks/cache_bench/)) - Evaluates AI performance on cache algorithm design tasks
 - **ArtEval Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on writing Kusto Query Language (KQL) queries for platform operations
+- **SysMoBench** (`benchmarks/sysmobench/`) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems
 - **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks
 
 ## Quick Start
@@ -91,7 +93,7 @@ After understanding the basic concept, you can decide whether to add more tasks
 The easiest way to contribute is to add more tasks to existing benchmarks. For example, you can add more questions to the course exam benchmark or more projects to the course project benchmark. You can add more system algorithm design problems into algorithm design benchmark. Please follow the existing format and structure for adding new tasks. You can also improve the existing benchmarks by adding more advanced evaluators with improved metrics.
 
 ### Creating New Benchmarks
-> [!NOTE]
+> [!NOTE]
 > See [custom_benchmark.md](doc/custom_benchmark.md) for step-by-step guidelines.
 
 To create a new benchmark, follow these steps:
@@ -104,6 +106,18 @@ To create a new benchmark, follow these steps:
 3. Implement `install.sh` and `run.sh` scripts
 4. Update the benchmark list in `run_all_local.sh` and `run_docker.sh` if needed
 
+### Porting Existing Benchmarks
+> [!NOTE]
+> See [porting_benchmark.md](doc/porting_benchmark.md) for step-by-step guidelines.
+
+For integrating existing, independently-developed benchmark projects while maintaining synchronization with upstream:
+
+- Use Git Subtree/Submodule to incorporate upstream code
+- Write a bridge layer to connect upstream evaluators with the framework SDK
+- Configure bidirectional sync for pulling updates and contributing fixes
+
+**Example:** [SysMoBench](benchmarks/sysmobench/) - ported from [SysSpecBench](https://github.com/specula-org/SysSpecBench)
+
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
```
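As a rough sketch of the subtree workflow behind this port (the upstream URL is taken from the example link above; the `--prefix` path and branch names are assumptions for illustration):

```bash
# Bring the upstream benchmark in as a squashed subtree (prefix/branch assumed)
git subtree add --prefix=benchmarks/sysmobench/sysmobench_core \
    https://github.com/specula-org/SysSpecBench.git main --squash

# Pull later upstream changes into the same prefix
git subtree pull --prefix=benchmarks/sysmobench/sysmobench_core \
    https://github.com/specula-org/SysSpecBench.git main --squash

# Contribute local fixes back to an upstream branch
git subtree push --prefix=benchmarks/sysmobench/sysmobench_core \
    https://github.com/specula-org/SysSpecBench.git sysmobench-fixes
```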

benchmarks/sysmobench/.gitignore

Lines changed: 25 additions & 0 deletions
```
# Trace data files (large, can be regenerated)
sysmobench_core/data/sys_traces/
sysmobench_core/data/traces/

# Python cache
__pycache__/
*.pyc
*.pyo

# Logs
logs/

# TLA+ states
states/

# Output directories
outputs/

# Virtual environment
.venv/

# Repository caches downloaded during tasks
sysmobench_core/data/repositories/
sysmobench_core/lib
sysmobench_core/output
```

benchmarks/sysmobench/Dockerfile

Lines changed: 16 additions & 0 deletions
```dockerfile
FROM ubuntu:24.04

WORKDIR /usr/src
COPY . .

RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    wget \
    python3-pip \
    python3-venv \
    openjdk-17-jdk

RUN chmod +x install.sh test.sh && ./install.sh

# ENTRYPOINT ["./test.sh"]
```
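A hedged build-and-run sketch for this image — the `sysmobench` tag is an assumption, and because `ENTRYPOINT` is commented out, the command is passed explicitly (mirroring `entrypoint = "./run.sh"` in `env.toml`):

```bash
# Build from benchmarks/sysmobench/; the image tag is illustrative
docker build -t sysmobench .

# Run the benchmark wrapper inside the container
docker run --rm sysmobench ./run.sh gpt-4o
```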

benchmarks/sysmobench/README.md

Lines changed: 167 additions & 0 deletions
# SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

## Scenario Description

Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

![SysMoBench Overview](sysmobench_core/docs/pic/overview.png)

### Task Details

- **Input**
  - System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
  - Agent parameters (method, model name, iteration budget)
  - TLA+ evaluation configuration and trace data
- **Output**
  - Generated TLA+ specification (`GenerationResult.tla_specification`)
  - Iterative correction summaries with compilation/runtime verdicts
- **Evaluation**
  - `compilation_check` / `action_decomposition` (syntax correctness via SANY)
  - `runtime_coverage` (TLC simulation coverage)
  - `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
  - `invariant_verification` (system-specific safety/liveness properties)

### Key Features

- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates

## Benchmark Setup

### Prerequisites

- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)
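A quick sanity check of the toolchain prerequisites before installing (standard version flags):

```bash
python3 --version   # expect 3.9 or newer
java -version       # expect 11 or newer; install.sh installs OpenJDK 17 if missing
```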
### Configuration

Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```

### Install Dependencies

```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing `sysmobench`/`sysmobench-setup`), and downloads the TLA+ toolchain.
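To confirm the installation worked, the bundled CLI can be exercised directly (commands taken from the sections below):

```bash
source .venv/bin/activate
sysmobench --list-tasks   # should list the nine bundled tasks
deactivate
```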
### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl     # Line-delimited iteration summaries per task
└── avg_score.json   # Final averaged score across tasks
```
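A quick look at these outputs from the shell — the run directory name below is illustrative, not a real run:

```bash
RUN_DIR="outputs/sysmobench__gpt-4o__agent_based__20250101-120000"  # illustrative name

wc -l "$RUN_DIR/result.jsonl"    # one JSON summary per line, per task iteration
cat "$RUN_DIR/avg_score.json"    # final score averaged across tasks
```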
### Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name>
```

### Use the upstream SysMoBench CLI (optional)

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see the [Usage Guide](sysmobench_core/docs/Usage.md).

## Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|--------|------|-------------|--------------|------------|----------|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

```bash
sysmobench --list-tasks
```

## Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.

![Evaluation Workflow](sysmobench_core/docs/pic/SysMoBench.png)
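The phases map onto the metric names listed under Task Details. A hypothetical sweep over four representative metrics with the upstream CLI — flag syntax as shown above, while the task and model names are illustrative:

```bash
# Run one metric per phase against a single task (task/model chosen for illustration)
for metric in compilation_check runtime_coverage trace_validation invariant_verification; do
    sysmobench --task etcd --method agent_based --model gpt-4o --metric "$metric"
done
```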
## Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact (a hypothetical `task.yaml` sketch follows these steps):

1. **Prepare the system artifact**: Collect repository links, branch names, and any relevant materials
2. **Create a task definition**: Specify modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for the correctness properties
3. **Instrument for trace collection**: Add logging statements to the system code to collect execution traces for conformance validation

For detailed instructions, see the [Adding New Systems Guide](sysmobench_core/docs/add_new_system.md).
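A hypothetical `task.yaml` sketch to make the shape of a task definition concrete — every field name and path here is illustrative, not the actual schema (see the guide above):

```yaml
# Illustrative only; the real schema is documented in add_new_system.md.
name: myqueue                                   # hypothetical new system
system:
  repo: https://github.com/example/myqueue      # assumed repository URL
  branch: main
  language: go
prompts:
  dir: prompts/                                 # system-specific modeling prompts
invariants:
  templates: data/invariant_templates/myqueue/  # expert-written correctness properties
traces:
  dir: data/traces/myqueue/                     # execution traces for conformance checks
```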
## Project Structure

```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py         # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                   # Task definitions for each system artifact
│   │   ├── spin/                # Spinlock task with prompts, configs
│   │   │   ├── prompts/         # System-specific prompts
│   │   │   └── task.yaml        # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                  # LLM model interfaces and wrappers
│   ├── evaluation/              # Evaluator implementations organized by metric type
│   └── config.py                # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/     # Expert-written invariant templates for each system
│   └── traces/                  # System execution traces for conformance evaluation
└── lib/                         # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```
benchmarks/sysmobench/data/benchmark/tasks.jsonl

Lines changed: 9 additions & 0 deletions

```
{"id": "sysmobench_001", "task_name": "spin"}
{"id": "sysmobench_002", "task_name": "mutex"}
{"id": "sysmobench_003", "task_name": "rwmutex"}
{"id": "sysmobench_004", "task_name": "etcd"}
{"id": "sysmobench_005", "task_name": "redisraft"}
{"id": "sysmobench_006", "task_name": "curp"}
{"id": "sysmobench_007", "task_name": "dqueue"}
{"id": "sysmobench_008", "task_name": "locksvc"}
{"id": "sysmobench_009", "task_name": "raftkvs"}
```

benchmarks/sysmobench/env.toml

Lines changed: 21 additions & 0 deletions
```toml
# ===== Evaluator API Keys (Required) =====
# SysMoBench uses Claude for the evaluation workflow (e.g., trace validation, invariant generation).
# These API keys are REQUIRED for the benchmark to run.
[evaluator_api_keys]
ANTHROPIC_API_KEY = "XXX"

# ===== Model Under Test API Keys =====
# Provide API keys for the models you want to test.
[llm]
OPENAI_API_KEY = "XXX"
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXX"
AZURE_API_VERSION = "XXX"
ANTHROPIC_API_KEY = "XXX"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```

benchmarks/sysmobench/install.sh

Lines changed: 44 additions & 0 deletions
```bash
#!/bin/bash

set -e

# Ensure Java is available for TLA+ SANY/TLC.
if ! command -v java >/dev/null 2>&1; then
    echo "==> Java not found. Installing OpenJDK 17..."
    if command -v sudo >/dev/null 2>&1; then
        sudo apt update
        sudo apt install -y openjdk-17-jdk
    else
        apt update
        apt install -y openjdk-17-jdk
    fi
fi

# Print the resolved Java binary path for debugging.
readlink -f "$(which java)"
# NOTE: assumes the amd64 OpenJDK 17 layout used by the Ubuntu image.
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
java -version

echo "==> Installing SysMoBench dependencies..."

# Create (or reuse) the benchmark virtual environment.
if [ ! -d ".venv" ]; then
    echo "==> Creating .venv directory..."
    python3 -m venv .venv
fi

source .venv/bin/activate

python -m pip install --upgrade pip
pip install -r requirements.txt

# Install sysmobench_core as editable so the sysmobench/sysmobench-setup CLI entrypoints exist.
pip install -e sysmobench_core

# Download TLA+ tools (tla2tools.jar, CommunityModules, etc.).
echo "==> Downloading TLA+ tools..."
python3 sysmobench_core/tla_eval/setup_cli.py

deactivate

echo "==> SysMoBench environment is set up successfully."
```
benchmarks/sysmobench/requirements.txt

Lines changed: 4 additions & 0 deletions

```
# SysMoBench integration dependencies
# Reuse upstream requirements and add local runtime deps needed by the SDK bridge.
-r sysmobench_core/requirements.txt
litellm>=1.77.5
```

benchmarks/sysmobench/run.sh

Lines changed: 29 additions & 0 deletions
```bash
#!/bin/bash

set -e

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_name>"
    echo "Example: $0 gpt-4o"
    echo "Example: $0 claude-3-5-sonnet-20241022"
    exit 1
fi

MODEL_NAME="$1"
# Sanitize the model name for use in file paths (replace '/' with '_').
NEW_MODEL_NAME="${MODEL_NAME//\//_}"

# Activate venv if it exists
if [ -d ".venv" ]; then
    source .venv/bin/activate
fi

echo "==> Start to run SysMoBench"
python3 src/main.py \
    --model_name "${MODEL_NAME}" \
    --agent agent_based \
    --max_iterations 3

# Deactivate if we activated
if [ -d ".venv" ]; then
    deactivate
fi
```
Lines changed: 3 additions & 0 deletions
```python
"""SysMoBench integration for System Intelligence Benchmark Suite."""

__version__ = '0.1.0'
```
