2 changes: 2 additions & 0 deletions README.md
@@ -11,10 +11,12 @@ A benchmark is a standard or point of reference against which things may be comp
### Benchmarks

System Intelligence Benchmark currently includes the following example benchmarks. Some examples are still under development — we're actively updating them. Stay tuned!

- **Course Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **Course Project Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
- **Cache Benchmark** ([benchmarks/cache_bench/](benchmarks/cache_bench/)) - Evaluates AI performance on cache algorithm design tasks
- **ArtEval Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on writing Kusto Query Language (KQL) queries for platform operations
- **SysMoBench** (`benchmarks/sysmobench/`) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
25 changes: 25 additions & 0 deletions benchmarks/sysmobench/.gitignore
@@ -0,0 +1,25 @@
# Trace data files (large, can be regenerated)
sysmobench_core/data/sys_traces/
sysmobench_core/data/traces/

# Python cache
__pycache__/
*.pyc
*.pyo

# Logs
logs/

# TLA+ states
states/

# Output directories
outputs/

# Virtual environment
.venv/

# Repository caches downloaded during tasks
sysmobench_core/data/repositories/
sysmobench_core/lib
sysmobench_core/output
16 changes: 16 additions & 0 deletions benchmarks/sysmobench/Dockerfile
@@ -0,0 +1,16 @@
FROM ubuntu:24.04

WORKDIR /usr/src
COPY . .

RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    wget \
    python3-pip \
    python3-venv \
    openjdk-17-jdk

RUN chmod +x install.sh test.sh && ./install.sh

# ENTRYPOINT ["./test.sh"]
167 changes: 167 additions & 0 deletions benchmarks/sysmobench/README.md
@@ -0,0 +1,167 @@
# SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

## Scenario Description

Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

![SysMoBench Overview](sysmobench_core/docs/pic/overview.png)

### Task Details

- **Input**
- System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
- Agent parameters (method, model name, iteration budget)
- TLA+ evaluation configuration and trace data
- **Output**
- Generated TLA+ specification (`GenerationResult.tla_specification`)
- Iterative correction summaries with compilation/runtime verdicts
- **Evaluation**
- `compilation_check` / `action_decomposition` (syntax correctness via SANY)
- `runtime_coverage` (TLC simulation coverage)
- `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
- `invariant_verification` (system-specific safety/liveness properties)

### Key Features

- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates

## Benchmark Setup

### Prerequisites

- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)

### Configuration

Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```

### Install Dependencies

```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing `sysmobench`/`sysmobench-setup`), and downloads the TLA+ toolchain.

### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl # Line-delimited iteration summaries per task
└── avg_score.json # Final averaged score across tasks
```
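Once a run finishes, both artifacts can be collected with a few lines of Python. A hedged sketch: the file names come from the layout above, but any fields inside the individual records are an assumption, not a documented schema.

```python
# Sketch: gather a finished run's artifacts from its output directory.
import json
from pathlib import Path

def summarize_run(run_dir):
    """Return (per-task iteration records, averaged score dict) for a run."""
    run = Path(run_dir)
    # result.jsonl is line-delimited JSON: one record per line.
    records = [
        json.loads(line)
        for line in (run / "result.jsonl").read_text().splitlines()
        if line.strip()
    ]
    avg = json.loads((run / "avg_score.json").read_text())
    return records, avg
```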

### Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name>
```

### Using the upstream SysMoBench CLI (optional)

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see [Usage Guide](sysmobench_core/docs/Usage.md).

## Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|--------|------|-------------|--------------|------------|----------|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

```bash
sysmobench --list-tasks
```
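Alternatively, the default task list can be read straight from the file that `run.sh` feeds to `src/main.py`. A small sketch, assuming only that each line of `data/benchmark/tasks.jsonl` is a JSON object with `id` and `task_name` fields:

```python
# Sketch: parse the line-delimited default task list into Python dicts.
import json

def load_tasks(path="data/benchmark/tasks.jsonl"):
    """Return the task records, skipping any blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```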

## Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.

![Evaluation Workflow](sysmobench_core/docs/pic/SysMoBench.png)
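The first phase can be pictured as a thin wrapper around the SANY parser. The sketch below is illustrative rather than the benchmark's actual implementation; the jar path is an assumption based on the `lib/` directory in the project structure.

```python
# Hedged sketch of a compilation_check-style step: run SANY from
# tla2tools.jar on a spec and report whether it parsed.
import subprocess

def sany_parses(spec_path, jar="lib/tla2tools.jar"):
    """Run the SANY parser; return (ok, combined output)."""
    try:
        proc = subprocess.run(
            ["java", "-cp", jar, "tla2sany.SANY", spec_path],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # No Java on PATH: report failure instead of crashing.
        return False, "java not found on PATH"
    # SANY exits non-zero (and prints errors) when the module fails to parse.
    return proc.returncode == 0, proc.stdout + proc.stderr
```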


## Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact:

1. **Prepare the system artifact**: Collect repository links, branch names, and any other relevant materials
2. **Create a task definition**: Specify the modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for the correctness properties
3. **Instrument for trace collection**: Add logging statements to the system code to collect execution traces for conformance validation

For detailed instructions, see [Adding New Systems Guide](sysmobench_core/docs/add_new_system.md).


## Project Structure

```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py        # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                  # Task definitions for each system artifact
│   │   ├── spin/               # Spinlock task with prompts, configs
│   │   │   ├── prompts/        # System-specific prompts
│   │   │   └── task.yaml       # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                 # LLM model interfaces and wrappers
│   ├── evaluation/             # Evaluator implementations organized by metric type
│   └── config.py               # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/    # Expert-written invariant templates for each system
│   └── traces/                 # System execution traces for conformance evaluation
└── lib/                        # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```
9 changes: 9 additions & 0 deletions benchmarks/sysmobench/data/benchmark/tasks.jsonl
@@ -0,0 +1,9 @@
{"id": "sysmobench_001", "task_name": "spin"}
{"id": "sysmobench_002", "task_name": "mutex"}
{"id": "sysmobench_003", "task_name": "rwmutex"}
{"id": "sysmobench_004", "task_name": "etcd"}
{"id": "sysmobench_005", "task_name": "redisraft"}
{"id": "sysmobench_006", "task_name": "curp"}
{"id": "sysmobench_007", "task_name": "dqueue"}
{"id": "sysmobench_008", "task_name": "locksvc"}
{"id": "sysmobench_009", "task_name": "raftkvs"}
21 changes: 21 additions & 0 deletions benchmarks/sysmobench/env.toml
@@ -0,0 +1,21 @@
# ===== Evaluator API Keys (Required) =====
# SysMoBench uses Claude for evaluation workflow (e.g., trace validation, invariant generation)
# These API keys are REQUIRED for the benchmark to run
[evaluator_api_keys]
ANTHROPIC_API_KEY = "XXX"

# ===== Model Under Test API Keys =====
# Provide API keys for the models you want to test
[llm]
OPENAI_API_KEY = "XXX"
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXX"
AZURE_API_VERSION = "XXX"
ANTHROPIC_API_KEY = "XXX"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
44 changes: 44 additions & 0 deletions benchmarks/sysmobench/install.sh
@@ -0,0 +1,44 @@
#!/bin/bash

set -e

# Ensure Java is available for TLA+ SANY/TLC.
if ! command -v java >/dev/null 2>&1; then
    echo "==> Java not found. Installing OpenJDK 17..."
    if command -v sudo >/dev/null 2>&1; then
        sudo apt update
        sudo apt install -y openjdk-17-jdk
    else
        apt update
        apt install -y openjdk-17-jdk
    fi
fi

JAVA_BIN="$(readlink -f "$(which java)")"
# Derive JAVA_HOME from the resolved binary so the path is not hard-coded to amd64.
export JAVA_HOME="${JAVA_BIN%/bin/java}"
export PATH="$JAVA_HOME/bin:$PATH"
java -version

echo "==> Installing SysMoBench dependencies..."

# Create (or reuse) the benchmark virtual environment.
if [ ! -d ".venv" ]; then
    echo "==> Creating .venv directory..."
    python3 -m venv .venv
fi

source .venv/bin/activate

python -m pip install --upgrade pip
pip install -r requirements.txt

# Install sysmobench_core as editable so sysmobench/sysmobench-setup CLI entrypoints exist.
pip install -e sysmobench_core

# Download TLA+ tools (tla2tools.jar, CommunityModules, etc.).
echo "==> Downloading TLA+ tools..."
python3 sysmobench_core/tla_eval/setup_cli.py

deactivate

echo "==> SysMoBench environment is set up successfully."
4 changes: 4 additions & 0 deletions benchmarks/sysmobench/requirements.txt
@@ -0,0 +1,4 @@
# SysMoBench integration dependencies
# Reuse upstream requirements and add local runtime deps needed by the SDK bridge.
-r sysmobench_core/requirements.txt
litellm>=1.77.5
29 changes: 29 additions & 0 deletions benchmarks/sysmobench/run.sh
@@ -0,0 +1,29 @@
#!/bin/bash

set -e

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_name>"
    echo "Example: $0 gpt-4o"
    echo "Example: $0 claude-3-5-sonnet-20241022"
    exit 1
fi

MODEL_NAME="$1"
NEW_MODEL_NAME="${MODEL_NAME//\//_}"

# Activate venv if it exists
if [ -d ".venv" ]; then
    source .venv/bin/activate
fi

echo "==> Starting SysMoBench run for ${MODEL_NAME}"
python3 src/main.py \
    --model_name "${MODEL_NAME}" \
    --agent agent_based \
    --max_iterations 3

# Deactivate if we activated
if [ -d ".venv" ]; then
    deactivate
fi
3 changes: 3 additions & 0 deletions benchmarks/sysmobench/src/__init__.py
@@ -0,0 +1,3 @@
"""SysMoBench integration for System Intelligence Benchmark Suite."""

__version__ = '0.1.0'