2 changes: 2 additions & 0 deletions README.md
@@ -11,10 +11,12 @@ A benchmark is a standard or point of reference against which things may be comp
### Benchmarks

System Intelligence Benchmark currently includes the following example benchmarks. Some examples are still under development — we're actively updating them. Stay tuned!

- **Course Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **Course Project Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
- **Cache Benchmark** ([benchmarks/cache_bench/](benchmarks/cache_bench/)) - Evaluates AI performance on cache algorithm design tasks
- **ArtEval Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on writing Kusto Query Language (KQL) queries for platform operations
- **SysMoBench** (`benchmarks/sysmobench/`) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
25 changes: 25 additions & 0 deletions benchmarks/sysmobench/.gitignore
@@ -0,0 +1,25 @@
# Trace data files (large, can be regenerated)
sysmobench_core/data/sys_traces/
sysmobench_core/data/traces/

# Python cache
__pycache__/
*.pyc
*.pyo

# Logs
logs/

# TLA+ states
states/

# Output directories
outputs/

# Virtual environment
.venv/

# Repository caches downloaded during tasks
sysmobench_core/data/repositories/
sysmobench_core/lib
sysmobench_core/output
16 changes: 16 additions & 0 deletions benchmarks/sysmobench/Dockerfile
@@ -0,0 +1,16 @@
FROM ubuntu:24.04

WORKDIR /usr/src
COPY . .

RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    wget \
    python3-pip \
    python3-venv \
    openjdk-17-jdk

RUN chmod +x install.sh test.sh && ./install.sh

# ENTRYPOINT ["./test.sh"]
167 changes: 167 additions & 0 deletions benchmarks/sysmobench/README.md
@@ -0,0 +1,167 @@
# SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

## Scenario Description

Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

![SysMoBench Overview](sysmobench_core/docs/pic/overview.png)

### Task Details

- **Input**
- System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
- Agent parameters (method, model name, iteration budget)
- TLA+ evaluation configuration and trace data
- **Output**
- Generated TLA+ specification (`GenerationResult.tla_specification`)
- Iterative correction summaries with compilation/runtime verdicts
- **Evaluation**
- `compilation_check` / `action_decomposition` (syntax correctness via SANY)
- `runtime_coverage` (TLC simulation coverage)
- `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
- `invariant_verification` (system-specific safety/liveness properties)

### Key Features

- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates

## Benchmark Setup

### Prerequisites

- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)

### Configuration

Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```

### Install Dependencies

```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing `sysmobench`/`sysmobench-setup`), and downloads the TLA+ toolchain.

### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl # Line-delimited iteration summaries per task
└── avg_score.json # Final averaged score across tasks
```
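Once a run finishes, both artifacts can be collected with a few lines of Python. A hedged sketch: the file names come from the layout above, but any fields inside the individual records are an assumption, not a documented schema.

```python
# Sketch: gather a finished run's artifacts from its output directory.
import json
from pathlib import Path

def summarize_run(run_dir):
    """Return (per-task iteration records, averaged score dict) for a run."""
    run = Path(run_dir)
    # result.jsonl is line-delimited JSON: one record per line.
    records = [
        json.loads(line)
        for line in (run / "result.jsonl").read_text().splitlines()
        if line.strip()
    ]
    avg = json.loads((run / "avg_score.json").read_text())
    return records, avg
```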

### Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name>
```

### Using the upstream SysMoBench CLI (optional)

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see [Usage Guide](sysmobench_core/docs/Usage.md).

## Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|--------|------|-------------|--------------|------------|----------|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

```bash
sysmobench --list-tasks
```
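Alternatively, the default task list can be read straight from the file that `run.sh` feeds to `src/main.py`. A small sketch, assuming only that each line of `data/benchmark/tasks.jsonl` is a JSON object with `id` and `task_name` fields:

```python
# Sketch: parse the line-delimited default task list into Python dicts.
import json

def load_tasks(path="data/benchmark/tasks.jsonl"):
    """Return the task records, skipping any blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```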

## Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.

![Evaluation Workflow](sysmobench_core/docs/pic/SysMoBench.png)
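The first phase can be pictured as a thin wrapper around the SANY parser. The sketch below is illustrative rather than the benchmark's actual implementation; the jar path is an assumption based on the `lib/` directory in the project structure.

```python
# Hedged sketch of a compilation_check-style step: run SANY from
# tla2tools.jar on a spec and report whether it parsed.
import subprocess

def sany_parses(spec_path, jar="lib/tla2tools.jar"):
    """Run the SANY parser; return (ok, combined output)."""
    try:
        proc = subprocess.run(
            ["java", "-cp", jar, "tla2sany.SANY", spec_path],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # No Java on PATH: report failure instead of crashing.
        return False, "java not found on PATH"
    # SANY exits non-zero (and prints errors) when the module fails to parse.
    return proc.returncode == 0, proc.stdout + proc.stderr
```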


## Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact:

1. **Prepare the system artifact**: Collect repository links, branch names, and any other relevant materials
2. **Create a task definition**: Specify the modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for the correctness properties
3. **Instrument for trace collection**: Add logging statements to the system code to collect execution traces for conformance validation

For detailed instructions, see [Adding New Systems Guide](sysmobench_core/docs/add_new_system.md).


## Project Structure

```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py        # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                  # Task definitions for each system artifact
│   │   ├── spin/               # Spinlock task with prompts, configs
│   │   │   ├── prompts/        # System-specific prompts
│   │   │   └── task.yaml       # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                 # LLM model interfaces and wrappers
│   ├── evaluation/             # Evaluator implementations organized by metric type
│   └── config.py               # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/    # Expert-written invariant templates for each system
│   └── traces/                 # System execution traces for conformance evaluation
└── lib/                        # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```
9 changes: 9 additions & 0 deletions benchmarks/sysmobench/data/benchmark/tasks.jsonl
@@ -0,0 +1,9 @@
{"id": "sysmobench_001", "task_name": "spin"}
{"id": "sysmobench_002", "task_name": "mutex"}
{"id": "sysmobench_003", "task_name": "rwmutex"}
{"id": "sysmobench_004", "task_name": "etcd"}
{"id": "sysmobench_005", "task_name": "redisraft"}
{"id": "sysmobench_006", "task_name": "curp"}
{"id": "sysmobench_007", "task_name": "dqueue"}
{"id": "sysmobench_008", "task_name": "locksvc"}
{"id": "sysmobench_009", "task_name": "raftkvs"}
21 changes: 21 additions & 0 deletions benchmarks/sysmobench/env.toml
@@ -0,0 +1,21 @@
# ===== Evaluator API Keys (Required) =====
# SysMoBench uses Claude for evaluation workflow (e.g., trace validation, invariant generation)
# These API keys are REQUIRED for the benchmark to run
[evaluator_api_keys]
ANTHROPIC_API_KEY = "XXX"

# ===== Model Under Test API Keys =====
# Provide API keys for the models you want to test
[llm]
OPENAI_API_KEY = "XXX"
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXX"
AZURE_API_VERSION = "XXX"
ANTHROPIC_API_KEY = "XXX"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
44 changes: 44 additions & 0 deletions benchmarks/sysmobench/install.sh
@@ -0,0 +1,44 @@
#!/bin/bash

set -e

# Ensure Java is available for TLA+ SANY/TLC.
if ! command -v java >/dev/null 2>&1; then
    echo "==> Java not found. Installing OpenJDK 17..."
    if command -v sudo >/dev/null 2>&1; then
        sudo apt update
        sudo apt install -y openjdk-17-jdk
    else
        apt update
        apt install -y openjdk-17-jdk
    fi
fi

JAVA_BIN="$(readlink -f "$(which java)")"
# Derive JAVA_HOME from the resolved binary so the path is not hard-coded to amd64.
export JAVA_HOME="${JAVA_BIN%/bin/java}"
export PATH="$JAVA_HOME/bin:$PATH"
java -version

echo "==> Installing SysMoBench dependencies..."

# Create (or reuse) the benchmark virtual environment.
if [ ! -d ".venv" ]; then
    echo "==> Creating .venv directory..."
    python3 -m venv .venv
fi

source .venv/bin/activate

python -m pip install --upgrade pip
pip install -r requirements.txt

# Install sysmobench_core as editable so sysmobench/sysmobench-setup CLI entrypoints exist.
pip install -e sysmobench_core

# Download TLA+ tools (tla2tools.jar, CommunityModules, etc.).
echo "==> Downloading TLA+ tools..."
python3 sysmobench_core/tla_eval/setup_cli.py

deactivate

echo "==> SysMoBench environment is set up successfully."
4 changes: 4 additions & 0 deletions benchmarks/sysmobench/requirements.txt
@@ -0,0 +1,4 @@
# SysMoBench integration dependencies
# Reuse upstream requirements and add local runtime deps needed by the SDK bridge.
-r sysmobench_core/requirements.txt
litellm>=1.77.5
29 changes: 29 additions & 0 deletions benchmarks/sysmobench/run.sh
@@ -0,0 +1,29 @@
#!/bin/bash

set -e

if [ $# -ne 1 ]; then
    echo "Usage: $0 <model_name>"
    echo "Example: $0 gpt-4o"
    echo "Example: $0 claude-3-5-sonnet-20241022"
    exit 1
fi

MODEL_NAME="$1"
NEW_MODEL_NAME="${MODEL_NAME//\//_}"

# Activate venv if it exists
if [ -d ".venv" ]; then
    source .venv/bin/activate
fi

echo "==> Starting SysMoBench run for ${MODEL_NAME}"
python3 src/main.py \
    --model_name "${MODEL_NAME}" \
    --agent agent_based \
    --max_iterations 3

# Deactivate if we activated
if [ -d ".venv" ]; then
    deactivate
fi
3 changes: 3 additions & 0 deletions benchmarks/sysmobench/src/__init__.py
@@ -0,0 +1,3 @@
"""SysMoBench integration for System Intelligence Benchmark Suite."""

__version__ = '0.1.0'