# SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

## Scenario Description

Formal models are essential for specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

### Task Details

- **Input**
  - System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
  - Agent parameters (method, model name, iteration budget)
  - TLA+ evaluation configuration and trace data
- **Output**
  - Generated TLA+ specification (`GenerationResult.tla_specification`)
  - Iterative correction summaries with compilation/runtime verdicts (the correction loop is sketched below)
- **Evaluation**
  - `compilation_check` / `action_decomposition` (syntax correctness via SANY)
  - `runtime_coverage` (TLC simulation coverage)
  - `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
  - `invariant_verification` (system-specific safety/liveness properties)
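
A minimal sketch of the generate-and-correct loop these inputs and outputs imply; the helper names `generate_spec` and `run_checks` are hypothetical stand-ins, not the actual `sysmobench_core` API:

```python
# Hypothetical sketch of the agent-based correction loop; `generate_spec` and
# `run_checks` stand in for the real sysmobench_core interfaces.
from typing import Callable, Tuple

def correction_loop(
    task_prompt: str,
    generate_spec: Callable[[str, str], str],       # (prompt, feedback) -> TLA+ source
    run_checks: Callable[[str], Tuple[bool, str]],  # spec -> (passed, checker feedback)
    max_iterations: int = 3,
) -> Tuple[str, bool]:
    """Generate a TLA+ spec, re-prompting with checker feedback until it passes."""
    spec, feedback = "", ""
    for _ in range(max_iterations):
        spec = generate_spec(task_prompt, feedback)
        passed, feedback = run_checks(spec)  # e.g. SANY syntax + TLC runtime verdicts
        if passed:
            return spec, True
    return spec, False
```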

### Key Features

- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates

## Benchmark Setup

### Prerequisites

- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)

### Configuration

Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```
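
If you need the same credentials in your own scripts, a minimal sketch that loads `env.toml` and exports the `[evaluator_api_keys]` and `[llm]` tables as environment variables (the benchmark's own `config.py` may load this file differently):

```python
# Minimal sketch: export the env.toml credential tables as environment variables.
# Assumes the tables shown above; sysmobench's config.py may differ.
import os
import tomllib  # stdlib on Python 3.11+; on 3.9/3.10, `pip install tomli` and `import tomli as tomllib`

with open("benchmarks/sysmobench/env.toml", "rb") as f:
    cfg = tomllib.load(f)

for section in ("evaluator_api_keys", "llm"):
    for key, value in cfg.get(section, {}).items():
        if isinstance(value, str):
            os.environ.setdefault(key, value)
```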

### Install Dependencies

```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing the `sysmobench`/`sysmobench-setup` commands), and downloads the TLA+ toolchain.

### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl    # Line-delimited iteration summaries per task
└── avg_score.json  # Final averaged score across tasks
```
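
To post-process a run directory, something like the following works; the run-directory name and the `task`/`score` record fields are illustrative assumptions rather than a documented schema:

```python
# Sketch: summarize one run directory. The directory name and the
# `task`/`score` fields are assumptions, not a documented schema.
import json
from pathlib import Path

run_dir = Path("outputs/sysmobench__my-model__agent_based__20250101-000000")

print("average score:", json.loads((run_dir / "avg_score.json").read_text()))

with open(run_dir / "result.jsonl") as f:
    for line in f:
        record = json.loads(line)  # one iteration summary per line
        print(record.get("task"), "->", record.get("score"))
```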

### Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name>
```

### Use the upstream SysMoBench CLI (optional)

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see the [Usage Guide](sysmobench_core/docs/Usage.md).

## Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|--------|------|-------------|--------------|------------|----------|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

```bash
sysmobench --list-tasks
```

## Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.
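
To run the phases individually, you can loop over the metric flags from the Task Details list using the upstream CLI; a sketch (the model name is illustrative):

```python
# Sketch: drive the four evaluation phases through the upstream CLI.
# Metric names come from the Task Details list; the model name is illustrative.
import subprocess

METRICS = [
    "compilation_check",       # syntax correctness via SANY
    "runtime_coverage",        # TLC simulation coverage
    "trace_validation",        # conformance to real system traces
    "invariant_verification",  # safety/liveness properties
]

for metric in METRICS:
    subprocess.run(
        ["sysmobench", "--task", "spin", "--method", "agent_based",
         "--model", "my-model", "--metric", metric],
        check=True,
    )
```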

## Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact:

1. **Prepare system artifact**: Collect repository links, branch names, and any relevant materials
2. **Create task definition**: Specify modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for correctness properties (a sanity-check sketch follows this list)
3. **Instrument for trace collection**: Add logging statements to the system code to collect execution traces for conformance validation
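
As a starting point for step 2, a hypothetical sanity check for a new `task.yaml`; the required keys here are illustrative guesses, not the schema `sysmobench_core` actually enforces:

```python
# Hypothetical sanity check for a new task definition. REQUIRED_KEYS are
# illustrative guesses, not the schema enforced by sysmobench_core.
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "repo_url", "branch", "prompts", "invariants"}

with open("sysmobench_core/tla_eval/tasks/mysystem/task.yaml") as f:
    task = yaml.safe_load(f)

missing = REQUIRED_KEYS - set(task)
if missing:
    raise SystemExit(f"task.yaml is missing fields: {sorted(missing)}")
print("task.yaml looks complete")
```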

For detailed instructions, see the [Adding New Systems Guide](sysmobench_core/docs/add_new_system.md).

## Project Structure

```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py        # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                  # Task definitions for each system artifact
│   │   ├── spin/               # Spinlock task with prompts, configs
│   │   │   ├── prompts/        # System-specific prompts
│   │   │   └── task.yaml       # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                 # LLM model interfaces and wrappers
│   ├── evaluation/             # Evaluator implementations organized by metric type
│   └── config.py               # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/    # Expert-written invariant templates for each system
│   └── traces/                 # System execution traces for conformance evaluation
└── lib/                        # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```