# SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

## Scenario Description

Formal models are essential for specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. SysMoBench evaluates whether LLM-powered agents can translate realistic system artifacts into executable TLA+ specifications that pass rigorous syntactic, semantic, and invariant checks.

### Task Details

- **Input**
  - System source code plus task-specific prompts and invariants from `sysmobench_core/tla_eval/tasks/<task_name>/`
  - Agent parameters (method, model name, iteration budget)
  - TLA+ evaluation configuration and trace data
- **Output**
  - Generated TLA+ specification (`GenerationResult.tla_specification`)
  - Iterative correction summaries with compilation/runtime verdicts (the correction loop is sketched below)
- **Evaluation**
  - `compilation_check` / `action_decomposition` (syntax correctness via SANY)
  - `runtime_coverage` (TLC simulation coverage)
  - `trace_validation` / `pgo_trace_validation` (conformance to real system traces)
  - `invariant_verification` (system-specific safety/liveness properties)
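
A minimal sketch of the generate-and-correct loop these inputs and outputs imply; the helper names `generate_spec` and `run_checks` are hypothetical stand-ins, not the actual `sysmobench_core` API:

```python
# Hypothetical sketch of the agent-based correction loop; `generate_spec` and
# `run_checks` stand in for the real sysmobench_core interfaces.
from typing import Callable, Tuple

def correction_loop(
    task_prompt: str,
    generate_spec: Callable[[str, str], str],       # (prompt, feedback) -> TLA+ source
    run_checks: Callable[[str], Tuple[bool, str]],  # spec -> (passed, checker feedback)
    max_iterations: int = 3,
) -> Tuple[str, bool]:
    """Generate a TLA+ spec, re-prompting with checker feedback until it passes."""
    spec, feedback = "", ""
    for _ in range(max_iterations):
        spec = generate_spec(task_prompt, feedback)
        passed, feedback = run_checks(spec)  # e.g. SANY syntax + TLC runtime verdicts
        if passed:
            return spec, True
    return spec, False
```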

### Key Features

- Automated four-phase evaluation pipeline covering syntax, runtime, trace conformance, and invariants
- Nine real-world concurrent and distributed systems (Etcd Raft, Redis Raft, Asterinas primitives, Xline CURP, PGo suites)
- Extensible configuration for adding new systems via prompts, traces, and invariant templates

## Benchmark Setup

### Prerequisites

- Python 3.9+
- Java 11+ (SANY/TLC binaries are downloaded during installation)
- Anthropic evaluator key (required for trace/invariant workflows)
- LLM credentials for the model under test (OpenAI, Azure OpenAI, Anthropic, etc.)

### Configuration

Edit `benchmarks/sysmobench/env.toml`:

```toml
[evaluator_api_keys]
ANTHROPIC_API_KEY = "your_evaluator_key"

[llm]
OPENAI_API_KEY = "your_openai_key"
AZURE_API_KEY = "your_azure_key"
AZURE_API_BASE = "https://your-azure-endpoint.openai.azure.com"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "your_model_key"

[hardware]
use_gpu = false

[env-docker]
image = "default"
entrypoint = "./run.sh"
```
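
If you need the same credentials in your own scripts, a minimal sketch that loads `env.toml` and exports the `[evaluator_api_keys]` and `[llm]` tables as environment variables (the benchmark's own `config.py` may load this file differently):

```python
# Minimal sketch: export the env.toml credential tables as environment variables.
# Assumes the tables shown above; sysmobench's config.py may differ.
import os
import tomllib  # stdlib on Python 3.11+; on 3.9/3.10, `pip install tomli` and `import tomli as tomllib`

with open("benchmarks/sysmobench/env.toml", "rb") as f:
    cfg = tomllib.load(f)

for section in ("evaluator_api_keys", "llm"):
    for key, value in cfg.get(section, {}).items():
        if isinstance(value, str):
            os.environ.setdefault(key, value)
```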

### Install Dependencies

```bash
cd benchmarks/sysmobench
./install.sh
```

This script installs OpenJDK when necessary, creates `.venv/`, installs `benchmarks/sysmobench/requirements.txt`, registers `sysmobench_core` in editable mode (exposing the `sysmobench`/`sysmobench-setup` commands), and downloads the TLA+ toolchain.

### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper executes `src/main.py` with the default task list from `data/benchmark/tasks.jsonl`, iterating up to three correction rounds per task. Results are stored at:

```
outputs/sysmobench__<model>__agent_based__<timestamp>/
├── result.jsonl    # Line-delimited iteration summaries per task
└── avg_score.json  # Final averaged score across tasks
```
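
To post-process a run directory, something like the following works; the run-directory name and the `task`/`score` record fields are illustrative assumptions rather than a documented schema:

```python
# Sketch: summarize one run directory. The directory name and the
# `task`/`score` fields are assumptions, not a documented schema.
import json
from pathlib import Path

run_dir = Path("outputs/sysmobench__my-model__agent_based__20250101-000000")

print("average score:", json.loads((run_dir / "avg_score.json").read_text()))

with open(run_dir / "result.jsonl") as f:
    for line in f:
        record = json.loads(line)  # one iteration summary per line
        print(record.get("task"), "->", record.get("score"))
```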

### Use the System Intelligence CLI (optional)

To orchestrate SysMoBench alongside other benchmarks:

```bash
cd cli
./run_all_local.sh <model_name>
```

### Use the upstream SysMoBench CLI (optional)

```bash
cd benchmarks/sysmobench
source .venv/bin/activate
sysmobench --task spin --method agent_based --model <model> --metric compilation_check
sysmobench --list-tasks
deactivate
```

For exhaustive CLI flag descriptions, see the [Usage Guide](sysmobench_core/docs/Usage.md).

## Benchmark Tasks

SysMoBench includes nine diverse real-world system artifacts from concurrent and distributed systems:

| System | Type | Description | Source Lang. | Source LoC | TLA+ LoC |
|--------|------|-------------|--------------|------------|----------|
| Asterinas Spinlock | Concurrent | Synchronization | Rust | 213 | 151 |
| Asterinas Mutex | Concurrent | Synchronization | Rust | 186 | 219 |
| Asterinas Rwmutex | Concurrent | Synchronization | Rust | 395 | 250 |
| Etcd Raft | Distributed | Consensus (Raft) | Go | 2,159 | 385 |
| Redis Raft | Distributed | Consensus (Raft) | C | 2,394 | 349 |
| Xline CURP | Distributed | Replication (CURP) | Rust | 4,064 | 100 |
| PGo dqueue | Distributed | Distributed Queue | Go | 175 | 75 |
| PGo locksvc | Distributed | Lock Server | Go | 281 | 93 |
| PGo raftkvs | Distributed | Consensus (Raft) | Go | 3,163 | 508 |

To list all available tasks:

```bash
sysmobench --list-tasks
```

## Evaluation Metrics

SysMoBench evaluates AI-generated TLA+ models in four automated phases, each with its own metrics: syntax correctness, runtime correctness, conformance to the system implementation, and invariant correctness.
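
To run the phases individually, you can loop over the metric flags from the Task Details list using the upstream CLI; a sketch (the model name is illustrative):

```python
# Sketch: drive the four evaluation phases through the upstream CLI.
# Metric names come from the Task Details list; the model name is illustrative.
import subprocess

METRICS = [
    "compilation_check",       # syntax correctness via SANY
    "runtime_coverage",        # TLC simulation coverage
    "trace_validation",        # conformance to real system traces
    "invariant_verification",  # safety/liveness properties
]

for metric in METRICS:
    subprocess.run(
        ["sysmobench", "--task", "spin", "--method", "agent_based",
         "--model", "my-model", "--metric", metric],
        check=True,
    )
```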

## Adding New Systems

SysMoBench is designed to be extensible. To add a new system artifact:

1. **Prepare system artifact**: Collect repository links, branch names, and any relevant materials
2. **Create task definition**: Specify modeling requirements, task configuration, and related files in `task.yaml`, and define invariant templates for correctness properties (a sanity-check sketch follows this list)
3. **Instrument for trace collection**: Add logging statements to the system code to collect execution traces for conformance validation
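
As a starting point for step 2, a hypothetical sanity check for a new `task.yaml`; the required keys here are illustrative guesses, not the schema `sysmobench_core` actually enforces:

```python
# Hypothetical sanity check for a new task definition. REQUIRED_KEYS are
# illustrative guesses, not the schema enforced by sysmobench_core.
import yaml  # pip install pyyaml

REQUIRED_KEYS = {"name", "repo_url", "branch", "prompts", "invariants"}

with open("sysmobench_core/tla_eval/tasks/mysystem/task.yaml") as f:
    task = yaml.safe_load(f)

missing = REQUIRED_KEYS - set(task)
if missing:
    raise SystemExit(f"task.yaml is missing fields: {sorted(missing)}")
print("task.yaml looks complete")
```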

For detailed instructions, see the [Adding New Systems Guide](sysmobench_core/docs/add_new_system.md).

## Project Structure

```
LLM_Gen_TLA_benchmark_framework/
├── scripts/
│   └── run_benchmark.py        # Main entry script for running benchmarks
├── tla_eval/
│   ├── tasks/                  # Task definitions for each system artifact
│   │   ├── spin/               # Spinlock task with prompts, configs
│   │   │   ├── prompts/        # System-specific prompts
│   │   │   └── task.yaml       # Task configuration (system info)
│   │   ├── mutex/
│   │   └── ...
│   ├── models/                 # LLM model interfaces and wrappers
│   ├── evaluation/             # Evaluator implementations organized by metric type
│   └── config.py               # Configuration management (API keys, LLM model configs)
├── data/
│   ├── invariant_templates/    # Expert-written invariant templates for each system
│   └── traces/                 # System execution traces for conformance evaluation
└── lib/                        # TLA+ toolchain (tla2tools.jar for SANY and TLC)
```