83 changes: 60 additions & 23 deletions README.md
# System Intelligence Benchmark: A Benchmark Suite for Evaluating LLMs' System Capabilities

System Intelligence Benchmark is a comprehensive benchmark suite for evaluating the performance of Large Language Models (LLMs) and AI systems across critical system capabilities. It features tutorials and example benchmarks, and offers both CLI tools and an SDK for further development.

## Benchmark Overview
### Benchmark Concept
A benchmark is a standard or point of reference against which things may be compared or assessed. In the context of AI and LLMs, benchmarks are essential for evaluating model capabilities, guiding research directions, and measuring progress.

### Benchmark Framework

To advance benchmark development, we propose the System Intelligence Benchmark, a modular and extensible framework designed to support diverse research domains and problem types. As shown in the figure below, the framework comprises four abstractions: task set, environment, executor, and evaluator. Each task is associated with a specific environment, in which the executor generates a solution that is subsequently assessed by the evaluator, which returns the evaluation metrics. This design enables the flexible integration of heterogeneous agents and their systematic evaluation. Additionally, the framework includes built-in executors (agents), evaluators (methodologies and grading rubrics), and tutorials. In the ideal case, users need only supply tasks that represent specific capabilities, select an evaluator, and quickly create and run a new benchmark. See [benchmark_abstract.md](doc/benchmark_abstract.md) for details.

<img src="doc/benchmark.png" alt="Benchmark framework overview" width="600"/>

The benchmark framework is **still under development**. If you have any questions, feel free to open an issue or contact us directly.

### Benchmarks

System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!

- **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Assesses AI performance on artifact evaluation tasks
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
### Repo Structure
### Prerequisites
- Python 3.9+
- Docker (optional, for containerized execution)

> Docker images currently only support the x86_64/AMD64 architecture. ARM64 (Apple Silicon M1/M2/M3) is not yet supported.

### Installation

1. Clone the repository:

```bash
git clone https://github.com/sys-intelligence/system-intelligence-benchmark.git
cd system-intelligence-benchmark
```

2. Install dependencies for a specific benchmark:
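A minimal sketch, assuming each benchmark ships its own `install.sh` (the same per-benchmark script used in the run instructions below):

```bash
# Install the dependencies for one benchmark (run from the repository root)
cd benchmarks/<benchmark_name>
./install.sh
```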

### Running Benchmarks

#### Run All Benchmarks

To run all benchmarks sequentially:

```bash
cd cli
./run_all_local.sh <model_name>
```

#### Run a Single Benchmark

To run just one benchmark locally:

```bash
cd benchmarks/<benchmark_name>
./install.sh # Only needed the first time
./run.sh <model_name>
```
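For instance, to run the System Exam Benchmark end to end (the model identifier `gpt-4o` is illustrative; use whatever name your configuration expects):

```bash
cd benchmarks/course_exam_bench
./install.sh      # first run only
./run.sh gpt-4o
```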

#### Output Format

Benchmarks generate standardized outputs under `cli/outputs/{benchmark_name}__{model_name}__{agent}_{timestamp}/`.
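For example, a run of the System Exam Benchmark with a model named `gpt-4o` might land in a directory such as `cli/outputs/course_exam_bench__gpt-4o__default-agent_20250101-120000/` (the model name, agent label, and timestamp format here are purely illustrative).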

You can find more detailed usage guides in the CLI [README.md](cli/README.md).

## Contribute to Benchmarks

We welcome community contributions: enrich the existing benchmarks (for example, by adding more exam problems to the System Exam Benchmark or more system artifacts to the System Artifact and System Modeling Benchmarks), port your existing benchmarks, and, most importantly, create new system intelligence benchmarks with our framework. See below for detailed instructions. We believe that such collective community efforts will advance AI to its next level and help realize System Intelligence, unlocking the potential of AI-driven computing system innovations. If you are interested in contributing or already have good system benchmarks, please let us know. We have set up a [Slack channel](https://join.slack.com/t/sys-intelligence/shared_invite/zt-3hpkgr2aa-NnuPxUbyHr45S89DFi_N1A) at sys-intelligence.slack.com.

> [!NOTE]
> We suggest getting started by walking through the basic concepts of an AI benchmark: [Benchmark Abstraction](doc/benchmark_abstract.md). Once you understand these concepts, you can decide whether to contribute to existing benchmarks, port an existing benchmark, or create a new one.

### Contribute to Existing Benchmarks
The easiest way to contribute is to add more tasks to existing benchmarks. The following two are especially recommended; simply follow the linked guidelines to submit your data, and you're all set.
- **SystemExam**: If you are a professor teaching one or more courses, we highly recommend contributing **more exam problems** to SystemExam (see [this doc](https://github.com/sys-intelligence/system-intelligence-benchmark/tree/main/benchmarks/course_exam_bench#how-to-extend-the-benchmark) for step-by-step guidance).
- **SystemArtifact**: If you are a researcher submitting artifacts, or an AE chair involved in artifact evaluation, we highly recommend contributing **more system artifacts** to SystemArtifact (see [this doc](https://github.com/sys-intelligence/system-intelligence-benchmark/blob/main/benchmarks/arteval_bench/README.md) for step-by-step guidance).

In addition, you can help review the existing benchmarks and propose improvements, or enhance them directly, for example by adding more advanced evaluators or incorporating better metrics.

### Porting Existing Benchmarks
> [!NOTE]
> See [porting_benchmark.md](doc/porting_benchmark.md) for step-by-step guidelines.

To integrate an existing, independently developed benchmark project while staying in sync with its upstream repository (a minimal sketch follows the example below):

- Use Git Subtree/Submodule to incorporate upstream code
- Write a bridge layer to connect the upstream evaluators with the framework SDK
- Configure bidirectional sync for pulling updates and contributing fixes

**Example:** [SysMoBench](benchmarks/sysmobench/) - ported from [SysSpecBench](https://github.com/specula-org/SysSpecBench)
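
A minimal sketch of the subtree workflow, using the SysSpecBench repository above as the upstream (the `main` branch name and the prefix path are assumptions; adjust them to the actual upstream layout):

```bash
# One-time setup: pull the upstream benchmark into the framework as a subtree
git subtree add --prefix benchmarks/sysmobench \
    https://github.com/specula-org/SysSpecBench main --squash

# Later: pull upstream updates into the subtree
git subtree pull --prefix benchmarks/sysmobench \
    https://github.com/specula-org/SysSpecBench main --squash

# Contribute local fixes back upstream (requires push access or a fork)
git subtree push --prefix benchmarks/sysmobench \
    https://github.com/specula-org/SysSpecBench main
```

Keeping the bridge layer outside the subtree prefix ensures that upstream pulls do not overwrite it.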

### Creating New Benchmarks
> [!NOTE]
> See [creating_benchmark.md](doc/creating_benchmark.md) for step-by-step guidelines.

To create a new benchmark, follow these steps (a short sketch follows the list):
1. Create a new benchmark directory in `benchmarks/`
   1. Based on your specific requirements, select and copy an example benchmark as a starting point
   2. Update the `src/main.py` file with your specific evaluation logic (your executor and evaluator)
   3. Add test cases in the `tests/` directory
2. Update the README.md with benchmark-specific details
3. Implement the `install.sh` and `run.sh` scripts
4. Update the benchmark list in `run_all_local.sh` and `run_docker.sh` if needed
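
A minimal sketch of these steps, starting from the provided example benchmark (the directory name `my_new_bench` is hypothetical):

```bash
# Copy the example benchmark as a starting point (run from the repository root)
cp -r benchmarks/example_bench benchmarks/my_new_bench
cd benchmarks/my_new_bench

# Edit src/main.py (executor and evaluator), README.md, install.sh, and run.sh,
# add test cases under tests/, then try the benchmark locally:
./install.sh
./run.sh <model_name>
```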

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.

119 changes: 119 additions & 0 deletions benchmarks/arteval_bench/.gitignore
# ----------------------
# General
# ----------------------
*.lock
*.log
*.bak
*.pkl
*.png
*.jpg
*.jpeg
*.pdf
*.xls
*.csv
*.doc

# Logs / temp
logs/
log/
*.tmp
*.temp
*.swp
*.swo
*.orig

# Trash / scratch areas
__pycache__/
trash/

# OS files
.DS_Store
Thumbs.db

# ----------------------
# Python
# ----------------------

# Byte-compiled / optimized / DLL files
*.py[cod]
*$py.class

# Virtual environments
.venv/
venv/
env/
ENV/

# Distribution / packaging
build/
dist/
eggs/
*.egg-info/
.eggs/
pip-wheel-metadata/
*.whl

# Test / coverage
.pytest_cache/
.coverage
.coverage.*
htmlcov/
.tox/
.nox/

# Type checking / tooling
.mypy_cache/
.pyre/
.pytype/

# ----------------------
# Java
# ----------------------

# Compiled files
*.class

# Build outputs
target/
bin/
out/

# Maven / Gradle
.mvn/
.settings/
.gradle/
build/

# IDE project files
*.iml
.idea/
.project
.classpath

# Archives
*.jar
*.war
*.ear

# ----------------------
# C / C++
# ----------------------

# Object / compiled files
*.o
*.obj
*.so
*.dll
*.dylib
*.a
*.lib
*.lo

# Executables
a.out
*.exe
*.out

# Build directories
build/
cmake-build-*/