README.md (1 addition, 1 deletion)

@@ -18,7 +18,7 @@ The benchmark framework is **still under development**. If you have any question
 System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
 
 - **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
-- **System Lab Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
+- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
 - **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
 - **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
 - **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks
doc/creating_benchmark.md (7 additions, 7 deletions)

@@ -14,9 +14,9 @@ Before creating a custom benchmark, ensure you have:
 
 Choose an example benchmark that **is similar to** your setting as a starting point.
 
-If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
+If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [algo_cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/algo_cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
 
-Use [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench), if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need different runing environments. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/add_agents.md).
+Use [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_lab_bench) if your benchmark involves **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need a different running environment. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/add_agents.md).
 
 1. Navigate to the benchmarks directory:
 
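A note on the "minimal agent (an LLM call plus a response parser)" mentioned above: the sketch below is a hypothetical illustration of that pattern, not code from the repository; the OpenAI client, model name, prompt wording, and answer format are all assumptions.

```python
# Hypothetical sketch of a "minimal agent": one LLM call plus a response parser.
# Assumes the openai package is installed and OPENAI_API_KEY is set; the model
# name and the "Answer: <letter>" convention are illustrative, not repo-defined.
import re

from openai import OpenAI

client = OpenAI()


def minimal_agent(question: str) -> str:
    """Ask one exam-style question and parse out the final answer letter."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nEnd your reply with 'Answer: <letter>'.",
        }],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"Answer:\s*([A-Z])", text)
    return match.group(1) if match else text.strip()
```

Tasks in `course_lab_bench`, by contrast, need tool use (file editing, shell access) and a persistent environment, which is why that benchmark ships with full agents such as Claude Code and MiniSWEAgent.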
@@ -72,7 +72,7 @@ Create your evaluation dataset in a structured format:
 - `user_prompt`: User query/task description
 - `response`: Expected/ground truth response
 
-3.**NOTES:** for more complex scenarios, you can use **any custom formats**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/data/benchmark/env_setup_examples.jsonl) for examples.
+3. **NOTES:** For more complex scenarios, you can use **any custom format**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/env_setup_examples.jsonl) for examples.
 
 ## Step 3: Select or Implement Your Executor and Evaluator
 
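To make the `user_prompt`/`response` format above concrete, here is a minimal sketch of writing and reading one JSONL record; the file name and the extra `id` field are illustrative assumptions rather than a required schema (see the linked `questions.jsonl` and `env_setup_examples.jsonl` for the real formats).

```python
# Illustrative only: one JSONL record using the user_prompt/response fields above.
# The file name and the extra "id" field are assumptions, not a required schema.
import json

record = {
    "id": "example-001",  # hypothetical extra field
    "user_prompt": "Explain the difference between a process and a thread.",
    "response": "A process has its own address space; threads in a process share one.",
}

# Append the record as one line of JSON.
with open("benchmark.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

# Read the dataset back, one task per line.
with open("benchmark.jsonl", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]
```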
@@ -254,8 +254,8 @@ class CustomEvaluator(Evaluator):
 
 - **`example_bench/src/main.py`**: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
 - **`course_exam_bench/`**: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
-- **`cache_bench/`**: Uses custom evaluator for code execution and performance testing
-- **`course_project_bench/`**: Uses agent-based executor for complex project execution
+- **`algo_cache_bench/`**: Uses custom evaluator for code execution and performance testing
+- **`course_lab_bench/`**: Uses agent-based executor for complex project execution
 
 ## Step 4: Configure Your Benchmark
 
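The pairings above all follow the same executor-plus-evaluator composition. The sketch below shows the general shape with a toy exact-match evaluator; the `Evaluator` base class, method names, and result type are hypothetical stand-ins, so the framework's actual interfaces (`SimpleExecutor`, `BasicEvaluator`, `ExamEvaluator`, and so on) may look different.

```python
# Hypothetical shape of an executor + evaluator pairing; the base class and
# method names are illustrative stand-ins, not the framework's real API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    task_id: str
    score: float


class Evaluator:
    def evaluate(self, task: Dict, output: str) -> EvalResult:
        raise NotImplementedError


class ExactMatchEvaluator(Evaluator):
    """Scores 1.0 when the model output matches the ground-truth response exactly."""

    def evaluate(self, task: Dict, output: str) -> EvalResult:
        expected = task["response"].strip().lower()
        return EvalResult(task["id"], float(output.strip().lower() == expected))


def run_benchmark(tasks: List[Dict],
                  executor: Callable[[str], str],
                  evaluator: Evaluator) -> float:
    """Run the executor on each task's prompt, score it, and return the mean score."""
    results = [evaluator.evaluate(t, executor(t["user_prompt"])) for t in tasks]
    return sum(r.score for r in results) / len(results) if results else 0.0
```

A minimal agent like the earlier sketch could be passed as `executor`; an agent-based benchmark would swap in a tool-using agent instead.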
@@ -437,9 +437,9 @@ Follow the [PreChecks.md](PreChecks.md) for code formatting and linting guidelines
 Refer to existing benchmarks for inspiration:
 
 - **`example_bench/`**: Minimal template with `SimpleExecutor` + `BasicEvaluator`
-- **`cache_bench/`**: Code execution, algorithm simulation and performance evaluation
+- **`algo_cache_bench/`**: Code execution, algorithm simulation and performance evaluation
 - **`course_exam_bench/`**: Multiple-choice and short-answer questions with `ExamEvaluator`
-- **`course_project_bench/`**: Complex project-based evaluation with agent executors
+- **`course_lab_bench/`**: Complex project-based evaluation with agent executors