
Commit ccfc983

Update links for changed benchmark names
1 parent bc2ce10 commit ccfc983

2 files changed (+8, -8 lines changed)

README.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ The benchmark framework is **still under development**. If you have any question
 System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
 
 - **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
-- **System Lab Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
+- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
 - **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
 - **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
 - **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

doc/creating_benchmark.md

Lines changed: 7 additions & 7 deletions
@@ -14,9 +14,9 @@ Before creating a custom benchmark, ensure you have:
 
 Choose an example benchmark that **is similar to** your setting as a starting point.
 
-If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
+If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [algo_cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/algo_cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
 
-Use [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench), if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need different runing environments. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/add_agents.md).
+Use [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench), if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need different runing environments. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/add_agents.md).
 
 1. Navigate to the benchmarks directory:
 
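
For a concrete picture of the "minimal agent (an LLM call plus a response parser)" mentioned in the changed lines above, a rough sketch could look like the following. The helper names, prompt, model choice, and answer pattern are illustrative assumptions, not anything defined by the benchmark framework.

```python
# Sketch of a "minimal agent": one LLM call plus a response parser.
# The prompt, model name, and answer pattern are illustrative assumptions,
# not part of the benchmark framework.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(question: str, model: str = "gpt-4o-mini") -> str:
    """Send one exam-style question and return the raw model reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content or ""


def parse_choice(reply: str) -> str | None:
    """Pull out a single multiple-choice letter (A-D), if one is present."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None


if __name__ == "__main__":
    reply = ask(
        "Which scheduling policy avoids starvation? Answer with a single letter.\n"
        "A) Priority without aging  B) Round-robin  C) Shortest job first"
    )
    print(parse_choice(reply))
```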

@@ -72,7 +72,7 @@ Create your evaluation dataset in a structured format:
 - `user_prompt`: User query/task description
 - `response`: Expected/ground truth response
 
-3. **NOTES:** for more complex scenarios, you can use **any custom formats**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/data/benchmark/env_setup_examples.jsonl) for examples.
+3. **NOTES:** for more complex scenarios, you can use **any custom formats**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/env_setup_examples.jsonl) for examples.
 
 ## Step 3: Select or Implement Your Executor and Evaluator
 
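
The `user_prompt` and `response` field names in the hunk above come directly from the guide; everything else in this minimal sketch (the file name, the example question and answer) is a placeholder showing how one JSONL record could be written.

```python
# Sketch: append one evaluation record to a JSONL dataset file.
# Only the `user_prompt` and `response` field names come from the guide;
# the file name and example content are placeholders.
import json

record = {
    "user_prompt": "Explain the difference between a mutex and a counting semaphore.",
    "response": "A mutex grants exclusive ownership to one holder at a time, while a "
                "counting semaphore maintains a counter and can admit several holders.",
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```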

@@ -254,8 +254,8 @@ class CustomEvaluator(Evaluator):
 
 - **`example_bench/src/main.py`**: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
 - **`course_exam_bench/`**: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
-- **`cache_bench/`**: Uses custom evaluator for code execution and performance testing
-- **`course_project_bench/`**: Uses agent-based executor for complex project execution
+- **`algo_cache_bench/`**: Uses custom evaluator for code execution and performance testing
+- **`course_lab_bench/`**: Uses agent-based executor for complex project execution
 
 ## Step 4: Configure Your Benchmark
 
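
As a rough sketch of the `CustomEvaluator(Evaluator)` pattern referenced in the hunk header above, the snippet below scores predictions by exact match. The base-class shape shown here (a single `evaluate` method returning a dict) is an assumption made for illustration; the framework's real `Evaluator` class defines the actual contract.

```python
# Hypothetical sketch of a custom evaluator. The base-class shape shown here
# (a single `evaluate` method returning a dict) is an assumption made for
# illustration; consult the framework's real Evaluator interface before reuse.
class Evaluator:
    """Stand-in for the framework's Evaluator base class."""

    def evaluate(self, prediction: str, reference: str) -> dict:
        raise NotImplementedError


class CustomEvaluator(Evaluator):
    """Scores a prediction by case-insensitive exact match against the ground truth."""

    def evaluate(self, prediction: str, reference: str) -> dict:
        correct = prediction.strip().lower() == reference.strip().lower()
        return {"score": 1.0 if correct else 0.0, "exact_match": correct}


if __name__ == "__main__":
    print(CustomEvaluator().evaluate("Round-robin", "round-robin"))
```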

@@ -437,9 +437,9 @@ Follow the [PreChecks.md](PreChecks.md) for code formatting and linting guidelin
 Refer to existing benchmarks for inspiration:
 
 - **`example_bench/`**: Minimal template with `SimpleExecutor` + `BasicEvaluator`
-- **`cache_bench/`**: Code execution, algorithm simulation and performance evaluation
+- **`algo_cache_bench/`**: Code execution, algorithm simulation and performance evaluation
 - **`course_exam_bench/`**: Multiple-choice and short-answer questions with `ExamEvaluator`
-- **`course_project_bench/`**: Complex project-based evaluation with agent executors
+- **`course_lab_bench/`**: Complex project-based evaluation with agent executors
 
 ### Getting Help
 