README.md (1 addition, 1 deletion)

@@ -18,7 +18,7 @@ The benchmark framework is **still under development**. If you have any question
 System Intelligence Benchmark currently includes the following example benchmarks. Each benchmark assesses specific capabilities across multiple levels within a given research direction. Some benchmarks are still under development — we're actively updating them. Stay tuned!
 
 - **System Exam Benchmark** ([benchmarks/course_exam_bench/](benchmarks/course_exam_bench/)) - Tests LLM understanding of system concepts through university course exams (54 questions across 4 exams)
-- **System Lab Benchmark** ([benchmarks/course_project_bench/](benchmarks/course_project_bench/)) - Assesses AI capability on practical system course projects
+- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
 - **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
 - **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
 - **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks
doc/creating_benchmark.md (7 additions, 7 deletions)

@@ -14,9 +14,9 @@ Before creating a custom benchmark, ensure you have:
 
 Choose an example benchmark that **is similar to** your setting as a starting point.
 
-If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
+If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [algo_cache_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/algo_cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
 
-Use [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench), if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need different runing environments. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/add_agents.md).
+Use [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/tree/main/benchmarks/course_lab_bench) if your benchmark involves **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need a different running environment. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/add_agents.md).
 
 1. Navigate to the benchmarks directory:
 
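A note on the "minimal agent (an LLM call plus a response parser)" mentioned above: the sketch below is a hypothetical illustration of that pattern, not code from the repository; the OpenAI client, model name, prompt wording, and answer format are all assumptions.

```python
# Hypothetical sketch of a "minimal agent": one LLM call plus a response parser.
# Assumes the openai package is installed and OPENAI_API_KEY is set; the model
# name and the "Answer: <letter>" convention are illustrative, not repo-defined.
import re

from openai import OpenAI

client = OpenAI()


def minimal_agent(question: str) -> str:
    """Ask one exam-style question and parse out the final answer letter."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nEnd your reply with 'Answer: <letter>'.",
        }],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"Answer:\s*([A-Z])", text)
    return match.group(1) if match else text.strip()
```

Tasks in `course_lab_bench`, by contrast, need tool use (file editing, shell access) and a persistent environment, which is why that benchmark ships with full agents such as Claude Code and MiniSWEAgent.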
@@ -72,7 +72,7 @@ Create your evaluation dataset in a structured format:
 - `user_prompt`: User query/task description
 - `response`: Expected/ground truth response
 
-3.**NOTES:** for more complex scenarios, you can use **any custom formats**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_project_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_project_bench/data/benchmark/env_setup_examples.jsonl) for examples.
+3. **NOTES:** For more complex scenarios, you can use **any custom format**. See [course_exam_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_exam_bench/data/benchmark/questions.jsonl) and [course_lab_bench](https://github.com/systemintelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/data/benchmark/env_setup_examples.jsonl) for examples.
 
 ## Step 3: Select or Implement Your Executor and Evaluator
 
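To make the `user_prompt`/`response` format above concrete, here is a minimal sketch of writing and reading one JSONL record; the file name and the extra `id` field are illustrative assumptions rather than a required schema (see the linked `questions.jsonl` and `env_setup_examples.jsonl` for the real formats).

```python
# Illustrative only: one JSONL record using the user_prompt/response fields above.
# The file name and the extra "id" field are assumptions, not a required schema.
import json

record = {
    "id": "example-001",  # hypothetical extra field
    "user_prompt": "Explain the difference between a process and a thread.",
    "response": "A process has its own address space; threads in a process share one.",
}

# Append the record as one line of JSON.
with open("benchmark.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

# Read the dataset back, one task per line.
with open("benchmark.jsonl", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]
```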
@@ -254,8 +254,8 @@ class CustomEvaluator(Evaluator):
 
 - **`example_bench/src/main.py`**: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
 - **`course_exam_bench/`**: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
-- **`cache_bench/`**: Uses custom evaluator for code execution and performance testing
-- **`course_project_bench/`**: Uses agent-based executor for complex project execution
+- **`algo_cache_bench/`**: Uses custom evaluator for code execution and performance testing
+- **`course_lab_bench/`**: Uses agent-based executor for complex project execution
 
 ## Step 4: Configure Your Benchmark
 
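The pairings above all follow the same executor-plus-evaluator composition. The sketch below shows the general shape with a toy exact-match evaluator; the `Evaluator` base class, method names, and result type are hypothetical stand-ins, so the framework's actual interfaces (`SimpleExecutor`, `BasicEvaluator`, `ExamEvaluator`, and so on) may look different.

```python
# Hypothetical shape of an executor + evaluator pairing; the base class and
# method names are illustrative stand-ins, not the framework's real API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    task_id: str
    score: float


class Evaluator:
    def evaluate(self, task: Dict, output: str) -> EvalResult:
        raise NotImplementedError


class ExactMatchEvaluator(Evaluator):
    """Scores 1.0 when the model output matches the ground-truth response exactly."""

    def evaluate(self, task: Dict, output: str) -> EvalResult:
        expected = task["response"].strip().lower()
        return EvalResult(task["id"], float(output.strip().lower() == expected))


def run_benchmark(tasks: List[Dict],
                  executor: Callable[[str], str],
                  evaluator: Evaluator) -> float:
    """Run the executor on each task's prompt, score it, and return the mean score."""
    results = [evaluator.evaluate(t, executor(t["user_prompt"])) for t in tasks]
    return sum(r.score for r in results) / len(results) if results else 0.0
```

A minimal agent like the earlier sketch could be passed as `executor`; an agent-based benchmark would swap in a tool-using agent instead.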
@@ -437,9 +437,9 @@ Follow the [PreChecks.md](PreChecks.md) for code formatting and linting guidelines
 Refer to existing benchmarks for inspiration:
 
 - **`example_bench/`**: Minimal template with `SimpleExecutor` + `BasicEvaluator`
-- **`cache_bench/`**: Code execution, algorithm simulation and performance evaluation
+- **`algo_cache_bench/`**: Code execution, algorithm simulation and performance evaluation
 - **`course_exam_bench/`**: Multiple-choice and short-answer questions with `ExamEvaluator`
-- **`course_project_bench/`**: Complex project-based evaluation with agent executors
+- **`course_lab_bench/`**: Complex project-based evaluation with agent executors