
Commit 1b84b81

fix: correct benchmark references in creating_benchmark.md
1 parent a1a6bb6 commit 1b84b81

1 file changed (+3, -3 lines)


doc/creating_benchmark.md

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ Before creating a custom benchmark, ensure you have:

 Choose an example benchmark that **is similar to** your setting as a starting point.

-If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [algo_cache_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/algo_cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
+If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [cache_algo_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/cache_algo_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).

 Use [course_lab_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench), if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need different running environments. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests—similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/sys-intelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/add_agents.md).


@@ -254,7 +254,7 @@ class CustomEvaluator(Evaluator):

 - **`example_bench/src/main.py`**: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
 - **`course_exam_bench/`**: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
-- **`algo_cache_bench/`**: Uses custom evaluator for code execution and performance testing
+- **`cache_algo_bench/`**: Uses custom evaluator (cache_simulator) for code execution and performance testing
 - **`course_lab_bench/`**: Uses agent-based executor for complex project execution

 ## Step 4: Configure Your Benchmark
@@ -437,7 +437,7 @@ Follow the [PreChecks.md](PreChecks.md) for code formatting and linting guidelin
 Refer to existing benchmarks for inspiration:

 - **`example_bench/`**: Minimal template with `SimpleExecutor` + `BasicEvaluator`
-- **`algo_cache_bench/`**: Code execution, algorithm simulation and performance evaluation
+- **`cache_algo_bench/`**: Code execution, algorithm simulation and performance evaluation
 - **`course_exam_bench/`**: Multiple-choice and short-answer questions with `ExamEvaluator`
 - **`course_lab_bench/`**: Complex project-based evaluation with agent executors
