**`doc/creating_benchmark.md`**: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ Before creating a custom benchmark, ensure you have:
 
 Choose an example benchmark that **is similar to** your setting as a starting point.
 
-If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [algo_cache_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/algo_cache_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
+If your tasks involve exam-style questions, consider starting from [course_exam_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_exam_bench). If your benchmark focuses on algorithm design or optimization tasks, you might use [cache_algo_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/cache_algo_bench) as a template. These tasks can often be handled by a minimal agent (an LLM call plus a response parser).
 
 Use [course_lab_bench](https://github.com/sys-intelligence/system_intelligence_benchmark/tree/main/benchmarks/course_lab_bench) if your benchmark is related to **environment setup, system understanding/implementation, performance analysis, or debugging tasks**, and each task may need a different running environment. These tasks typically require an LLM to autonomously call tools (such as the File Editor, Bash, etc.), navigate a large codebase, and run experiments or tests, similar to what a human developer would do. To support this, we provide several advanced agents (e.g., Claude Code, MiniSWEAgent) in this example, along with guidance for [integrating new agents](https://github.com/sys-intelligence/system_intelligence_benchmark/blob/main/benchmarks/course_lab_bench/add_agents.md).
 
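Aside: the "minimal agent (an LLM call plus a response parser)" mentioned in the changed line could look roughly like the sketch below. All names here (`call_llm`, `parse_answer`, the `Answer: <letter>` convention) are illustrative assumptions, not the repository's actual API.

```python
import re

def call_llm(prompt: str) -> str:
    """Stand-in for a single model call; wire this to your LLM client."""
    raise NotImplementedError("plug in your LLM client here")

def parse_answer(response: str) -> str:
    """Pull a final answer such as 'Answer: B' out of free-form model text."""
    match = re.search(r"Answer:\s*([A-D])", response)
    return match.group(1) if match else response.strip()

def minimal_agent(question: str) -> str:
    """One LLM call plus a response parser: the whole 'agent'."""
    prompt = f"{question}\n\nRespond with 'Answer: <letter>'."
    return parse_answer(call_llm(prompt))
```

Anything fancier (tool calls, retries, codebase navigation) pushes you toward the agent-based setups described for `course_lab_bench` above.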
@@ -254,7 +254,7 @@ class CustomEvaluator(Evaluator):
 
 - **`example_bench/src/main.py`**: Uses `SimpleExecutor` + `BasicEvaluator` for basic evaluation with multiple similarity metrics
 - **`course_exam_bench/`**: Uses `SimpleExecutor` + `ExamEvaluator` for grading exam questions
-- **`algo_cache_bench/`**: Uses custom evaluator for code execution and performance testing
+- **`cache_algo_bench/`**: Uses custom evaluator (cache_simulator) for code execution and performance testing
 - **`course_lab_bench/`**: Uses agent-based executor for complex project execution
 
 ## Step 4: Configure Your Benchmark
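The hunk header above references `class CustomEvaluator(Evaluator):`. As a rough, unverified sketch of what a `cache_algo_bench`-style custom evaluator might do (execute submitted code, then score its output), assuming a hypothetical `evaluate()` signature and a cache_simulator-style CLI that prints a hit rate; neither is confirmed by this diff:

```python
import subprocess

class Evaluator:
    """Placeholder base class; the framework supplies the real one."""

class CustomEvaluator(Evaluator):
    def evaluate(self, submission_path: str, trace_path: str) -> dict:
        # Run the submitted cache policy against a workload trace via a
        # hypothetical simulator CLI that prints a hit rate to stdout.
        result = subprocess.run(
            ["python", submission_path, "--trace", trace_path],
            capture_output=True, text=True, timeout=60,
        )
        try:
            hit_rate = float(result.stdout.strip())
        except ValueError:
            hit_rate = 0.0  # non-numeric output counts as a failed run
        return {"passed": result.returncode == 0, "hit_rate": hit_rate}
```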
@@ -437,7 +437,7 @@ Follow the [PreChecks.md](PreChecks.md) for code formatting and linting guidelines
 Refer to existing benchmarks for inspiration:
 
 - **`example_bench/`**: Minimal template with `SimpleExecutor` + `BasicEvaluator`
-- **`algo_cache_bench/`**: Code execution, algorithm simulation and performance evaluation
+- **`cache_algo_bench/`**: Code execution, algorithm simulation and performance evaluation
 - **`course_exam_bench/`**: Multiple-choice and short-answer questions with `ExamEvaluator`
 - **`course_lab_bench/`**: Complex project-based evaluation with agent executors
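As a final hedged illustration, the "multiple similarity metrics" that `example_bench`'s `BasicEvaluator` is described as using might resemble the following; the specific metrics (exact match, token-level F1) and function names are assumptions, not taken from the repository:

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings agree exactly, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """F1 over the sets of whitespace tokens, a looser overlap metric."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    common = len(p & r)
    if not p or not r or common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

print(exact_match("LRU", "lru"))  # 1.0
print(token_f1("evict least recent", "evict the least recent entry"))  # 0.75
```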