
Commit 1577aca

Merge pull request #11 from systemintelligence/docs_course_exam_bench
Course Exam Benchmark: Restructure Data Format
2 parents ea9b54d + 4d18658 commit 1577aca

20 files changed: +949 −490 lines changed

.github/workflows/test.yml

Lines changed: 1 addition & 1 deletion
@@ -18,10 +18,10 @@ jobs:
       matrix:
         benchmark:
           - example_bench
+          - course_exam_bench
           # TODO: For now, we comment out other benchmarks as they have no tests
           # - arteval_bench
           # - cache_bench
-          # - course_exam_bench
           # - course_project_bench

     steps:
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Evaluation Results
+
+| Course | # of questions | Score (gpt-4.1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/median) |
+| ------ | -------------- | ----------------------------- | ---------------------------- | ----------------------------- | ---------------------------------- |
+| 6.5840 Distributed System Engineering: Spring 2025 Exam I | 11 | 29/65 | 27/65 | 25/65 | 65/ **51.8** /52 |
+| 6.5840 Distributed System Engineering: Spring 2024 Exam I | 15 | 54/95 | 55/95 | 42/95 | 95/ **77** /78 |
+| 6.5840 Distributed System Engineering: Spring 2024 Exam II | 14 | 24/71 | 24/71 | 36/71 | 72/ **56.6** /57 |
+| 6.1810 Operating System Engineering: Fall 2024 Quiz II | 14 | 35/70 | 40/70 | 52/70 | 65/ **49.8** /49 |
Lines changed: 162 additions & 57 deletions
@@ -1,91 +1,196 @@
-# Sytsem Course Exam Benchmark
+# Course Exam Benchmark

-## Introduction
+This benchmark evaluates the performance of Large Language Models (LLMs) on system course exams.

-This benchmark evaluates the performance of Large Language Models (LLMs) on system course exams. Currently, this benchmark includes 4 exams from MIT, in total 54 questions, covering various topics such as operating system and distributed system. It contains single-choice questions, multiple-choice questions, and short-answer questions. The questions are designed to test the understanding of system concepts and problem-solving skills.
+- 69 questions from 5 MIT exams
+- Question types: Single-choice, multiple-choice, true/false, and short-answer
+- Includes real student performance data for comparison

-## Task Details
+For current model evaluation results, see [EVALUATION_RESULTS.md](EVALUATION_RESULTS.md).

-- **Input**: The questions in the system course exams. It include single-choice questions, multiple-choice questions, and short-answer questions.
-- **Output**: The answers to the questions. The output can be in the form of selected options for single-choice and multiple-choice questions, and detailed explanations for short-answer questions. For single-choice and multiple-choice questions, the output should be the selected option(s) (e.g., "A", "B", "C", etc.). For short-answer questions, the output should be a detailed explanation or answer to the question.
+| Exam                           | Questions | Topics              |
+| ------------------------------ | --------- | ------------------- |
+| MIT 6.5840 Spring 2025 Exam I  | 11        | Distributed Systems |
+| MIT 6.5840 Spring 2025 Exam II | 15        | Distributed Systems |
+| MIT 6.5840 Spring 2024 Exam I  | 15        | Distributed Systems |
+| MIT 6.5840 Spring 2024 Exam II | 14        | Distributed Systems |
+| MIT 6.1810 Fall 2024 Quiz II   | 14        | Operating Systems   |

-- **Evaluation**: For single-choice and multiple-choice questions, the evaluation is to compare the selected option(s) with the ground truth answers provided in the exam papers. The evaluation is binary: correct or incorrect. For multiple-choice questions, partial credit can be given if some of the selected options are correct. For short-answer questions, the evaluation is based on the correctness and completeness of the answer, which can be subjective and may require human evaluation or a predefined rubric. We use LLM combined with human defined rubric to evaluate the short-answer questions.
+## Quick Start

-## Eval Results
+### 1. Install dependencies

-You can see the detailed information of each exam in the table below.
+```bash
+./install.sh
+```

-| Course | # of questions | Score (gpt-4 1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/medium) |
-| ------ | -------------- | ----------------------------- | ---------------------------- | ----------------------------- | ---------------------------------- |
-| 6.5840 Distributed System Engineering: Spring 2025 Exam I | 11 | 29/65 | 27/65 | 25/65 | 65/ **51.8** /52 |
-| 6.5840 Distributed System Engineering: Spring 2024 Exam I | 15 | 54/95 | 55/95 | 42/95 | 95/ **77** /78 |
-| 6.5840 Distributed System Engineering: Spring 2024 Exam II | 14 | 24/71 | 24/71 | 36/71 | 72/ **56.6** /57 |
-| 6.1810 Fall 2024 MIT 6.1810 Operating System Engineering | 14 | 35/70 | 40/70 | 52/70 | 65/ **49.8** /49 |
+This creates a Python virtual environment and installs the required packages.

-## Benchmark Setup
+### 2. Configure your LLM endpoint

-### Test in Docker
+Edit `env.toml` to add your API keys:

-To test your benchmark in a Docker container, follow these steps:
+```toml
+[llm]
+AZURE_API_KEY = "your-key-here"
+AZURE_API_BASE = "https://your-endpoint.openai.azure.com/"
+# or
+ANTHROPIC_API_KEY = "your-key-here"
+```

-1. Build the Docker image using the provided Dockerfile. You can do this by running the following command in the terminal:
+### 3. Run the benchmark

-```sh
-docker build -t your_benchmark_image .
-```
+```bash
+./run.sh "gpt-4o"
+```
+
+Or run directly with Python:

-2. Once the image is built, you can run it using the following command:
+```bash
+source .venv/bin/activate
+python src/main.py --model_name "gpt-4o"
+```

-```sh
-docker run -it --rm your_benchmark_image
-# docker run --rm your_benchmark_image
-```
+### 4. Run tests

-3. Inside the container, navigate to the appropriate directory and execute the benchmark script to start the testing process.
+```bash
+./test.sh
+```

-```sh
-./run.sh
-```
+## How it works

-### Maunaly Test
+1. Load questions: Reads exam questions from `data/benchmark/`
+2. For each question:
+   - Prompts the LLM with the question
+   - Parses the LLM's JSON response
+   - Evaluates the answer (exact match for multiple-choice, LLM-as-judge for short-answer)
+   - Records the score
+3. Generate summary: Aggregates results by exam and overall
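In code, the loop amounts to roughly the following sketch (illustrative only: `ask_llm` and `grade` are hypothetical placeholders for the prompting and scoring logic in `src/main.py`, and the output fields mirror the `results.jsonl` format described under Output files below):

```python
import json
from pathlib import Path

def run_benchmark(questions_path, ask_llm, grade):
    """Sketch of the evaluation loop: one questions.jsonl record in, one result record out."""
    results = []
    for line in Path(questions_path).read_text().splitlines():
        if not line.strip():
            continue
        q = json.loads(line)                    # one question per line
        raw = ask_llm(q["problem"], q["type"])  # prompt the model (hypothetical helper)
        try:
            answer = json.loads(raw).get("answer", "")
        except json.JSONDecodeError:
            answer = ""                         # unparsable responses earn no points
        earned = grade(q, answer)               # exact match or LLM-as-judge (hypothetical helper)
        status = "correct" if earned == q["points"] else ("partial" if earned > 0 else "incorrect")
        results.append({
            "instance_id": q["instance_id"],
            "exam_id": q["exam_id"],
            "question_type": q["type"],
            "llm_answer": answer,
            "correct_answer": q["answer"],
            "points_earned": earned,
            "points_possible": q["points"],
            "status": status,
        })
    return results
```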

-To manually test your benchmark, follow these steps:
+## Output files

-#### Install Dependencies
+After running, you'll find results in `./outputs/course_exam__<model>__<timestamp>/`:

-To install and configure your benchmark, follow these steps:
+### 1. Per-question results (`results.jsonl`)

-1. configure `env.toml` to set LLM API endpoint
-2. install dependencies
+For each question, one JSON object per line:

-```bash
-./install.sh
+```json
+{
+  "instance_id": 1,
+  "exam_id": "6_1810_operating_system_engineering_fall_2024_quiz_ii",
+  "question_type": "SingleChoice",
+  "llm_answer": "C",
+  "correct_answer": "C",
+  "points_earned": 5,
+  "points_possible": 5,
+  "status": "correct"
+}
 ```

-#### Run
+Fields:

-To run your benchmark and obtain results for a specific task and model, follow these steps:
+- `instance_id`: Question identifier
+- `exam_id`: Exam identifier (links to exams_metadata.json)
+- `question_type`: Type of question (`SingleChoice`, `MultipleChoice`, `True/False Questions`, `ShortAnswerQuestion`)
+- `llm_answer`: LLM's answer
+- `correct_answer`: Correct answer
+- `points_earned`: Points the LLM earned
+- `points_possible`: Maximum points for this question
+- `status`: `correct`, `incorrect`, `partial`, or `error`
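Because each line is a self-contained JSON object, the results are easy to post-process. For example (an illustrative snippet, not part of the repository), re-aggregating scores per exam looks like this:

```python
import json
from collections import defaultdict

def score_by_exam(results_path="results.jsonl"):
    """Sum points earned vs. points possible per exam from a results.jsonl file."""
    totals = defaultdict(lambda: [0.0, 0.0])
    with open(results_path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            totals[rec["exam_id"]][0] += rec["points_earned"]
            totals[rec["exam_id"]][1] += rec["points_possible"]
    return dict(totals)

if __name__ == "__main__":
    for exam_id, (earned, possible) in score_by_exam().items():
        print(f"{exam_id}: {earned:g}/{possible:g}")
```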

-1. Review the `run.sh` script to understand the expected commands and parameters.
-2. Execute the `run.sh` script to start the benchmark. The script will guide you through the process and generate the results.
+### 2. Full debugging information (`results_detailed.jsonl`)

-```bash
-./run.sh "gpt-4o"
-```
+Extended format with prompts and LLM explanations (for debugging).

-or
+### 3. Aggregated statistics (`summary.json`)

-```bash
-python3 src/main.py --model_name $MODEL_NAME # default output: ./outputs/system_course)bench___${MODEL_NAME}___$(date +"%Y-%m-%d_%H-%M-%S")
+Overall performance and breakdown by exam with answered/unanswered/correct/incorrect counts.
+
+### 4. LLM vs student performance (`comparison.json`)
+
+Compares LLM performance against real student baseline data.

-# or specify the save path
-python3 src/main.py --model_name $MODEL_NAME --save_path ./outputs/BAISysEducation___${MODEL_NAME}___$(date +"%Y-%m-%d_%H-%M-%S")
+## Data format
+
+The benchmark data is stored in `data/benchmark/`:
+
+- `exams_metadata.json`: Exam-level metadata (one entry per exam)
+- `questions.jsonl`: Individual questions (one JSON object per line that links to an exam from `exams_metadata.json` via `exam_id`)
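An illustrative way to load the two files and join them on `exam_id` (this sketch assumes `exams_metadata.json` is a JSON array of exam entries; the exact layout is defined by the files themselves):

```python
import json

def load_benchmark(meta_path="data/benchmark/exams_metadata.json",
                   questions_path="data/benchmark/questions.jsonl"):
    """Group questions under their exam metadata, joined on exam_id."""
    with open(meta_path) as f:
        # assumption: the metadata file is a JSON array of exam entries
        exams = {e["exam_id"]: {**e, "questions": []} for e in json.load(f)}
    with open(questions_path) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                exams[q["exam_id"]]["questions"].append(q)
    return exams
```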
+
+## How to extend the benchmark
+
+Consider this [MIT 6.824 Distributed Systems quiz](https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf). The steps below show how to add this exam to the benchmark. The same process applies to any course exam you want to include.
+
+### Step 1: Add exam metadata to `exams_metadata.json`
+
+Create a unique `exam_id` for your exam. Here's the actual entry for the Spring 2024 Exam II:
+
+```json
+{
+  "exam_id": "6_5840_distributed_system_engineering_spring_2024_exam_ii",
+  "test_paper_name": "6.5840 Distributed System Engineering: Spring 2024 Exam II",
+  "course": "Distributed System Engineering",
+  "year": 2024,
+  "score_total": 71,
+  "score_max": 71.0,
+  "score_avg": 56.61,
+  "score_median": 57,
+  "score_standard_deviation": 9.13,
+  "num_questions": 14
+}
+```
+
+### Step 2: Add individual questions to `questions.jsonl`
+
+Append your questions to the file. Each line is a JSON object. Here's an example from the exam (a True/False question about FaRM):
+
+```json
+{
+  "instance_id": 33,
+  "exam_id": "6_5840_distributed_system_engineering_spring_2024_exam_ii",
+  "problem_num": 4,
+  "points": 8,
+  "problem": "# III FaRM \n\nConsider the following statements about FaRM as described in No compromises: distributed transactions with consistency, availability, and performance. For each statement, circle True or False. \n\n4. [8 points]: \n\nTrue / False : Because FaRM uses primary-backup replication for a region (instead of Paxos), FaRM must reconfigure to remove a failed replica before FaRM can continue to use the region. \n\nTrue / False : FaRM can use short leases (10ms by default) because it has communication and scheduling optimizations to renew leases quickly. \n\nTrue / False : A transaction that modifies only one object will never abort. \n\nTrue / False : Read-only transactions require only the validate step of the Commit phase in Figure 4. ",
+  "answer": "True,True,False,True",
+  "explanation": "Answer: True, True, False, True. The first statement is true because FaRM requires a response from all replicas, thus it must reconfigure to remove the failed replica before it can continue with the affected shard. The third statement is false because another transaction may modify the one object causing this transaction's validation phase to fail (because the other transaction will have incremented the object's version number).",
+  "type": "True/False Questions"
+}
 ```

-### Output Description
+Required fields:
+
+- `instance_id`: Globally unique number (use next available number)
+- `exam_id`: Must match the `exam_id` from Step 1
+- `problem_num`: Question number within the exam (1, 2, 3, ...)
+- `points`: Points allocated to this question
+- `problem`: The question text
+- `answer`: Correct answer
+  - For SingleChoice: `"A"`, `"B"`, etc.
+  - For MultipleChoice: `"A,B,C"` (comma-separated, no spaces)
+  - For True/False: `"True,False,True"` (one per sub-question)
+  - For ShortAnswerQuestion: The model answer text
+- `explanation`: Explanation of the correct answer
+- `type`: One of `"SingleChoice"`, `"MultipleChoice"`, `"True/False Questions"`, `"ShortAnswerQuestion"`
+
+> Note: Questions should be sorted by `exam_id` then `instance_id`
+
+After adding the exam and questions, run `./test.sh` as a sanity check to validate the data format. This will also run in the CI pipeline.
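The authoritative checks are the ones in the repository's test suite; purely as an illustration, a format check in the same spirit might verify that every record references a known `exam_id` and that its `answer` matches the declared `type`:

```python
import re

ALLOWED_TYPES = {"SingleChoice", "MultipleChoice", "True/False Questions", "ShortAnswerQuestion"}

def check_question(q, known_exam_ids):
    """Return a list of problems found in one questions.jsonl record (empty list = OK)."""
    errors = []
    if q["exam_id"] not in known_exam_ids:
        errors.append(f"unknown exam_id {q['exam_id']!r}")
    if q["type"] not in ALLOWED_TYPES:
        errors.append(f"unknown type {q['type']!r}")
    answer = str(q["answer"])
    if q["type"] == "SingleChoice" and not re.fullmatch(r"[A-Z]", answer):
        errors.append("SingleChoice answer should be a single letter such as 'A'")
    elif q["type"] == "MultipleChoice" and not re.fullmatch(r"[A-Z](,[A-Z])*", answer):
        errors.append("MultipleChoice answer should look like 'A,B,C' (no spaces)")
    elif q["type"] == "True/False Questions" and not re.fullmatch(r"(True|False)(,(True|False))*", answer):
        errors.append("True/False answer should look like 'True,False,True'")
    return errors
```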
+
+## Question types and evaluation
+
+| Type | Answer Format | Evaluation Method | Partial Credit? |
+| -------------------- | ------------------- | ----------------- | ---------------------------------- |
+| SingleChoice | `"A"` | Exact match | No |
+| MultipleChoice | `"A,B,C"` | Subset check | Yes (2 points for partial correct) |
+| True/False Questions | `"True,False,True"` | Exact match | No |
+| ShortAnswerQuestion | Free text | LLM-as-judge | Yes (scored 0 to max points) |
+
+For short-answer questions, an LLM evaluates the answer based on accuracy, completeness, logical consistency, and clarity.
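The exact scoring rules live in the benchmark code; one plausible reading of the MultipleChoice row above (full points for an exact match, the 2-point partial credit for a strictly correct, non-empty subset, zero otherwise) looks like this:

```python
def score_multiple_choice(llm_answer, correct_answer, points_possible):
    """Hedged sketch of the MultipleChoice 'subset check' with 2-point partial credit."""
    picked = {c.strip() for c in llm_answer.split(",") if c.strip()}
    correct = {c.strip() for c in correct_answer.split(",") if c.strip()}
    if picked == correct:
        return points_possible       # exact match: full points
    if picked and picked < correct:  # proper subset: no wrong options selected
        return 2
    return 0

# Example with correct answer "A,B,C" worth 6 points:
assert score_multiple_choice("A,B,C", "A,B,C", 6) == 6  # exact match
assert score_multiple_choice("A,B", "A,B,C", 6) == 2    # correct subset -> partial credit
assert score_multiple_choice("A,D", "A,B,C", 6) == 0    # includes a wrong option
```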
+
+## Training data templates
+
+See the example files in:

-- `result.jsonl`: Detailed output information
-- `summary.json`: Summary of model results
-  - `reference`: Original test scores (ground truth student performance)
-  - `score`: Test scores
-  - `score_by_test_paper`: Test score by test paper
+- `data/sft/course_exam_sft_example.jsonl`: Format for supervised fine-tuning
+- `data/pretrain/course_exam_pretrain_example.jsonl`: Format for pre-training
