2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -18,10 +18,10 @@ jobs:
       matrix:
         benchmark:
           - example_bench
+          - course_exam_bench
           # TODO: For now, we comment out other benchmarks as they have no tests
           # - arteval_bench
           # - cache_bench
-          # - course_exam_bench
           # - course_project_bench
 
     steps:
215 changes: 158 additions & 57 deletions benchmarks/course_exam_bench/README.md
@@ -1,91 +1,192 @@
# Course Exam Benchmark

## Introduction
This benchmark evaluates the performance of Large Language Models (LLMs) on system course exams.

- 69 questions from 5 MIT exams
- Question types: Single-choice, multiple-choice, true/false, and short-answer
- Includes real student performance data for comparison

## Task Details
| Exam | Questions | Topics |
| ------------------------------ | --------- | ------------------- |
| MIT 6.5840 Spring 2025 Exam I | 11 | Distributed Systems |
| MIT 6.5840 Spring 2025 Exam II | 15 | Distributed Systems |
| MIT 6.5840 Spring 2024 Exam I | 15 | Distributed Systems |
| MIT 6.5840 Spring 2024 Exam II | 14 | Distributed Systems |
| MIT 6.1810 Fall 2024 Quiz II | 14 | Operating Systems |

## Quick Start

### 1. Install dependencies

```bash
./install.sh
```

This creates a Python virtual environment and installs the required packages.

> **Reviewer (Collaborator)**, on the removed eval-results table: I am thinking if we can have one separate md file to show the current measurement results. People may have interest in the results.

### 2. Configure your LLM endpoint

Edit `env.toml` to add your API keys:

```toml
[llm]
AZURE_API_KEY = "your-key-here"
AZURE_API_BASE = "https://your-endpoint.openai.azure.com/"
# or
ANTHROPIC_API_KEY = "your-key-here"
```

### 3. Run the benchmark

```bash
./run.sh "gpt-4o"
```

Or run directly with Python:

```bash
source .venv/bin/activate
python src/main.py --model_name "gpt-4o"
```

### 4. Run tests

```bash
./test.sh
```

## How it works

1. Load questions: Reads exam questions from `data/benchmark/`
2. For each question:
   - Prompts the LLM with the question
   - Parses the LLM's JSON response
   - Evaluates the answer (exact match for multiple-choice, LLM-as-judge for short-answer)
   - Records the score
3. Generate summary: Aggregates results by exam and overall (a minimal sketch of this loop follows)
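
The following minimal sketch illustrates this loop. `call_llm` and `grade_answer` are hypothetical placeholders for the prompting and grading logic in `src/main.py`, and it assumes the model replies with a JSON object containing an `answer` key; treat it as an outline, not the actual implementation.

```python
import json
from pathlib import Path
from typing import Callable

def run_benchmark(questions_path: str,
                  call_llm: Callable[[str, str], str],
                  grade_answer: Callable[[dict, str], float]) -> dict:
    """Illustrative evaluation loop: prompt, parse, grade, aggregate per exam."""
    per_exam: dict[str, dict[str, float]] = {}
    for line in Path(questions_path).read_text().splitlines():
        if not line.strip():
            continue
        q = json.loads(line)                       # one question per JSONL line
        raw = call_llm(q["problem"], q["type"])    # placeholder LLM call returning JSON text
        try:
            answer = json.loads(raw)["answer"]     # assumed response format: {"answer": ...}
            earned = grade_answer(q, answer)       # exact match or LLM-as-judge
        except (json.JSONDecodeError, KeyError):
            earned = 0                             # unparsable responses score zero
        stats = per_exam.setdefault(q["exam_id"], {"earned": 0.0, "possible": 0.0})
        stats["earned"] += earned
        stats["possible"] += q["points"]
    return per_exam
```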

## Output files

After running, you'll find results in `./outputs/course_exam__<model>__<timestamp>/`:

### 1. Per-question results (`results.jsonl`)

For each question, one JSON object per line:

```json
{
  "instance_id": 1,
  "exam_id": "6_1810_fall_2024_quiz_ii_solutions",
  "question_type": "SingleChoice",
  "llm_answer": "C",
  "correct_answer": "C",
  "points_earned": 5,
  "points_possible": 5,
  "status": "correct"
}
```

Fields:

- `instance_id`: Question identifier
- `exam_id`: Exam identifier (links to exams_metadata.json)
- `question_type`: Type of question (`SingleChoice`, `MultipleChoice`, `True/False Questions`, `ShortAnswerQuestion`)
- `llm_answer`: LLM's answer
- `correct_answer`: Correct answer
- `points_earned`: Points the LLM earned
- `points_possible`: Maximum points for this question
- `status`: `correct`, `incorrect`, `partial`, or `error`
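
To recompute the totals yourself, here is a small sketch that reads only the fields listed above (the run directory name is a placeholder for your own output folder):

```python
import json

run_dir = "./outputs/course_exam__gpt-4o__2025-01-01_00-00-00"  # placeholder: use your run directory
earned = possible = 0
with open(f"{run_dir}/results.jsonl") as f:
    for line in f:
        r = json.loads(line)
        earned += r["points_earned"]
        possible += r["points_possible"]

print(f"Total: {earned}/{possible} points ({100 * earned / possible:.1f}%)")
```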

### 2. Full debugging information (`results_detailed.jsonl`)

Extended format with prompts and LLM explanations (for debugging).

### 3. Aggregated statistics (`summary.json`)

Overall performance and breakdown by exam with answered/unanswered/correct/incorrect counts.

### 4. LLM vs student performance (`comparison.json`)

Compares LLM performance against real student baseline data.

## Data format

The benchmark data is stored in `data/benchmark/`:

- `exams_metadata.json`: Exam-level metadata (one entry per exam)
- `questions.jsonl`: Individual questions (one JSON object per line that links to an exam from `exams_metadata.json` via `exam_id`)
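
As an illustration of the `exam_id` link, the sketch below loads both files and joins each question to its exam entry. It assumes `exams_metadata.json` is a JSON array of exam objects with the fields shown in the extension steps below.

```python
import json

with open("data/benchmark/exams_metadata.json") as f:
    exams = {e["exam_id"]: e for e in json.load(f)}   # assumed: a JSON array of exam entries

with open("data/benchmark/questions.jsonl") as f:
    for line in f:
        q = json.loads(line)
        exam = exams[q["exam_id"]]                    # join each question to its exam metadata
        print(q["instance_id"], exam["course"], q["points"])
```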

## How to extend the benchmark

> **Reviewer (Collaborator):** Thanks a lot, Tarek. All is very clear and well-organized. One small suggestion: can we add some sentences to set the prerequisites? Assume we already have a course exam like this one: https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf. Then all the following steps are based on this example quiz.

> **Author (Collaborator):** Updated.

### Step 1: Add exam metadata to `exams_metadata.json`

Create a unique `exam_id` for your exam:

```json
{
  "exam_id": "your_university_course_year_semester_exam",
  "test_paper_name": "Your University Course Name: Semester Year Exam",
  "course": "Course Name",
  "year": 2025,
  "score_total": 100,
  "score_max": 95.0,
  "score_avg": 75.0,
  "score_median": 77.0,
  "score_standard_deviation": 10.5,
  "num_questions": 10
}
```

### Step 2: Add individual questions to `questions.jsonl`

Append your questions to the file. Each line is a JSON object:

```json
{
  "instance_id": 70,
  "exam_id": "your_university_course_year_semester_exam",
  "problem_num": 1,
  "points": 10,
  "problem": "Explain the difference between a process and a thread.",
  "answer": "A process is an instance of a running program with its own memory space, while a thread is a unit of execution within a process that shares the process's memory.",
  "explanation": "Full explanation here...",
  "type": "ShortAnswerQuestion"
}
```

Required fields:

- `instance_id`: Globally unique number (use next available number, currently 70+)
- `exam_id`: Must match the `exam_id` from Step 1
- `problem_num`: Question number within the exam (1, 2, 3, ...)
- `points`: Points allocated to this question
- `problem`: The question text
- `answer`: Correct answer
  - For SingleChoice: `"A"`, `"B"`, etc.
  - For MultipleChoice: `"A,B,C"` (comma-separated, no spaces)
  - For True/False: `"True,False,True"` (one per sub-question)
  - For ShortAnswerQuestion: The model answer text
- `explanation`: Explanation of the correct answer
- `type`: One of `"SingleChoice"`, `"MultipleChoice"`, `"True/False Questions"`, `"ShortAnswerQuestion"`

> Note: Questions should be sorted by `exam_id`, then `instance_id`.

After adding the exam and questions, run `./test.sh` as a sanity check to validate the data format. This will also run in the CI pipeline.
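
For a quick standalone sanity check of newly added lines, a minimal validation sketch is shown below (illustrative only; `./test.sh` remains the authoritative check):

```python
import json

REQUIRED = {"instance_id", "exam_id", "problem_num", "points",
            "problem", "answer", "explanation", "type"}
VALID_TYPES = {"SingleChoice", "MultipleChoice",
               "True/False Questions", "ShortAnswerQuestion"}

with open("data/benchmark/questions.jsonl") as f:
    for n, line in enumerate(f, start=1):
        q = json.loads(line)
        missing = REQUIRED - q.keys()
        assert not missing, f"line {n}: missing fields {missing}"
        assert q["type"] in VALID_TYPES, f"line {n}: unknown type {q['type']!r}"

print("questions.jsonl: all lines have the required fields")
```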

## Question types and evaluation

| Type | Answer Format | Evaluation Method | Partial Credit? |
| -------------------- | ------------------- | ----------------- | ---------------------------------- |
| SingleChoice | `"A"` | Exact match | No |
| MultipleChoice | `"A,B,C"` | Subset check | Yes (2 points for partial correct) |
| True/False Questions | `"True,False,True"` | Exact match | No |
| ShortAnswerQuestion | Free text | LLM-as-judge | Yes (scored 0 to max points) |

For short-answer questions, an LLM evaluates the answer based on accuracy, completeness, logical consistency, and clarity.
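
As an illustration of the multiple-choice rule in the table above, here is one plausible reading of the subset check as a short sketch (the actual grader in `src/main.py` may differ in details):

```python
def score_multiple_choice(llm_answer: str, correct_answer: str, points: int) -> int:
    """Full credit for an exact match; 2 points if the model selected only
    correct options but missed some; otherwise zero. Illustrative only."""
    picked = set(llm_answer.split(","))
    correct = set(correct_answer.split(","))
    if picked == correct:
        return points
    if picked and picked < correct:   # strict subset: nothing wrong, something missing
        return 2                      # partial credit, as described in the table above
    return 0

# Example: picking A and B out of A, B, C earns partial credit
assert score_multiple_choice("A,B", "A,B,C", 5) == 2
```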

## Training data templates

See the example files in:

- `data/sft/course_exam_sft_example.jsonl`: Format for supervised fine-tuning
- `data/pretrain/course_exam_pretrain_example.jsonl`: Format for pre-training