Commit 21f7026

chore(course_exam_bench): update 6_1810 exam_id for consistency
Signed-off-by: Tarek <[email protected]>
1 parent 1a3ec43 commit 21f7026

4 files changed, with 50 additions and 38 deletions.

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Evaluation Results
+
+| Course | # of questions | Score (gpt-4 1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/median) |
+| ---------------------------------------------------------- | -------------- | ----------------------------- | ---------------------------- | ----------------------------- | ---------------------------------- |
+| 6.5840 Distributed System Engineering: Spring 2025 Exam I | 11 | 29/65 | 27/65 | 25/65 | 65/ **51.8** /52 |
+| 6.5840 Distributed System Engineering: Spring 2024 Exam I | 15 | 54/95 | 55/95 | 42/95 | 95/ **77** /78 |
+| 6.5840 Distributed System Engineering: Spring 2024 Exam II | 14 | 24/71 | 24/71 | 36/71 | 72/ **56.6** /57 |
+| 6.1810 Fall 2024 MIT 6.1810 Operating System Engineering | 14 | 35/70 | 40/70 | 52/70 | 65/ **49.8** /49 |
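
The score cells in the table above use a `score/total` layout, and the student column packs max, average and median into a single cell. A minimal Python sketch for turning those cells into comparable numbers could look like the following. The function names are hypothetical and are not part of the benchmark code.

```python
def score_fraction(cell: str) -> float:
    """Turn a 'score/total' cell such as '29/65' into a fraction of the total."""
    score, total = (float(part) for part in cell.split("/"))
    return score / total


def parse_student_cell(cell: str) -> tuple[float, float, float]:
    """Split a 'max/average/median' cell such as '65/ **51.8** /52'."""
    # Drop surrounding whitespace and the Markdown bold markers around the average.
    maximum, average, median = (
        float(part.strip().strip("*")) for part in cell.split("/")
    )
    return maximum, average, median


# Values copied from the Spring 2025 Exam I row of the table.
model_fraction = score_fraction("29/65")
_, student_average, _ = parse_student_cell("65/ **51.8** /52")
student_fraction = student_average / 65  # 65 is the exam total in that row
print(f"model: {model_fraction:.1%}, student average: {student_fraction:.1%}")
```

Expressing both sides as a fraction of the exam total makes rows with different totals (65, 95, 71, 70) directly comparable.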

benchmarks/course_exam_bench/README.md

Lines changed: 26 additions & 22 deletions
@@ -6,6 +6,8 @@ This benchmark evaluates the performance of Large Language Models (LLMs) on syst
 - Question types: Single-choice, multiple-choice, true/false, and short-answer
 - Includes real student performance data for comparison
 
+For current model evaluation results, see [EVALUATION_RESULTS.md](EVALUATION_RESULTS.md).
+
 | Exam | Questions | Topics |
 | ------------------------------ | --------- | ------------------- |
 | MIT 6.5840 Spring 2025 Exam I | 11 | Distributed Systems |
@@ -76,7 +78,7 @@ For each question, one JSON object per line:
 ```json
 {
   "instance_id": 1,
-  "exam_id": "6_1810_fall_2024_quiz_ii_solutions",
+  "exam_id": "6_1810_operating_system_engineering_fall_2024_quiz_ii",
   "question_type": "SingleChoice",
   "llm_answer": "C",
   "correct_answer": "C",
@@ -118,45 +120,47 @@ The benchmark data is stored in `data/benchmark/`:
 
 ## How to extend the benchmark
 
+Consider this [MIT 6.824 Distributed Systems quiz](https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf). The steps below show how to add this exam to the benchmark. The same process applies to any course exam you want to include.
+
 ### Step 1: Add exam metadata to `exams_metadata.json`
 
-Create a unique `exam_id` for your exam:
+Create a unique `exam_id` for your exam. Here's the actual entry for the Spring 2024 Exam II:
 
 ```json
 {
-  "exam_id": "your_university_course_year_semester_exam",
-  "test_paper_name": "Your University Course Name: Semester Year Exam",
-  "course": "Course Name",
-  "year": 2025,
-  "score_total": 100,
-  "score_max": 95.0,
-  "score_avg": 75.0,
-  "score_median": 77.0,
-  "score_standard_deviation": 10.5,
-  "num_questions": 10
+  "exam_id": "6_5840_distributed_system_engineering_spring_2024_exam_ii",
+  "test_paper_name": "6.5840 Distributed System Engineering: Spring 2024 Exam II",
+  "course": "Distributed System Engineering",
+  "year": 2024,
+  "score_total": 71,
+  "score_max": 71.0,
+  "score_avg": 56.61,
+  "score_median": 57,
+  "score_standard_deviation": 9.13,
+  "num_questions": 14
 }
 ```
 
 ### Step 2: Add individual questions to `questions.jsonl`
 
-Append your questions to the file. Each line is a JSON object:
+Append your questions to the file. Each line is a JSON object. Here's an example from the exam (a True/False question about FaRM):
 
 ```json
 {
-  "instance_id": 70,
-  "exam_id": "your_university_course_year_semester_exam",
-  "problem_num": 1,
-  "points": 10,
-  "problem": "Explain the difference between a process and a thread.",
-  "answer": "A process is an instance of a running program with its own memory space, while a thread is a unit of execution within a process that shares the process's memory.",
-  "explanation": "Full explanation here...",
-  "type": "ShortAnswerQuestion"
+  "instance_id": 33,
+  "exam_id": "6_5840_distributed_system_engineering_spring_2024_exam_ii",
+  "problem_num": 4,
+  "points": 8,
+  "problem": "# III FaRM \n\nConsider the following statements about FaRM as described in No compromises: distributed transactions with consistency, availability, and performance. For each statement, circle True or False. \n\n4. [8 points]: \n\nTrue / False : Because FaRM uses primary-backup replication for a region (instead of Paxos), FaRM must reconfigure to remove a failed replica before FaRM can continue to use the region. \n\nTrue / False : FaRM can use short leases (10ms by default) because it has communication and scheduling optimizations to renew leases quickly. \n\nTrue / False : A transaction that modifies only one object will never abort. \n\nTrue / False : Read-only transactions require only the validate step of the Commit phase in Figure 4. ",
+  "answer": "True,True,False,True",
+  "explanation": "Answer: True, True, False, True. The first statement is true because FaRM requires a response from all replicas, thus it must reconfigure to remove the failed replica before it can continue with the affected shard. The third statement is false because another transaction may modify the one object causing this transaction's validation phase to fail (because the other transaction will have incremented the object's version number).",
+  "type": "True/False Questions"
 }
 ```
 
 Required fields:
 
-- `instance_id`: Globally unique number (use next available number, currently 70+)
+- `instance_id`: Globally unique number (use next available number)
 - `exam_id`: Must match the `exam_id` from Step 1
 - `problem_num`: Question number within the exam (1, 2, 3, ...)
 - `points`: Points allocated to this question
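
The two constraints stated in the README hunk above (a globally unique `instance_id`, and an `exam_id` that matches an entry from Step 1) can be checked mechanically before committing new questions. The following is a minimal sketch, not the benchmark's own tooling. It assumes `exams_metadata.json` parses to a list of objects and that `questions.jsonl` holds one JSON object per line. `validate_questions` and `REQUIRED_FIELDS` are hypothetical names, and only the four fields visible in the truncated list above are checked.

```python
import json

# Only the fields visible in the (truncated) "Required fields" list above.
REQUIRED_FIELDS = ("instance_id", "exam_id", "problem_num", "points")


def validate_questions(metadata, question_lines):
    """Return a list of problem descriptions; an empty list means none were found."""
    known_exam_ids = {exam["exam_id"] for exam in metadata}
    seen_instance_ids = set()
    problems = []
    for line_no, line in enumerate(question_lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        question = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in question]
        if missing:
            problems.append(f"line {line_no}: missing fields {missing}")
            continue
        if question["exam_id"] not in known_exam_ids:
            problems.append(f"line {line_no}: unknown exam_id {question['exam_id']!r}")
        if question["instance_id"] in seen_instance_ids:
            problems.append(f"line {line_no}: duplicate instance_id {question['instance_id']}")
        seen_instance_ids.add(question["instance_id"])
    return problems
```

A typical call would load the metadata with `json.load` and pass the open `questions.jsonl` file object as `question_lines`. The file locations are left to the caller because the full paths are not all shown in this diff.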

benchmarks/course_exam_bench/data/benchmark/exams_metadata.json

Lines changed: 2 additions & 2 deletions
@@ -49,8 +49,8 @@
   "num_questions": 14
 },
 {
-  "exam_id": "6_1810_fall_2024_quiz_ii_solutions",
-  "test_paper_name": "6.1810 Fall 2024 Quiz II Solutions",
+  "exam_id": "6_1810_operating_system_engineering_fall_2024_quiz_ii",
+  "test_paper_name": "6.1810 Fall 2024 Quiz II",
   "course": "Operating System Engineering",
   "year": 2024,
   "score_total": 70,

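Renaming an `exam_id` in the metadata only stays consistent if every record that refers to the old id is updated as well. A hedged sketch of such a rename over JSON Lines records is shown below. `rename_exam_id` is a hypothetical helper, not code from this commit, and it assumes each input line is a complete JSON object without a trailing newline (for example, the result of `str.splitlines()`).

```python
import json


def rename_exam_id(lines, old_id, new_id):
    """Return a copy of `lines` in which records using `old_id` use `new_id`."""
    updated = []
    for line in lines:
        if not line.strip():
            updated.append(line)  # keep blank lines as they are
            continue
        record = json.loads(line)
        if record.get("exam_id") == old_id:
            record["exam_id"] = new_id
            # Re-serialize only the records that changed, keeping non-ASCII text readable.
            line = json.dumps(record, ensure_ascii=False)
        updated.append(line)
    return updated


# The old and new ids are taken from the hunk above.
records = ['{"instance_id": 1, "exam_id": "6_1810_fall_2024_quiz_ii_solutions"}']
renamed = rename_exam_id(
    records,
    "6_1810_fall_2024_quiz_ii_solutions",
    "6_1810_operating_system_engineering_fall_2024_quiz_ii",
)
print(renamed[0])
```

Lines that do not match are returned byte-for-byte unchanged, which keeps the resulting diff limited to the renamed records.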