Conversation

@tareknaser (Collaborator)

  • Data Restructuring
    • Before:
      • Multiple files (SystemTestPaper.jsonl, timestamped versions, .xlsx files, templates)
      • Each exam question repeated exam metadata (course, year, student stats)
      • Mixed use of id/instance_id and test_paper_name/exam_id
    • After:
      • Two-file structure to minimize duplication
        • exams_metadata.json: Exam-level data
        • questions.jsonl: Question-level data
      • Questions reference exams via exam_id; exam metadata is stored once in a separate file
  • Output Format (4 files instead of 2)
    • Before:
      • result.jsonl, avg_score.json
    • After:
      • results.jsonl: Minimal per-question results
      • results_detailed.jsonl: Full debugging info (prompts, explanations)
      • summary.json: Statistics with answered/unanswered/correct/incorrect tracking
      • comparison.json: LLM vs student performance comparison
  • Added tests, including data-format and output-format validation (also run in GitHub CI)
  • README: Added "How to extend the benchmark" section
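As a minimal sketch of the two-file structure described above (field names are taken from this PR description, and pandas is used since the PR replaces PySpark with it; the real schemas may differ), exam metadata can be re-attached to questions via `exam_id`:

```python
import pandas as pd

# exams_metadata.json stores exam-level data exactly once
# (exam_id value here is illustrative):
exams = pd.DataFrame([
    {"exam_id": "6_5840_spring_2025_exam_i", "course": "6.5840", "year": 2025},
])

# questions.jsonl stores one question per line, referencing its exam:
questions = pd.DataFrame([
    {"instance_id": "q1", "exam_id": "6_5840_spring_2025_exam_i"},
    {"instance_id": "q2", "exam_id": "6_5840_spring_2025_exam_i"},
])

# Re-attach exam metadata only when needed, instead of duplicating it on
# every question row; validate="m:1" guards against duplicated exam_ids.
merged = questions.merge(exams, on="exam_id", how="left", validate="m:1")
print(merged[["instance_id", "course", "year"]])
```

This is the main payoff of the split: exam-level fields live in one place, and a single join recovers the old denormalized view when needed.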

- split exam metadata from questions
- replace PySpark with pandas
- new 4-file output: results, detailed, summary, comparison
- track answered/unanswered/correct/incorrect
- consistent naming (exam_id, instance_id)
- extend README with guide on how to add a new exam
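The answered/unanswered/correct/incorrect tracking that feeds `summary.json` can be sketched as follows (a minimal illustration; the per-result fields here are assumptions, not the PR's exact schema):

```python
import json

# Hypothetical per-question results, as might appear in results.jsonl:
results = [
    {"instance_id": "q1", "answered": True, "correct": True},
    {"instance_id": "q2", "answered": True, "correct": False},
    {"instance_id": "q3", "answered": False, "correct": False},
]

# Aggregate the four tracked counters; an unanswered question is counted
# neither as correct nor as incorrect.
summary = {
    "total": len(results),
    "answered": sum(r["answered"] for r in results),
    "unanswered": sum(not r["answered"] for r in results),
    "correct": sum(r["answered"] and r["correct"] for r in results),
    "incorrect": sum(r["answered"] and not r["correct"] for r in results),
}
print(json.dumps(summary, indent=2))
```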

Signed-off-by: Tarek <[email protected]>
@tareknaser (Collaborator, Author)

Resolves #6

@xuafeng (Collaborator) left a comment


Approved with suggestions.
Thanks a lot, Tarek. This PR is very clear and well-organized; I left a few minor suggestions.

- `questions.jsonl`: Individual questions (one JSON object per line that links to an exam from `exams_metadata.json` via `exam_id`)

## How to extend the benchmark

Collaborator

Thanks a lot, Tarek. Everything is very clear and well-organized.
One small suggestion: could we add a few sentences to set the prerequisites? For example, assume we already have a course exam like https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf, and then base all the following steps on this example quiz.

@tareknaser (Collaborator, Author)

Updated

{
  "exams": [
    {
      "exam_id": "6_5840_distributed_system_engineering_spring_2025_exam_i",
@xuafeng (Collaborator), Nov 14, 2025

Should we add the university name to the id and name? There will be many exams from different universities; should we distinguish them? I am not sure.

@tareknaser (Collaborator, Author)

I think the combination of course ID, name, and semester should be unique in most cases, but we can always add it later as an optional field.
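A hypothetical helper illustrating this convention (`make_exam_id` is not part of the PR; it simply shows how a slug like the one in the snippet above could be derived from course ID, name, and semester):

```python
import re

def make_exam_id(course_id: str, course_name: str, term: str) -> str:
    """Build an exam_id slug from course ID, name, and semester.

    Lowercases the combined string and collapses every run of
    non-alphanumeric characters into a single underscore.
    """
    raw = f"{course_id} {course_name} {term}"
    return re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_")

print(make_exam_id("6.5840", "Distributed System Engineering", "Spring 2025 Exam I"))
# → 6_5840_distributed_system_engineering_spring_2025_exam_i
```

Since the course ID, name, and term are all baked into the slug, collisions would require two exams with identical values for all three, which supports treating a university field as an optional later addition.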

You can see the detailed information of each exam in the table below.
This creates a Python virtual environment and installs required packages

| Course | # of questions | Score (gpt-4.1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/median) |
Collaborator

I am wondering if we could have a separate md file to show the current measurement results; people may be interested in them.
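If such a results file were added, a minimal sketch of generating it (the row fields and scores below are purely illustrative, not the benchmark's actual schema or results):

```python
# Illustrative per-course result rows; in practice these would be read
# from summary.json / comparison.json.
rows = [
    {"course": "6.5840", "questions": 18, "gpt-4o": "14/18", "o3-mini": "15/18"},
]

# Render a Markdown table: header row, separator row, then one row per course.
headers = list(rows[0])
lines = [
    "| " + " | ".join(headers) + " |",
    "| " + " | ".join("---" for _ in headers) + " |",
]
for row in rows:
    lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")

markdown = "\n".join(lines)
print(markdown)  # could be written to e.g. RESULTS.md
```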

@xuafeng xuafeng merged commit 1577aca into main Nov 17, 2025
2 checks passed
@xuafeng xuafeng deleted the docs_course_exam_bench branch November 17, 2025 18:09
Couen pushed a commit to Couen/system-intelligence-benchmark that referenced this pull request Jan 22, 2026
…ourse_exam_bench

Course Exam Benchmark: Restructure Data Format
tareknaser pushed a commit that referenced this pull request Feb 5, 2026
Course Exam Benchmark: Restructure Data Format