Conversation

@tareknaser (Collaborator)

  • Data Restructuring
    • Before:
      • Multiple files (SystemTestPaper.jsonl, timestamped versions, .xlsx files, templates)
      • Each exam question repeated exam metadata (course, year, student stats)
      • Mixed use of id/instance_id and test_paper_name/exam_id
    • After:
      • Two-file structure to minimize duplication
        • exams_metadata.json: Exam-level data
        • questions.jsonl: Question-level data
      • Questions reference exams via exam_id; exam metadata is stored once in a separate file
  • Output Format (4 files instead of 2)
    • Before:
      • result.jsonl, avg_score.json
    • After:
      • results.jsonl: Minimal per-question results
      • results_detailed.jsonl: Full debugging info (prompts, explanations)
      • summary.json: Statistics with answered/unanswered/correct/incorrect tracking
      • comparison.json: LLM vs student performance comparison
  • Added tests, including data-format and output-format validation (also run in GitHub CI)
  • README: Added "How to extend the benchmark" section
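As a minimal sketch of the two-file structure described above (field names are taken from this PR description, and pandas is used since the PR replaces PySpark with it; the real schemas may differ), exam metadata can be re-attached to questions via `exam_id`:

```python
import pandas as pd

# exams_metadata.json stores exam-level data exactly once
# (exam_id value here is illustrative):
exams = pd.DataFrame([
    {"exam_id": "6_5840_spring_2025_exam_i", "course": "6.5840", "year": 2025},
])

# questions.jsonl stores one question per line, referencing its exam:
questions = pd.DataFrame([
    {"instance_id": "q1", "exam_id": "6_5840_spring_2025_exam_i"},
    {"instance_id": "q2", "exam_id": "6_5840_spring_2025_exam_i"},
])

# Re-attach exam metadata only when needed, instead of duplicating it on
# every question row; validate="m:1" guards against duplicated exam_ids.
merged = questions.merge(exams, on="exam_id", how="left", validate="m:1")
print(merged[["instance_id", "course", "year"]])
```

This is the main payoff of the split: exam-level fields live in one place, and a single join recovers the old denormalized view when needed.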

- split exam metadata from questions
- replace PySpark with pandas
- new 4-file output: results, detailed, summary, comparison
- track answered/unanswered/correct/incorrect
- consistent naming (exam_id, instance_id)
- extend README with guide on how to add a new exam
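The answered/unanswered/correct/incorrect tracking that feeds `summary.json` can be sketched as follows (a minimal illustration; the per-result fields here are assumptions, not the PR's exact schema):

```python
import json

# Hypothetical per-question results, as might appear in results.jsonl:
results = [
    {"instance_id": "q1", "answered": True, "correct": True},
    {"instance_id": "q2", "answered": True, "correct": False},
    {"instance_id": "q3", "answered": False, "correct": False},
]

# Aggregate the four tracked counters; an unanswered question is counted
# neither as correct nor as incorrect.
summary = {
    "total": len(results),
    "answered": sum(r["answered"] for r in results),
    "unanswered": sum(not r["answered"] for r in results),
    "correct": sum(r["answered"] and r["correct"] for r in results),
    "incorrect": sum(r["answered"] and not r["correct"] for r in results),
}
print(json.dumps(summary, indent=2))
```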

Signed-off-by: Tarek <[email protected]>
@tareknaser (Collaborator, Author)

Resolves #6

@xuafeng (Collaborator) left a comment


Approved with suggestions.
Thanks a lot, Tarek. This PR is very clear and well-organized; I left a few minor suggestions.

- `questions.jsonl`: Individual questions (one JSON object per line that links to an exam from `exams_metadata.json` via `exam_id`)

## How to extend the benchmark

Collaborator

Thanks a lot, Tarek. Everything is very clear and well-organized.
One small suggestion: could we add a few sentences to set the prerequisites? For example, assume we already have a course exam like https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf, and then base all the following steps on this example quiz.

@tareknaser (Collaborator, Author)

Updated

{
  "exams": [
    {
      "exam_id": "6_5840_distributed_system_engineering_spring_2025_exam_i",
@xuafeng (Collaborator), Nov 14, 2025

Should we add the university name to the id and name? There will be many exams from different universities; should we distinguish them? I am not sure.

@tareknaser (Collaborator, Author)

I think the combination of course ID, name, and semester should be unique in most cases, but we can always add it later as an optional field.
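A hypothetical helper illustrating this convention (`make_exam_id` is not part of the PR; it simply shows how a slug like the one in the snippet above could be derived from course ID, name, and semester):

```python
import re

def make_exam_id(course_id: str, course_name: str, term: str) -> str:
    """Build an exam_id slug from course ID, name, and semester.

    Lowercases the combined string and collapses every run of
    non-alphanumeric characters into a single underscore.
    """
    raw = f"{course_id} {course_name} {term}"
    return re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_")

print(make_exam_id("6.5840", "Distributed System Engineering", "Spring 2025 Exam I"))
# → 6_5840_distributed_system_engineering_spring_2025_exam_i
```

Since the course ID, name, and term are all baked into the slug, collisions would require two exams with identical values for all three, which supports treating a university field as an optional later addition.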

You can see the detailed information of each exam in the table below.
This creates a Python virtual environment and installs required packages

| Course | # of questions | Score (gpt-4.1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/median) |
Collaborator

I am wondering if we could have a separate md file to show the current measurement results; people may be interested in them.
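If such a results file were added, a minimal sketch of generating it (the row fields and scores below are purely illustrative, not the benchmark's actual schema or results):

```python
# Illustrative per-course result rows; in practice these would be read
# from summary.json / comparison.json.
rows = [
    {"course": "6.5840", "questions": 18, "gpt-4o": "14/18", "o3-mini": "15/18"},
]

# Render a Markdown table: header row, separator row, then one row per course.
headers = list(rows[0])
lines = [
    "| " + " | ".join(headers) + " |",
    "| " + " | ".join("---" for _ in headers) + " |",
]
for row in rows:
    lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")

markdown = "\n".join(lines)
print(markdown)  # could be written to e.g. RESULTS.md
```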

@xuafeng xuafeng merged commit 1577aca into main Nov 17, 2025
2 checks passed
@xuafeng xuafeng deleted the docs_course_exam_bench branch November 17, 2025 18:09
Couen pushed a commit to Couen/system-intelligence-benchmark that referenced this pull request Jan 22, 2026
…ourse_exam_bench

Course Exam Benchmark: Restructure Data Format
tareknaser pushed a commit that referenced this pull request Feb 5, 2026
Course Exam Benchmark: Restructure Data Format