Course Exam Benchmark: Restructure Data Format #11
Conversation
- split exam metadata from questions
- replace PySpark with pandas
- new 4-file output: results, detailed, summary, comparison
- track answered/unanswered/correct/incorrect
- consistent naming (exam_id, instance_id)
- extend README with guide on how to add a new exam

Signed-off-by: Tarek <[email protected]>
Resolves #6
Signed-off-by: Tarek <[email protected]>
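As a quick illustration of the split input layout described above (not part of the PR text): a minimal pandas sketch. The file names `exams_metadata.json` and `questions.jsonl` and the `exam_id`/`instance_id` fields come from the PR; the top-level `exams` key is taken from the snippet quoted later in this thread, and everything else is an assumption.

```python
import json
import pandas as pd

# exams_metadata.json: one record per exam; questions.jsonl: one question per
# line, linked to its exam through exam_id (as described in the PR summary).
with open("exams_metadata.json") as f:
    exams = pd.DataFrame(json.load(f)["exams"])

questions = pd.read_json("questions.jsonl", lines=True)

# Join question-level rows onto exam-level metadata via exam_id.
merged = questions.merge(exams, on="exam_id", how="left")
print(merged.head())
```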
xuafeng left a comment:
Approved with suggestions.
Thanks a lot, Tarek. This PR is very clear and well-organized. I left a few minor suggestions. Thanks.
> - `questions.jsonl`: Individual questions (one JSON object per line that links to an exam from `exams_metadata.json` via `exam_id`)
>
> ## How to extend the benchmark
Thanks a lot, Tarek. All of this is very clear and well-organized.
One small suggestion: can we add a few sentences to set the prerequisites? For example, assume we already have one course exam like https://pdos.csail.mit.edu/6.824/quizzes/q25-2-sol.pdf, and then all the following steps are based on this example quiz.
Updated
```json
{
  "exams": [
    {
      "exam_id": "6_5840_distributed_system_engineering_spring_2025_exam_i",
```
Should we add the university name to the id and name? I see there will be many exams from different universities. Should we distinguish them? I am not sure.
I think the combination of course ID, name, and semester should be unique in most cases, but we can always add the university later as an optional field.
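For illustration only (this helper does not exist in the repo), here is one way such an id could be derived from course id, name, and semester, with the university as the optional prefix discussed above:

```python
import re

def make_exam_id(course_id, course_name, semester, university=None):
    """Hypothetical helper: slugify course id + name + semester into an exam_id.
    The optional university prefix is the possible later extension discussed above."""
    parts = [p for p in (university, course_id, course_name, semester) if p]
    slug = "_".join(parts).lower()
    # Collapse every non-alphanumeric run into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", slug).strip("_")

# make_exam_id("6.5840", "Distributed System Engineering", "Spring 2025 Exam I")
# -> "6_5840_distributed_system_engineering_spring_2025_exam_i"
```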
> You can see the detailed information of each exam in the table below.
>
> This creates a Python virtual environment and installs required packages.
>
> | Course | # of questions | Score (gpt-4.1) (score/total) | Score (gpt-4o) (score/total) | Score (o3-mini) (score/total) | Student Score (max/average/median) |
I am wondering if we could have a separate md file to show the current measurement results. People may be interested in the results.
Signed-off-by: Tarek <[email protected]>
…ourse_exam_bench Course Exam Benchmark: Restructure Data Format
- Old system (`TestPaper.jsonl`, timestamped versions, `.xlsx` files, templates) replaced
- Consistent naming: `id`/`instance_id` and `test_paper_name`/`exam_id`
- Input split into two files:
  - `exams_metadata.json`: Exam-level data
  - `questions.jsonl`: Question-level data, linked via `exam_id`. Exam metadata stored once in a separate file
- Old outputs (`result.jsonl`, `avg_score.json`) replaced with:
  - `results.jsonl`: Minimal per-question results
  - `results_detailed.jsonl`: Full debugging info (prompts, explanations)
  - `summary.json`: Statistics with answered/unanswered/correct/incorrect tracking
  - `comparison.json`: LLM vs student performance comparison
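To make the answered/unanswered/correct/incorrect tracking concrete, a rough sketch of how `summary.json` could be assembled from `results.jsonl`. Only the file names and the four counters come from the PR; the column names are assumptions for this sketch.

```python
import json
import pandas as pd

results = pd.read_json("results.jsonl", lines=True)

# "answered" and "correct" as boolean per-question columns are an assumption.
answered = results["answered"].astype(bool)
correct = results["correct"].astype(bool)

summary = {
    "total_questions": int(len(results)),
    "answered": int(answered.sum()),
    "unanswered": int((~answered).sum()),
    "correct": int(correct.sum()),
    "incorrect": int((answered & ~correct).sum()),
}

with open("summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```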