Description
[Dataset] MathDial Corpus (math word problem tutoring dialogs)
Dataset name
MathDial (ConvoKit format)
Brief description
A collection of human–tutor dialogues focused on solving math word problems.
This dataset is adapted from the original MathDial corpus and converted into the ConvoKit format.
It contains both the original 4-way tutor intents and fine-grained 11-way tutor intents (for teacher turns).
Dataset details
Speaker-level
Speakers are either teachers or students. Some conversations include named students (e.g., Cody, Mariana), while others use the generic “Student.”
Each speaker is represented consistently within a conversation.
- id: <Name>_<conversation_id> (e.g., Teacher_conv12, DeAndre_conv284)
- meta.role: normalized role, either "teacher" or "student"
- meta.role_raw: the original value from the TSV (e.g., "Teacher", "Student", "DeAndre")
- meta.conversation_id: conversation this speaker belongs to (e.g., conv284)
- meta.split: dataset split (train, val, test)
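The ID scheme and role normalization described above can be sketched in plain Python. This is an illustration under assumptions, not the actual conversion script: the helper names `normalize_role` and `make_speaker_id` are hypothetical, and the rule that every raw value other than "Teacher" maps to "student" is inferred from the examples given.

```python
def normalize_role(role_raw: str) -> str:
    """Map a raw TSV speaker label to a normalized role.

    Assumption: only the label "Teacher" (case-insensitive) denotes a teacher;
    "Student" and named students such as "DeAndre" are treated as students.
    """
    return "teacher" if role_raw.strip().lower() == "teacher" else "student"


def make_speaker_id(role_raw: str, conversation_id: str) -> str:
    """Build a speaker ID of the form <Name>_<conversation_id>."""
    return f"{role_raw.strip()}_{conversation_id}"


print(make_speaker_id("DeAndre", "conv284"))  # DeAndre_conv284
print(normalize_role("DeAndre"))              # student
```

Because the conversation ID is part of the speaker ID, the same name appearing in two conversations yields two distinct speakers.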
Utterance-level
Each conversational turn is an utterance.
- id: global utterance identifier
- speaker: speaker who produced the utterance
- conversation_id: conversation identifier (e.g., conv0, conv1, …)
- reply_to: previous utterance in the thread (None if start)
- timestamp: not available
- text: textual content of the utterance
Metadata for each utterance includes:
- intent_4: coarse 4-way intent label
- intent_11: fine 11-way intent label (teacher turns only)
- qid: problem identifier
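The utterance metadata above lends itself to simple label counts. The sketch below is one possible way to tally an intent field; `count_intents` is a hypothetical helper, and it assumes each utterance exposes a dict-like `meta` and that turns without a label (such as student turns for intent_11) hold None or an empty value. The stand-in utterances are made up for illustration and only borrow label names from the statistics in this post.

```python
from collections import Counter
from types import SimpleNamespace


def count_intents(utterances, key="intent_11"):
    """Count the values of one intent label across utterances."""
    counts = Counter()
    for utt in utterances:
        label = utt.meta.get(key)
        if label:  # skip turns that carry no label for this key
            counts[label] += 1
    return counts


# Made-up stand-ins for utterances, not real corpus content.
demo_utterances = [
    SimpleNamespace(meta={"intent_11": "Seek Strategy"}),
    SimpleNamespace(meta={"intent_11": None}),  # e.g. a student turn
    SimpleNamespace(meta={"intent_11": "Revealing Answer"}),
    SimpleNamespace(meta={"intent_11": "Seek Strategy"}),
]

demo_counts = count_intents(demo_utterances)
print(demo_counts.most_common())
```

With a loaded corpus, the same function should accept `corpus.iter_utterances()` in place of the stand-in list, provided the metadata behaves like a dictionary as assumed here.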
Conversation-level
Metadata for each conversation includes:
- conversation_id: global string identifier (conv0, conv1, …)
- split: which split (train, val, test)
- qid: problem ID
- scenario: description of the problem context
- question: math problem text
- ground_truth: correct solution
- student_incorrect_solution: incorrect solution given by the student
- student_profile: information about the student (if available)
- teacher_described_confusion, self-correctness, self-typical-confusion, self-typical-interactions: additional annotation fields
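Conversation-level fields such as split and qid can be used to partition the corpus. The following is a minimal sketch, assuming each conversation object has an `id` attribute and a dict-like `meta`; `group_by_meta` is a hypothetical helper, and the stand-in conversations and their split values are made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_meta(conversations, key):
    """Group conversation IDs by the value of one metadata field."""
    groups = defaultdict(list)
    for convo in conversations:
        groups[convo.meta.get(key)].append(convo.id)
    return dict(groups)


# Made-up stand-ins for conversations, not real corpus content.
demo_conversations = [
    SimpleNamespace(id="conv0", meta={"split": "train"}),
    SimpleNamespace(id="conv1", meta={"split": "test"}),
    SimpleNamespace(id="conv2", meta={"split": "train"}),
]

by_split = group_by_meta(demo_conversations, "split")
print(by_split)
```

With a loaded corpus, `group_by_meta(corpus.iter_conversations(), "split")` would be the analogous call; grouping by "qid" instead collects all dialogues about the same problem.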
Corpus-level
- name: mathdial
- source_repo: https://github.com/Kpetyxova/autoTree/tree/main/mathdial
- description: MathDial dialogs converted to ConvoKit with both 4-way and 11-way tutor intents. IDs are namespaced by split.
- num_utterances: 8,466
- num_conversations: 521
- num_speakers: 1,043
Licensing information
The MathDial dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Citation
Petukhova, Kseniia, and Ekaterina Kochmar. "Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation." arXiv preprint arXiv:2506.07626, 2025.
Contact
Jadon Geathers — [email protected]
Access
Here is a link to the zipped corpus:
mathdial.zip
Statistics
- Conversations: 521
- Utterances: 8,466
- Speakers: 1,043
Top fine-grained (11-way) tutor intents:
| Intent | Count |
|---|---|
| Revealing Strategy | 1141 |
| Revealing Answer | 895 |
| Guiding Student Focus | 687 |
| Seek Strategy | 658 |
| Asking for Explanation | 653 |
| Seeking Self Correction | 643 |
| Seeking World Knowledge | 257 |
| Greeting/Farewell | 217 |
| Recall Relevant Information | 93 |
| Perturbing the Question | 89 |
| General Inquiry | 40 |
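To read the table as proportions, the counts can be normalized by their sum. This small sketch uses only the numbers listed above; whether their sum equals the total number of labeled teacher turns is not stated here, so the shares are relative to the listed intents only.

```python
# Counts copied from the table above.
intent_counts = {
    "Revealing Strategy": 1141,
    "Revealing Answer": 895,
    "Guiding Student Focus": 687,
    "Seek Strategy": 658,
    "Asking for Explanation": 653,
    "Seeking Self Correction": 643,
    "Seeking World Knowledge": 257,
    "Greeting/Farewell": 217,
    "Recall Relevant Information": 93,
    "Perturbing the Question": 89,
    "General Inquiry": 40,
}

total = sum(intent_counts.values())  # 5373 for the rows listed

# Share of each intent among the listed rows.
shares = {intent: count / total for intent, count in intent_counts.items()}

for intent, share in shares.items():
    print(f"{intent:<28}{intent_counts[intent]:>5}  {share:6.1%}")
```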
Example usage
from convokit import Corpus
corpus = Corpus("PATH_TO/mathdial")
corpus.print_summary_stats()