[Dataset] MathDial Dataset #313

@jadongeathers

[Dataset] MathDial Corpus (math word problem tutoring dialogs)

Dataset name

MathDial (ConvoKit format)

Brief description

A collection of teacher–student tutoring dialogues about solving math word problems.
The dataset is adapted from the original MathDial corpus and converted to the ConvoKit format.
It contains both the original 4-way tutor intents and fine-grained 11-way tutor intents; the 11-way labels are annotated on teacher turns only.


Dataset details

Speaker-level

Speakers are either teachers or students. Some conversations include named students (e.g., Cody, Mariana), while others use the generic “Student.”
Speaker IDs are scoped to a single conversation, so a speaker keeps the same ID across all of their turns within that conversation.

  • id: <Name>_<conversation_id> (e.g., Teacher_conv12, DeAndre_conv284)
  • meta.role: normalized role — "teacher" or "student"
  • meta.role_raw: the original value from the TSV (e.g., "Teacher", "Student", "DeAndre")
  • meta.conversation_id: conversation this speaker belongs to (e.g., conv284)
  • meta.split: dataset split (train, val, test)
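
The ID and role conventions above can be sketched in a few lines. This is only an illustration, not the conversion script used for the corpus: the helper names `normalize_role` and `make_speaker_id` are hypothetical, and the rule "anything other than Teacher is a student" is an assumption inferred from the examples given.

```python
def normalize_role(role_raw: str) -> str:
    """Map a raw TSV speaker value to "teacher" or "student".

    Assumption: only the literal "Teacher" denotes the teacher; every other
    value ("Student", "DeAndre", ...) is treated as a student.
    """
    return "teacher" if role_raw.strip().lower() == "teacher" else "student"


def make_speaker_id(role_raw: str, conversation_id: str) -> str:
    """Build a conversation-scoped speaker ID of the form <Name>_<conversation_id>."""
    return f"{role_raw.strip()}_{conversation_id}"


# The two example IDs from the field list above.
for raw, conv in [("Teacher", "conv12"), ("DeAndre", "conv284")]:
    print(make_speaker_id(raw, conv), normalize_role(raw))
```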

Utterance-level

Each conversational turn is an utterance.

  • id: global utterance identifier
  • speaker: speaker who produced the utterance
  • conversation_id: conversation identifier (e.g., conv0, conv1, …)
  • reply_to: previous utterance in the thread (None if start)
  • timestamp: not available
  • text: textual content of the utterance

Metadata for each utterance includes:

  • intent_4: coarse 4-way intent label
  • intent_11: fine 11-way intent label (teacher turns only)
  • qid: problem identifier
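
As a rough sketch of how these utterance fields might be consumed, the helper below tallies intent labels over any iterable of objects that expose a `meta` mapping. The function name `count_intents` is hypothetical, the stand-in utterances are invented for illustration, and it assumes that unlabeled turns (for example student turns without `intent_11`) have the key missing or set to an empty value.

```python
from collections import Counter
from types import SimpleNamespace


def count_intents(utterances, key="intent_11"):
    """Count the label values stored under `key` in each utterance's metadata."""
    counts = Counter()
    for utt in utterances:
        label = utt.meta.get(key)
        if label:  # skip turns with no label (assumed: student turns)
            counts[label] += 1
    return counts


# Invented stand-in utterances, used only to show the call shape.
demo_utterances = [
    SimpleNamespace(meta={"intent_11": "Seek Strategy", "qid": "q1"}),
    SimpleNamespace(meta={"intent_11": None, "qid": "q1"}),
    SimpleNamespace(meta={"intent_11": "Seek Strategy", "qid": "q1"}),
    SimpleNamespace(meta={"intent_11": "Revealing Answer", "qid": "q1"}),
]
demo_counts = count_intents(demo_utterances)
print(demo_counts.most_common())
```

With a loaded ConvoKit corpus, the same helper should accept `corpus.iter_utterances()` in place of the stand-in list.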

Conversation-level

Metadata for each conversation includes:

  • conversation_id: global string identifier (conv0, conv1, …)
  • split: which split (train, val, test)
  • qid: problem ID
  • scenario: description of the problem context
  • question: math problem text
  • ground_truth: correct solution
  • student_incorrect_solution: incorrect solution given by the student
  • student_profile: information about the student (if available)
  • teacher_described_confusion, self-correctness, self-typical-confusion, self-typical-interactions: additional annotation fields
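
One possible way to use the conversation-level fields is to pull the problem text and reference solution for a single split. The sketch below is duck-typed over objects with `id` and `meta` attributes; the helper name `problems_in_split` and the stand-in conversations are hypothetical, not part of the corpus or of ConvoKit.

```python
from types import SimpleNamespace


def problems_in_split(conversations, split):
    """Return (conversation id, question, ground truth) tuples for one split."""
    rows = []
    for convo in conversations:
        if convo.meta.get("split") == split:
            rows.append(
                (convo.id, convo.meta.get("question"), convo.meta.get("ground_truth"))
            )
    return rows


# Invented stand-in conversations, used only to show the call shape.
demo_conversations = [
    SimpleNamespace(id="conv0", meta={"split": "train", "question": "Q0", "ground_truth": "A0"}),
    SimpleNamespace(id="conv1", meta={"split": "test", "question": "Q1", "ground_truth": "A1"}),
]
test_rows = problems_in_split(demo_conversations, "test")
print(test_rows)
```

With a loaded corpus, `corpus.iter_conversations()` can be passed in place of the stand-in list.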

Corpus-level


Licensing information

The MathDial dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).


Citation

Petukhova, Kseniia, and Ekaterina Kochmar. "Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation." arXiv preprint arXiv:2506.07626, 2025.


Contact

Jadon Geathers — [email protected]


Access

Here is a link to the zipped corpus:
mathdial.zip


Statistics

  • Conversations: 521
  • Utterances: 8,466
  • Speakers: 1,043

Top fine-grained (11-way) tutor intents:

Intent                        Count
Revealing Strategy            1141
Revealing Answer              895
Guiding Student Focus         687
Seek Strategy                 658
Asking for Explanation        653
Seeking Self Correction       643
Seeking World Knowledge       257
Greeting/Farewell             217
Recall Relevant Information   93
Perturbing the Question       89
General Inquiry               40
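
To read the counts above as proportions, a short calculation is enough. The numbers are copied from the list of intents; the variable names are illustrative only.

```python
# Counts copied from the 11-way intent list above.
intent_counts = {
    "Revealing Strategy": 1141,
    "Revealing Answer": 895,
    "Guiding Student Focus": 687,
    "Seek Strategy": 658,
    "Asking for Explanation": 653,
    "Seeking Self Correction": 643,
    "Seeking World Knowledge": 257,
    "Greeting/Farewell": 217,
    "Recall Relevant Information": 93,
    "Perturbing the Question": 89,
    "General Inquiry": 40,
}

total = sum(intent_counts.values())  # number of turns with an 11-way label
shares = {name: count / total for name, count in intent_counts.items()}

for name, share in shares.items():
    print(f"{name:<28}{share:6.1%}")
```

The listed counts sum to 5,373 labeled turns, of which Revealing Strategy accounts for roughly 21%.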

Example usage

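from convokit import Corpus

# Load the unzipped corpus directory (replace PATH_TO with your local path)
corpus = Corpus(filename="PATH_TO/mathdial")
# Print the number of speakers, utterances, and conversations
corpus.print_summary_stats()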
