[Dataset] MathDial Dataset

[Dataset] MathDial Corpus (math word problem tutoring dialogs)

## Dataset name
**MathDial (ConvoKit format)**

## Brief description
A collection of tutor–student dialogues focused on solving math word problems.  
This dataset is adapted from the original **MathDial** corpus and converted into the [ConvoKit](https://convokit.cornell.edu) format.  
It contains both the original **4-way tutor intents** and fine-grained **11-way tutor intents** (for teacher turns).

---

## Dataset details

### Speaker-level
Speakers are either teachers or students. Some conversations include **named students** (e.g., *Cody*, *Mariana*), while others use the generic “Student.”  
Each speaker is represented consistently within a conversation.

- **id**: `<Name>_<conversation_id>` (e.g., `Teacher_conv12`, `DeAndre_conv284`)  
- **meta.role**: normalized role — `"teacher"` or `"student"`  
- **meta.role_raw**: the original value from the TSV (e.g., `"Teacher"`, `"Student"`, `"DeAndre"`)  
- **meta.conversation_id**: conversation this speaker belongs to (e.g., `conv284`)  
- **meta.split**: dataset split (`train`, `val`, `test`)  
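
The speaker fields above can be illustrated with a minimal sketch. The helper `make_speaker_record` is hypothetical, the `train` split in the demo call is chosen only for illustration, and the rule that every `role_raw` other than `"Teacher"` normalizes to `"student"` is an assumption inferred from the examples, not the confirmed logic of the conversion script.

```python
def make_speaker_record(role_raw: str, conversation_id: str, split: str) -> dict:
    """Build a speaker record shaped like the fields listed above (hypothetical helper)."""
    # Assumption: only the literal "Teacher" is a teacher; named students
    # (e.g., "DeAndre") and the generic "Student" both normalize to "student".
    role = "teacher" if role_raw.strip().lower() == "teacher" else "student"
    return {
        "id": f"{role_raw}_{conversation_id}",
        "meta": {
            "role": role,
            "role_raw": role_raw,
            "conversation_id": conversation_id,
            "split": split,
        },
    }


# Illustrative call; the split value here is made up.
speaker = make_speaker_record("DeAndre", "conv284", "train")
print(speaker["id"], speaker["meta"]["role"])
```

Because the conversation ID is part of the speaker ID, the same name appearing in two conversations yields two distinct speakers.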

### Utterance-level
Each conversational turn is an utterance.

- **id**: global utterance identifier  
- **speaker**: speaker who produced the utterance  
- **conversation_id**: conversation identifier (e.g., `conv0`, `conv1`, …)  
- **reply_to**: ID of the previous utterance in the conversation (`None` for the first utterance)  
- **timestamp**: not available  
- **text**: textual content of the utterance  

*Metadata for each utterance includes:*  
- `intent_4`: coarse 4-way intent label  
- `intent_11`: fine 11-way intent label (teacher turns only)  
- `qid`: problem identifier  
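
As a rough sketch of how the intent metadata might be tallied, the hypothetical helper `count_intents` below works on any iterable of objects that expose a `meta` mapping. It assumes that student turns either lack `intent_11` or store an empty value; how the corpus actually encodes missing labels is not stated here. The stand-in utterances are for illustration only and are not taken from the corpus.

```python
from collections import Counter
from types import SimpleNamespace


def count_intents(utterances, key="intent_11"):
    """Tally non-empty intent labels found under `key` in each utterance's meta."""
    counts = Counter()
    for utt in utterances:
        label = utt.meta.get(key)
        # Assumption: turns without a label store None, "", or omit the key.
        if label:
            counts[label] += 1
    return counts


# Stand-in utterances (illustrative, not real corpus data).
demo = [
    SimpleNamespace(meta={"intent_11": "Seek Strategy"}),
    SimpleNamespace(meta={"intent_11": None}),
    SimpleNamespace(meta={"intent_11": "Seek Strategy"}),
    SimpleNamespace(meta={"intent_11": "Revealing Answer"}),
]
demo_counts = count_intents(demo)
print(demo_counts.most_common())
```

With a loaded ConvoKit corpus, `count_intents(corpus.iter_utterances())` should apply the same tally, and passing `key="intent_4"` would count the coarse labels instead.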

### Conversation-level
Metadata for each conversation includes:

- **conversation_id**: global string identifier (`conv0`, `conv1`, …)  
- **split**: which split (`train`, `val`, `test`)  
- **qid**: problem ID  
- **scenario**: description of the problem context  
- **question**: math problem text  
- **ground_truth**: correct solution  
- **student_incorrect_solution**: incorrect solution given by the student  
- **student_profile**: information about the student (if available)  
- **teacher_described_confusion**, **self-correctness**, **self-typical-confusion**, **self-typical-interactions**: additional annotation fields  
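
Conversation-level metadata can be used to separate the data by split. The sketch below assumes each conversation object exposes `id` and a `meta` mapping with a `split` key; `group_by_split` is a hypothetical helper, and the stand-in conversations and their split assignments are made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_split(conversations):
    """Map each split name to the list of conversation IDs assigned to it."""
    groups = defaultdict(list)
    for convo in conversations:
        groups[convo.meta["split"]].append(convo.id)
    return dict(groups)


# Stand-in conversations (illustrative, not real corpus data).
demo_convos = [
    SimpleNamespace(id="conv0", meta={"split": "train"}),
    SimpleNamespace(id="conv1", meta={"split": "test"}),
    SimpleNamespace(id="conv2", meta={"split": "train"}),
]
by_split = group_by_split(demo_convos)
print(by_split)
```

With a loaded ConvoKit corpus, `group_by_split(corpus.iter_conversations())` should produce the same kind of mapping.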

### Corpus-level
- **name**: `mathdial`  
- **source_repo**: https://github.com/Kpetyxova/autoTree/tree/main/mathdial  
- **description**: MathDial dialogs converted to ConvoKit with both 4-way and 11-way tutor intents. IDs are namespaced by split.  
- **num_utterances**: 8,466  
- **num_conversations**: 521  
- **num_speakers**: 1,043  

---

## Licensing information
The MathDial dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

---

## Citation
Petukhova, Kseniia, and Ekaterina Kochmar. "Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation." arXiv preprint arXiv:2506.07626, 2025.

---

## Contact
Jadon Geathers — jag569@cornell.edu

---

## Access
The zipped corpus is available here: [mathdial.zip](https://github.com/user-attachments/files/22305062/mathdial.zip)

---

## Statistics
- **Conversations:** 521  
- **Utterances:** 8,466  
- **Speakers:** 1,043  

Counts of the fine-grained (11-way) tutor intents:

| Intent                     | Count |
|-----------------------------|-------|
| Revealing Strategy          | 1141  |
| Revealing Answer            | 895   |
| Guiding Student Focus       | 687   |
| Seek Strategy               | 658   |
| Asking for Explanation      | 653   |
| Seeking Self Correction     | 643   |
| Seeking World Knowledge     | 257   |
| Greeting/Farewell           | 217   |
| Recall Relevant Information | 93    |
| Perturbing the Question     | 89    |
| General Inquiry             | 40    |
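
The relative share of each intent follows directly from the table. A short sketch of the arithmetic, with the counts copied from the table above:

```python
# Counts copied from the 11-way intent table.
intent_counts = {
    "Revealing Strategy": 1141,
    "Revealing Answer": 895,
    "Guiding Student Focus": 687,
    "Seek Strategy": 658,
    "Asking for Explanation": 653,
    "Seeking Self Correction": 643,
    "Seeking World Knowledge": 257,
    "Greeting/Farewell": 217,
    "Recall Relevant Information": 93,
    "Perturbing the Question": 89,
    "General Inquiry": 40,
}

# Total number of labelled turns in the table.
total = sum(intent_counts.values())

# Percentage share of each intent among the labelled turns.
shares = {intent: 100 * n / total for intent, n in intent_counts.items()}

for intent, pct in shares.items():
    print(f"{intent:<28} {pct:5.1f}%")
```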

---

## Example usage
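```python
from convokit import Corpus

# Load the unzipped corpus directory; replace PATH_TO with its parent folder.
corpus = Corpus(filename="PATH_TO/mathdial")

# Print the number of speakers, utterances, and conversations.
corpus.print_summary_stats()
```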
