Ubuntu Chat Logs Misalignment Corpus
Each conversation pairs two speakers from the Ubuntu chat logs, with one speaker helping the other solve a technical problem. Human-annotated friction points are included, along with friction points identified by GPT-4o, GPT-4o mini, Llama 70B, and Llama 8B.
Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370 (2025)
Dataset Details
Speaker-level information
Speakers in this dataset troubleshoot problems in pairs: usually one person has a problem and the other helps them. Role A denotes the person seeking assistance, and role B the person assisting. Because a speaker can take part in multiple conversations, we track the following metadata:
- role_A_count: number of conversations where the speaker served in role A
- role_B_count: number of conversations where the speaker served in role B
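As a minimal sketch of how these two counts relate to the conversation-level `role_A` and `role_B` fields, the tally below recomputes them from a plain dictionary of conversation metadata. The function name `count_roles` and the toy speaker ids are hypothetical and are not part of the corpus or of ConvoKit.

```python
from collections import Counter


def count_roles(conversations):
    """Tally how often each speaker appears in role A and in role B.

    `conversations` maps a conversation id to its metadata dict, which is
    assumed to carry the `role_A` and `role_B` speaker ids.
    """
    role_a = Counter(meta["role_A"] for meta in conversations.values())
    role_b = Counter(meta["role_B"] for meta in conversations.values())
    speakers = set(role_a) | set(role_b)
    # Counter returns 0 for speakers who never served in a given role.
    return {
        s: {"role_A_count": role_a[s], "role_B_count": role_b[s]}
        for s in speakers
    }


# Hypothetical toy data, not taken from the corpus.
toy_conversations = {
    "c1": {"role_A": "alice", "role_B": "bob"},
    "c2": {"role_A": "carol", "role_B": "bob"},
    "c3": {"role_A": "bob", "role_B": "alice"},
}
counts = count_roles(toy_conversations)
print(counts["bob"])
```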
Utterance-level Information
The following data is provided:
- id: unique id of the utterance
- speaker: the speaker who authored the utterance
- conversation_id: unique id of the conversation
- reply_to: index of the utterance to which this is a reply (None if the utterance is not a reply)
- timestamp: the index of the utterance in the conversation
- text: textual content of the utterance
Metadata for utterances include:
- time_elapsed: number of minutes elapsed since the start of the conversation
- gpt_explanation: an explanation of the utterance, generated by ChatGPT
- conversational_friction: conversational friction scores, generated by the original authors of the paper
- explanation: human-generated explanation of the utterance
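The utterance fields can be used to put a conversation back in order and to trace reply threads. The sketch below works on plain dictionaries and assumes that `reply_to` holds the `id` of the parent utterance and that `timestamp` is the position of the utterance within its conversation; the helper names and the toy utterances are hypothetical.

```python
def in_order(utterances, conversation_id):
    """Return one conversation's utterances sorted by their `timestamp` index."""
    picked = [u for u in utterances if u["conversation_id"] == conversation_id]
    return sorted(picked, key=lambda u: u["timestamp"])


def reply_chain(utterances, utt_id):
    """Follow `reply_to` links from `utt_id` back to the start of its thread."""
    by_id = {u["id"]: u for u in utterances}
    chain, seen = [], set()
    current = utt_id
    # Stop at a missing parent, a None parent, or a cycle.
    while current in by_id and current not in seen:
        seen.add(current)
        chain.append(current)
        current = by_id[current]["reply_to"]
    return chain


# Hypothetical toy utterances, not taken from the corpus.
toy_utterances = [
    {"id": "u2", "speaker": "bob", "conversation_id": "c1",
     "reply_to": "u1", "timestamp": 1, "text": "Which release are you on?"},
    {"id": "u1", "speaker": "alice", "conversation_id": "c1",
     "reply_to": None, "timestamp": 0, "text": "My wifi stopped working."},
    {"id": "u3", "speaker": "alice", "conversation_id": "c1",
     "reply_to": "u2", "timestamp": 2, "text": "22.04"},
]
ordered_ids = [u["id"] for u in in_order(toy_utterances, "c1")]
chain = reply_chain(toy_utterances, "u3")
print(ordered_ids, chain)
```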
Conversation-level Information
For each conversation we provide:
- id: a unique index of the conversation
Metadata for conversations include:
- batch: the batch into which the conversation is sorted
- duration: number of minutes elapsed since the start of the conversation
- role_A: speaker id for the one serving in role A for this conversation
- role_B: speaker id for the one serving in role B for this conversation
- ending: type of ending the conversation had (natural end, abrupt, or ran out of time)
- conversational_success: success of conversation in resolving question from role A speaker (success, some progress, or no progress)
For the human annotators and for each model, the following metadata is provided:
- conversational_friction_present_[model]: whether friction is detected anywhere in the conversation by [model]
- friction_count_[model]: number of instances of conversational friction detected by [model]
- friction_index_list_[model]: list of instances of conversational friction within this conversation detected by [model]
- explanation_list_[model]: list of explanations for each friction instance generated by [model]
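One way to work with the per-annotator fields is to gather them by suffix and compare the flagged indices of two annotators. The sketch below is illustrative only: the suffixes `human` and `gpt4o`, the helper names, and the toy values are assumptions, so check the actual key names in the corpus before relying on them.

```python
def friction_summary(meta, model):
    """Collect the four friction fields stored under one `[model]` suffix."""
    return {
        "present": meta[f"conversational_friction_present_{model}"],
        "count": meta[f"friction_count_{model}"],
        "indices": meta[f"friction_index_list_{model}"],
        "explanations": meta[f"explanation_list_{model}"],
    }


def index_agreement(meta, model_a, model_b):
    """Jaccard overlap between the friction indices flagged by two annotators."""
    a = set(meta[f"friction_index_list_{model_a}"])
    b = set(meta[f"friction_index_list_{model_b}"])
    if not a and not b:
        return 1.0  # neither flagged anything, so they agree fully
    return len(a & b) / len(a | b)


# Hypothetical conversation metadata; suffixes and values are made up.
toy_meta = {
    "conversational_friction_present_human": True,
    "friction_count_human": 2,
    "friction_index_list_human": [3, 7],
    "explanation_list_human": ["helper misread the release", "wrong package"],
    "conversational_friction_present_gpt4o": True,
    "friction_count_gpt4o": 3,
    "friction_index_list_gpt4o": [3, 9, 12],
    "explanation_list_gpt4o": ["ambiguous question", "topic shift", "no reply"],
}
human = friction_summary(toy_meta, "human")
agreement = index_agreement(toy_meta, "human", "gpt4o")
print(human["count"], agreement)
```

Jaccard overlap is only one possible agreement measure and is used here for simplicity; it is not necessarily the measure used by the original authors.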
Basic Stats: ubuntu-chat-logs
- Number of utterances: 7950
- Number of conversations: 200
- Number of speakers: 361
Contact
The ConvoKit-formatted corpus was created by Axel Bax ([email protected]) from the dataset created by Sarkar et al.
Corresponding Author: Rupak Sarkar ([email protected])
Data Access
Find the zipped ConvoKit-formatted corpus here: ubuntu-chat-logs.zip
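ConvoKit itself is the usual route for loading the corpus. For a quick look without it, the standard-library sketch below reads utterance records straight from the archive. It assumes the zip contains a file named `utterances.jsonl` with one JSON object per line, as ConvoKit corpora typically do; the function name `read_utterances` and the in-memory demo archive are hypothetical.

```python
import io
import json
import zipfile


def read_utterances(zip_source):
    """Parse each line of the archive's utterances.jsonl into a dict.

    `zip_source` may be a path to the downloaded zip or a binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        members = [n for n in archive.namelist() if n.endswith("utterances.jsonl")]
        if not members:
            raise FileNotFoundError("no utterances.jsonl inside the archive")
        with archive.open(members[0]) as handle:
            # Lines arrive as bytes; json.loads accepts bytes directly.
            return [json.loads(line) for line in handle if line.strip()]


# Demo on a tiny in-memory archive so the sketch runs without the real file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "ubuntu-chat-logs/utterances.jsonl",
        '{"id": "u1", "text": "hello"}\n{"id": "u2", "text": "hi"}\n',
    )
demo.seek(0)
records = read_utterances(demo)
print(len(records))
```

With the real download, the call would be `read_utterances("ubuntu-chat-logs.zip")`.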