[Dataset] Ubuntu Chat Logs #309

@xablexa

Ubuntu Chat Logs Misalignment Corpus

Each conversation pairs two speakers from the Ubuntu chat logs, one of whom is helping the other solve a problem. Human-annotated friction points are included, along with friction points identified by GPT-4o, GPT-4o mini, Llama 70B, and Llama 8B.

Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik. arXiv preprint arXiv:2503.12370 (2025)

Dataset Details

Speaker-level Information

Speakers in this dataset troubleshoot problems together, always in pairs: usually one person has a problem and the other helps them. Role A denotes the person seeking assistance, and role B the person providing it. Because speakers can take part in multiple conversations, we track the following metadata:

  • role_A_count: number of conversations where the speaker served in role A
  • role_B_count: number of conversations where the speaker served in role B
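
The two counters above can be recomputed from the conversation-level `role_A` and `role_B` fields. The following is a minimal sketch under that assumption; it works on plain dictionaries rather than on the corpus itself, and the speaker ids and records are hypothetical.

```python
from collections import Counter


def count_roles(conversations):
    """Return {speaker_id: {"role_A_count": n, "role_B_count": m}}."""
    role_a = Counter(c["role_A"] for c in conversations)
    role_b = Counter(c["role_B"] for c in conversations)
    speakers = set(role_a) | set(role_b)
    # Counter returns 0 for speakers who never held a given role.
    return {
        s: {"role_A_count": role_a[s], "role_B_count": role_b[s]}
        for s in speakers
    }


# Hypothetical records, not taken from the corpus.
toy_conversations = [
    {"id": "c1", "role_A": "alice", "role_B": "bob"},
    {"id": "c2", "role_A": "carol", "role_B": "bob"},
    {"id": "c3", "role_A": "bob", "role_B": "alice"},
]

counts = count_roles(toy_conversations)
print(counts["bob"])  # {'role_A_count': 1, 'role_B_count': 2}
```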

Utterance-level Information

The following data is provided:

  • id: unique id of the utterance
  • speaker: the speaker who authored the utterance
  • conversation_id: unique id of the conversation
  • reply_to: index of the utterance to which this is a reply (None if the utterance is not a reply)
  • timestamp: the index of the utterance in the conversation
  • text: textual content of the utterance

Metadata for utterances includes:

  • time_elapsed: number of minutes elapsed since the start of the conversation
  • gpt_explanation: an explanation of the utterance, generated by ChatGPT
  • conversational_friction: conversational friction scores, generated by the original authors of the paper
  • explanation: human-generated explanation of the utterance
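
To illustrate how the utterance fields fit together, here is a rough sketch that models an utterance as a dataclass and follows `reply_to` links back to the start of a thread. It assumes that `reply_to` holds a value matching another utterance's `id`; the class, the helper `reply_chain`, and the sample utterances are all hypothetical and not part of the corpus or of ConvoKit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Utterance:
    id: str
    speaker: str
    conversation_id: str
    reply_to: Optional[str]  # None when the utterance is not a reply
    timestamp: int           # index of the utterance in the conversation
    text: str
    meta: dict = field(default_factory=dict)  # e.g. time_elapsed, explanation


def reply_chain(utterances, start_id):
    """Follow reply_to links from start_id back to an utterance with no parent."""
    by_id = {u.id: u for u in utterances}
    chain, seen = [], set()
    current = by_id.get(start_id)
    # The `seen` set guards against malformed data with circular replies.
    while current is not None and current.id not in seen:
        chain.append(current.id)
        seen.add(current.id)
        parent = current.reply_to
        current = by_id.get(parent) if parent is not None else None
    return chain


# Hypothetical utterances, not taken from the corpus.
toy = [
    Utterance("u0", "alice", "c1", None, 0, "my wifi stopped working"),
    Utterance("u1", "bob", "c1", "u0", 1, "which card do you have?"),
    Utterance("u2", "alice", "c1", "u1", 2, "how do I check that?"),
]

print(reply_chain(toy, "u2"))  # ['u2', 'u1', 'u0']
```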

Conversation-level Information

For each conversation we provide:

  • id: a unique index of the conversation

Metadata for conversations includes:

  • batch: the batch into which the conversation is sorted
  • duration: number of minutes elapsed from the start to the end of the conversation
  • role_A: speaker id for the one serving in role A for this conversation
  • role_B: speaker id for the one serving in role B for this conversation
  • ending: type of ending the conversation had (natural end, abrupt, or ran out of time)
  • conversational_success: how far the conversation went in resolving the role A speaker's question (success, some progress, or no progress)

For the human annotators and for each model, the following metadata is provided:

  • conversational_friction_present_[model]: whether friction is detected anywhere in the conversation by [model]
  • friction_count_[model]: number of instances of conversational friction detected by [model]
  • friction_index_list_[model]: list of instances of conversational friction within this conversation detected by [model]
  • explanation_list_[model]: list of explanations for each friction instance generated by [model]
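
One use of the per-annotator fields is to compare where two annotators locate friction in the same conversation. The sketch below assumes that `friction_index_list_[model]` holds utterance indices and that the `[model]` suffixes look like `human` and `gpt4o`; the exact suffixes used in the corpus are not stated here, so treat both as placeholders. The helper `compare_friction` is hypothetical.

```python
def compare_friction(convo_meta, annotator_a, annotator_b):
    """Compare the friction indices reported by two annotators for one conversation."""
    idx_a = set(convo_meta[f"friction_index_list_{annotator_a}"])
    idx_b = set(convo_meta[f"friction_index_list_{annotator_b}"])
    union = idx_a | idx_b
    return {
        "shared": sorted(idx_a & idx_b),
        "only_a": sorted(idx_a - idx_b),
        "only_b": sorted(idx_b - idx_a),
        # Jaccard overlap; defined as 1.0 when neither annotator found friction.
        "jaccard": len(idx_a & idx_b) / len(union) if union else 1.0,
    }


# Hypothetical conversation metadata with placeholder suffixes.
toy_meta = {
    "friction_index_list_human": [3, 7, 12],
    "friction_index_list_gpt4o": [7, 12, 20],
}

result = compare_friction(toy_meta, "human", "gpt4o")
print(result["shared"])   # [7, 12]
print(result["jaccard"])  # 0.5
```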

Basic Stats: ubuntu-chat-logs

  • Number of utterances: 7950
  • Number of conversations: 200
  • Number of speakers: 361
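
A quick back-of-the-envelope reading of these figures, assuming every conversation has exactly two speakers as described above:

```python
n_utterances = 7950
n_conversations = 200
n_speakers = 361

# Average conversation length in utterances.
mean_utterances = n_utterances / n_conversations

# Two roles per conversation give 400 role slots; fewer distinct speakers
# than slots means some speakers appear in more than one conversation.
role_slots = 2 * n_conversations
repeat_slots = role_slots - n_speakers

print(mean_utterances)  # 39.75
print(role_slots)       # 400
print(repeat_slots)     # 39
```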

Contact

The ConvoKit-formatted corpus was created by Axel Bax ([email protected]) from the dataset created by Sarkar et al.
Corresponding Author: Rupak Sarkar ([email protected])

Data Access

Find the zipped ConvoKit-formatted corpus here: ubuntu-chat-logs.zip
