[Dataset] MultiDoGO Dataset #311

@SaebyeolShin

Multi-Domain Goal-Oriented Dialogues (MultiDoGO)

This corpus converts the Multi-Domain Goal-Oriented Dialogues (MultiDoGO) dataset into ConvoKit format across six domains: airline, fastfood, finance, insurance, media, and software. It provides three corpora: (1) unannotated dialogs, (2) sentence-level annotated splits, and (3) turn-level annotated splits.

Overview

This corpus adapts the MultiDoGO dataset into ConvoKit format. It includes:

  • Unannotated human-to-human dialogs
  • Annotated paper splits at sentence level and turn level (train/dev/test)

Dataset Details

Utterance

  • text: original utterance text
  • meta:
    • domain: one of {airline, fastfood, finance, insurance, media, software}
    • split: {train, dev, test} (empty for unannotated)
    • turnNumber: original turn index (string)
    • sentenceNumber: sentence index for sentence-level splits (string; empty otherwise)
    • raw_utteranceId: original utteranceId
    • slot_labels: space-separated labels
    • intent: single or multiple intents (if multiple, separated by <div>)
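The two label fields are packed into plain strings, so most downstream code will want to unpack them first. A minimal sketch of doing that, assuming the metadata is available as a dictionary with the keys listed above; the helper name `parse_labels` and the example values are hypothetical, not taken from the corpus:

```python
def parse_labels(meta):
    """Split the packed label strings of one utterance into lists.

    Assumes `intent` joins multiple intents with "<div>" and
    `slot_labels` is a space-separated string, as described above.
    """
    intent_field = meta.get("intent") or ""
    slot_field = meta.get("slot_labels") or ""
    intents = [part.strip() for part in intent_field.split("<div>") if part.strip()]
    slots = slot_field.split()
    return intents, slots


# Hypothetical utterance metadata, for illustration only.
example_meta = {
    "domain": "fastfood",
    "split": "train",
    "turnNumber": "3",
    "sentenceNumber": "",
    "intent": "orderpizzaintent<div>orderdrinkintent",
    "slot_labels": "O O food_item O",
}

intents, slots = parse_labels(example_meta)
```

Missing or empty fields (as in the unannotated corpus) come back as empty lists rather than raising.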

Speaker

  • id: `domain__role`
  • meta.role: original authorRole (falls back to customer when authorRole is missing)
  • meta.domain: domain

Conversation

  • id: `domain__conversationId`
  • meta:
    • raw_conversationId
    • domain
    • split
    • (optional) num_utterances

Conversion Choices

  • ID scheme: Utterance.id = domain__conversationId__turnNumber__sentenceNumber__utteranceId
  • Reply chain: reply_to links to the previous utterance in the same conversation.
  • Speaker mapping: speakers are taken from authorRole (falling back to customer) and namespaced by domain.
  • Label handling: multiple intents are joined with <div> in intent; slot_labels is stored as a space-separated string.
  • Robust parsing: TSV files are read tolerantly via pandas.read_csv(..., engine="python", on_bad_lines="skip"); non-numeric turnNumber/sentenceNumber values are coerced to numbers with a regex so utterances can be ordered.
  • Conversation objects: created implicitly by Corpus(utterances=...); conversation metadata (conv.meta) is filled in afterwards.
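The ID scheme, ordering, speaker fallback, and reply chain described above can be sketched in plain Python. This is an illustration under stated assumptions, not the conversion script itself: the function names are hypothetical, rows are assumed to be dictionaries keyed by the original column names, and the `text` column name is a guess.

```python
import re


def to_int(value, default=0):
    """Extract the first run of digits from a possibly messy field."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else default


def utterance_id(domain, row):
    """Build domain__conversationId__turnNumber__sentenceNumber__utteranceId."""
    parts = [domain, row["conversationId"], row["turnNumber"],
             row["sentenceNumber"], row["utteranceId"]]
    return "__".join(str(part) for part in parts)


def link_conversation(domain, rows):
    """Order one conversation's rows and chain each to its predecessor."""
    ordered = sorted(
        rows,
        key=lambda r: (to_int(r["turnNumber"]), to_int(r["sentenceNumber"])),
    )
    linked = []
    previous_id = None
    for row in ordered:
        uid = utterance_id(domain, row)
        role = row.get("authorRole") or "customer"  # fallback described above
        linked.append({
            "id": uid,
            "speaker": f"{domain}__{role}",
            "conversation_id": f"{domain}__{row['conversationId']}",
            "reply_to": previous_id,  # None for the first utterance
            "text": row["text"],      # assumed column name
        })
        previous_id = uid
    return linked


# Hypothetical rows, deliberately out of order and with one missing role.
rows = [
    {"conversationId": "c1", "turnNumber": "1", "sentenceNumber": "0",
     "utteranceId": "u2", "authorRole": "", "text": "I need to change my flight."},
    {"conversationId": "c1", "turnNumber": "0", "sentenceNumber": "0",
     "utteranceId": "u1", "authorRole": "agent", "text": "Hello, how can I help?"},
]
linked = link_conversation("airline", rows)
```

Records shaped like these could then be turned into ConvoKit utterances and passed to Corpus(utterances=...). Note that the raw field values go into the ID unchanged; the numeric coercion is used only as a sort key.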

Statistics

Unannotated

  • Utterances: 1,376,153
  • Conversations: 86,716
  • Speakers: 12
  • Domains (by conversations):
    domain convs
    media 33,322
    airline 15,098
    insurance 14,259
    fastfood 9,642
    finance 8,833
    software 5,562
  • Splits: (none — unannotated)
  • Top intents / slots: (not applicable in unannotated)

Sentence-level splits (annotated)

  • Utterances: 121,636
  • Conversations: 14,215
  • Speakers: 6
  • Domains (by conversations):
    domain convs
    media 2,432
    airline 2,430
    finance 2,379
    software 2,355
    insurance 2,332
    fastfood 2,287
  • Splits (by conversations): train 9,948 • test 2,848 • dev 1,419
  • Top intents (k=10): contentonly 39,948; confirmation 18,009; openinggreeting 13,007; rejection 10,761; thankyou 10,403; outofdomain 6,448; closinggreeting 3,765; startserviceintent 2,224; orderpizzaintent 2,009; orderdrinkintent 1,916
  • Top slot labels (k=10): O 318,446; name 12,373; address 12,284; booking_confirmation_number 6,883; food_item 6,596; ingredient 4,918; datacategoryvalues 4,563; ssn 4,010; quantity 3,670; email_address 3,473

Turn-level splits (annotated)

  • Utterances: 116,597
  • Conversations: 14,147
  • Speakers: 6
  • Domains (by conversations):
    domain convs
    media 2,420
    finance 2,406
    airline 2,404
    software 2,362
    insurance 2,321
    fastfood 2,234
  • Splits (by conversations): train 9,900 • test 2,834 • dev 1,413
  • Top intents (k=10): contentonly 39,248; confirmation 16,666; openinggreeting 12,618; rejection 10,903; thankyou 9,712; outofdomain 4,810; closinggreeting 3,111; startserviceintent 2,450; orderdrinkintent 1,876; orderpizzaintent 1,780
  • Top slot labels (k=10): O 317,541; name 12,683; address 11,901; food_item 7,413; booking_confirmation_number 6,853; datacategoryvalues 4,474; quantity 4,025; ssn 3,862; email_address 3,466; card_number 3,390

Aggregated (deduplicated across corpora)

  • Utterances (unique): 1,614,386
  • Conversations (unique): 101,111
  • Speakers (unique): 18
  • Domains (by conversations, normalized):
    domain convs
    media 35,771 (= 33,322 unannotated + 2,449 annotated)
    airline 17,535 (= 15,098 + 2,437)
    insurance 16,606 (= 14,259 + 2,347)
    fastfood 11,986 (= 9,642 + 2,344)
    finance 11,271 (= 8,833 + 2,438)
    software 7,942 (= 5,562 + 2,380)
  • Top intents (aggregate, k=10): contentonly 79,196; confirmation 34,675; openinggreeting 25,625; rejection 21,664; thankyou 20,115; outofdomain 11,258; closinggreeting 6,876; startserviceintent 4,674; orderdrinkintent 3,792; orderpizzaintent 3,789
  • Top slot labels (aggregate, k=10): O 635,987; name 25,056; address 24,185; food_item 14,009; booking_confirmation_number 13,736; datacategoryvalues 9,037; ingredient 8,186; ssn 7,872; quantity 7,695; email_address 6,939
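
As a quick consistency check, the per-domain conversation counts above can be recombined and summed. The figures below are copied from the lists in this section; the variable names are only illustrative.

```python
# Conversations per domain, copied from the statistics above.
unannotated = {
    "media": 33322, "airline": 15098, "insurance": 14259,
    "fastfood": 9642, "finance": 8833, "software": 5562,
}
annotated_unique = {
    "media": 2449, "airline": 2437, "insurance": 2347,
    "fastfood": 2344, "finance": 2438, "software": 2380,
}

# Per-domain aggregate = unannotated + unique annotated conversations.
aggregated = {d: unannotated[d] + annotated_unique[d] for d in unannotated}

total_unannotated = sum(unannotated.values())
total_aggregated = sum(aggregated.values())
```

The sums reproduce the reported totals of 86,716 unannotated and 101,111 unique conversations.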

Contact

The original dataset was introduced in the EMNLP 2019 paper "Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data".

The dataset was formatted for ConvoKit by Saebyeol Shin ([email protected]).

Dataset Access

Dataset with example script: https://drive.google.com/drive/folders/1BnFMFiGkVA1bUGY-IDzkvOXpZ-zYOVPn?usp=sharing
