Description
Multi-Domain Goal-Oriented Dialogues (MultiDoGO)
This corpus converts the Multi-Domain Goal-Oriented Dialogues (MultiDoGO) dataset into ConvoKit format across six domains: airline, fastfood, finance, insurance, media, and software. It provides three corpora: (1) unannotated dialogs, (2) sentence-level annotated splits, and (3) turn-level annotated splits.
Overview
The converted release includes:
- Unannotated human-to-human dialogs
- Annotated paper splits at sentence level and turn level (train/dev/test)
Dataset Details
Utterance
- text: original utterance text
- meta:
  - domain: one of {airline, fastfood, finance, insurance, media, software}
  - split: {train, dev, test} (empty for unannotated)
  - turnNumber: original turn index (string)
  - sentenceNumber: sentence index for sentence-level splits (string; empty otherwise)
  - raw_utteranceId: original `utteranceId`
  - slot_labels: space-separated labels
  - intent: single or multiple intents (if multiple, separated by `<div>`)
Speaker
- id: `domain__role`
- meta.role: original `authorRole` (fallback `customer`)
- meta.domain: domain
Conversation
- id: `domain__conversationId`
- meta:
  - raw_conversationId
  - domain
  - split
  - (optional) num_utterances
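As a minimal sketch of how the label fields described above might be read back into lists: the helper name `parse_labels` and the example metadata values are hypothetical, and only the `<div>` separator for intents and the space separator for slot labels come from the description.

```python
def parse_labels(meta):
    """Split the stored label strings into Python lists.

    Assumes `intent` joins multiple intents with "<div>" and `slot_labels`
    is a space-separated string, as described in Dataset Details.
    """
    intent_raw = meta.get("intent", "") or ""
    slots_raw = meta.get("slot_labels", "") or ""
    intents = [part.strip() for part in intent_raw.split("<div>") if part.strip()]
    slots = slots_raw.split()
    return intents, slots


# Hypothetical utterance metadata, shaped like the fields listed above.
example_meta = {
    "domain": "fastfood",
    "split": "train",
    "turnNumber": "3",
    "sentenceNumber": "0",
    "intent": "orderpizzaintent<div>orderdrinkintent",
    "slot_labels": "O O food_item O quantity",
}

intents, slots = parse_labels(example_meta)
print(intents)
print(slots)
```

With a real corpus, the same helper could presumably be applied to each utterance's `meta` dictionary.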
Conversion Choices
- ID scheme: `Utterance.id = domain__conversationId__turnNumber__sentenceNumber__utteranceId`
- Reply chain: `reply_to` links to the previous utterance in the same conversation.
- Speaker mapping: use `authorRole` (fallback `customer`), namespaced by `domain`.
- Label handling: `intent` uses `<div>` for multi-intent; `slot_labels` split by space.
- Robust parsing: tolerant TSV reading via `pandas.read_csv(..., engine="python", on_bad_lines="skip")`; coerce non-numeric `turnNumber`/`sentenceNumber` by regex for ordering.
- Conversation objects: created implicitly by `Corpus(utterances=...)`, then conversation metadata (`conv.meta`) updated.
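The ID scheme, speaker fallback, numeric ordering, and reply chain can be illustrated with a small standard-library sketch. This is not the conversion script itself: the function names, the row field names other than those mentioned in this description, the exact regex, and the sample rows are all assumptions made for illustration.

```python
import re


def to_int(value, default=0):
    """Return the first run of digits in `value` as an int, or `default` if none."""
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else default


def build_utterances(rows):
    """Turn flat rows into utterance dicts with IDs, speakers, and reply links."""
    conversations = {}
    for row in rows:
        key = (row["domain"], row["conversationId"])
        conversations.setdefault(key, []).append(row)

    utterances = []
    for (domain, conv_id), conv_rows in conversations.items():
        # Order numerically so that turn "10" sorts after turn "2".
        conv_rows.sort(
            key=lambda r: (to_int(r["turnNumber"]), to_int(r["sentenceNumber"]))
        )
        previous_id = None  # the first utterance in a conversation has no parent
        for r in conv_rows:
            role = r["authorRole"] or "customer"  # fallback when the role is missing
            utt_id = "__".join(
                [domain, conv_id, r["turnNumber"], r["sentenceNumber"], r["utteranceId"]]
            )
            utterances.append(
                {
                    "id": utt_id,
                    "conversation_id": f"{domain}__{conv_id}",
                    "speaker": f"{domain}__{role}",
                    "reply_to": previous_id,
                    "text": r["text"],
                }
            )
            previous_id = utt_id
    return utterances


# Hypothetical rows, deliberately out of order.
rows = [
    {"domain": "airline", "conversationId": "c1", "turnNumber": "10",
     "sentenceNumber": "", "utteranceId": "u3", "authorRole": "agent", "text": "Done."},
    {"domain": "airline", "conversationId": "c1", "turnNumber": "2",
     "sentenceNumber": "", "utteranceId": "u2", "authorRole": "", "text": "I need to rebook."},
    {"domain": "airline", "conversationId": "c1", "turnNumber": "1",
     "sentenceNumber": "", "utteranceId": "u1", "authorRole": "agent", "text": "Hello!"},
]

utterances = build_utterances(rows)
for u in utterances:
    print(u["id"], "<-", u["reply_to"])
```

Dictionaries shaped like these could then be wrapped in ConvoKit `Utterance` objects and passed to `Corpus(utterances=...)`.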
Statistics
Unannotated
- Utterances: 1,376,153
- Conversations: 86,716
- Speakers: 12
- Domains (by conversations):
  | domain | convs |
  | --- | --- |
  | media | 33,322 |
  | airline | 15,098 |
  | insurance | 14,259 |
  | fastfood | 9,642 |
  | finance | 8,833 |
  | software | 5,562 |
- Splits: (none; unannotated)
- Top intents / slots: (not applicable in unannotated)
Sentence-level splits (annotated)
- Utterances: 121,636
- Conversations: 14,215
- Speakers: 6
- Domains (by conversations):
  | domain | convs |
  | --- | --- |
  | media | 2,432 |
  | airline | 2,430 |
  | finance | 2,379 |
  | software | 2,355 |
  | insurance | 2,332 |
  | fastfood | 2,287 |
- Splits (by conversations): train 9,948 • test 2,848 • dev 1,419
- Top intents (k=10): contentonly 39,948; confirmation 18,009; openinggreeting 13,007; rejection 10,761; thankyou 10,403; outofdomain 6,448; closinggreeting 3,765; startserviceintent 2,224; orderpizzaintent 2,009; orderdrinkintent 1,916
- Top slot labels (k=10): O 318,446; name 12,373; address 12,284; booking_confirmation_number 6,883; food_item 6,596; ingredient 4,918; datacategoryvalues 4,563; ssn 4,010; quantity 3,670; email_address 3,473
Turn-level splits (annotated)
- Utterances: 116,597
- Conversations: 14,147
- Speakers: 6
- Domains (by conversations):
  | domain | convs |
  | --- | --- |
  | media | 2,420 |
  | finance | 2,406 |
  | airline | 2,404 |
  | software | 2,362 |
  | insurance | 2,321 |
  | fastfood | 2,234 |
- Splits (by conversations): train 9,900 • test 2,834 • dev 1,413
- Top intents (k=10): contentonly 39,248; confirmation 16,666; openinggreeting 12,618; rejection 10,903; thankyou 9,712; outofdomain 4,810; closinggreeting 3,111; startserviceintent 2,450; orderdrinkintent 1,876; orderpizzaintent 1,780
- Top slot labels (k=10): O 317,541; name 12,683; address 11,901; food_item 7,413; booking_confirmation_number 6,853; datacategoryvalues 4,474; quantity 4,025; ssn 3,862; email_address 3,466; card_number 3,390
Aggregated (deduplicated across corpora)
- Utterances (unique): 1,614,386
- Conversations (unique): 101,111
- Speakers (unique): 18
- Domains (by conversations, normalized):
  | domain | convs |
  | --- | --- |
  | media | 35,771 (= 33,322 unannotated + 2,449 annotated) |
  | airline | 17,535 (= 15,098 + 2,437) |
  | insurance | 16,606 (= 14,259 + 2,347) |
  | fastfood | 11,986 (= 9,642 + 2,344) |
  | finance | 11,271 (= 8,833 + 2,438) |
  | software | 7,942 (= 5,562 + 2,380) |
- Top intents (aggregate, k=10): contentonly 79,196; confirmation 34,675; openinggreeting 25,625; rejection 21,664; thankyou 20,115; outofdomain 11,258; closinggreeting 6,876; startserviceintent 4,674; orderdrinkintent 3,792; orderpizzaintent 3,789
- Top slot labels (aggregate, k=10): O 635,987; name 25,056; address 24,185; food_item 14,009; booking_confirmation_number 13,736; datacategoryvalues 9,037; ingredient 8,186; ssn 7,872; quantity 7,695; email_address 6,939
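The aggregated per-domain conversation counts can be cross-checked against the reported unique total with a few lines of arithmetic. The sketch below only re-adds the figures listed above; the variable names are illustrative.

```python
# Per-domain conversation counts copied from the aggregated statistics above.
unannotated = {
    "media": 33322, "airline": 15098, "insurance": 14259,
    "fastfood": 9642, "finance": 8833, "software": 5562,
}
annotated = {
    "media": 2449, "airline": 2437, "insurance": 2347,
    "fastfood": 2344, "finance": 2438, "software": 2380,
}

# Combine the two sources per domain, then sum across domains.
combined = {domain: unannotated[domain] + annotated[domain] for domain in unannotated}
total = sum(combined.values())

for domain, count in combined.items():
    print(f"{domain}: {count:,}")
print(f"total: {total:,}")
```

The per-domain sums match the table, and the total matches the reported 101,111 unique conversations.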
Contact
The original dataset was introduced in the EMNLP 2019 paper "Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data".
The dataset was formatted for ConvoKit by Saebyeol Shin ([email protected]).
Dataset Access
Dataset with example script: https://drive.google.com/drive/folders/1BnFMFiGkVA1bUGY-IDzkvOXpZ-zYOVPn?usp=sharing