Description
MeetingBank Corpus
A collection of transcribed city council meetings from 6 major U.S. cities, converted to ConvoKit format for conversational AI research and meeting summarization tasks. The data consists of 1,366 meetings with over 3,579 hours of video content, providing a rich dataset for studying political discourse, meeting dynamics, and automated summarization.
Attribution: Yebowen Hu, Tim Ganter, Hanieh Deilamsalehy, Franck Dernoncourt, Hassan Foroosh, Fei Liu, "MeetingBank: A Benchmark Dataset for Meeting Summarization," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, Canada, July 2023, pp. 4512-4522. [Online]. Available: https://zenodo.org/records/7989108
Dataset details
Speaker-level information
Speakers in the dataset are participants in city council meetings, including council members, city officials, and public speakers. Each speaker is identified by a unique identifier that combines the meeting name with their speaker number (e.g., "SeattleCityCouncil_12142015_speaker_0").
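The identifier scheme above can be sketched as follows. This is a minimal illustration, assuming the `_speaker_` separator seen in the example ID; the helper names `make_speaker_id` and `split_speaker_id` are hypothetical and not part of the corpus or of ConvoKit.

```python
def make_speaker_id(meeting_name: str, speaker_number: int) -> str:
    """Join a meeting name and a per-meeting speaker number into one ID."""
    # Assumption: the separator is "_speaker_", as in the example ID.
    return f"{meeting_name}_speaker_{speaker_number}"


def split_speaker_id(speaker_id: str) -> tuple[str, int]:
    """Recover (meeting_name, speaker_number) from an ID built as above."""
    # rpartition splits on the last occurrence of the separator.
    meeting_name, _, number = speaker_id.rpartition("_speaker_")
    return meeting_name, int(number)


example_id = make_speaker_id("SeattleCityCouncil_12142015", 0)
```

Because the meeting name is part of the ID, speaker 0 in one meeting and speaker 0 in another are distinct speakers in the corpus.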
Speaker metadata includes:
- city: The city where the meeting took place
- meeting_name: The specific meeting identifier
- utterance_count: Total number of utterances contributed by this speaker
Utterance-level information
For each utterance (speech segment), we provide:
- id: An identifier for the utterance (composed of the meeting ID concatenated with the utterance's index in the meeting)
- conversation_id: An identifier for the meeting/conversation to which the utterance belongs
- reply_to: ID of the previous utterance in the conversation (None if it's the first utterance)
- speaker: The speaker who delivered the utterance
- timestamp: Time offset of the utterance within the meeting (in microseconds)
- text: Transcribed textual content of the utterance
Utterance metadata:
- city: The city where the meeting took place
- meeting_name: The specific meeting identifier
- duration: Duration of the speech segment (in microseconds)
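Since both `timestamp` and `duration` are given in microseconds, converting them to more readable units is a common first step. The sketch below uses only the Python standard library; the function names and the sample values are hypothetical and chosen for illustration.

```python
from datetime import timedelta


def microseconds_to_timedelta(value_us: int) -> timedelta:
    """Convert a microsecond count (timestamp or duration) to a timedelta."""
    return timedelta(microseconds=value_us)


def end_offset_us(timestamp_us: int, duration_us: int) -> int:
    """Offset at which a segment ends, in microseconds from meeting start."""
    return timestamp_us + duration_us


# Hypothetical utterance: starts 90.5 s into the meeting and lasts 4 s.
start = microseconds_to_timedelta(90_500_000)
end = microseconds_to_timedelta(end_offset_us(90_500_000, 4_000_000))
```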
Conversation-level information
Each conversation represents a complete city council meeting. The conversation structure follows a linear progression where each utterance replies to the previous one, creating a chronological chain of the meeting proceedings. Conversations are organized by city and meeting date, with each meeting containing multiple agenda items and discussion segments.
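The linear reply structure described above can be sketched as a simple chain in which every utterance points at its predecessor. This is an illustrative reconstruction, not the conversion script itself; the function name and the placeholder IDs are hypothetical.

```python
from typing import Dict, List, Optional


def link_linear_replies(utterance_ids: List[str]) -> Dict[str, Optional[str]]:
    """Map each utterance ID to the ID it replies to (None for the first)."""
    reply_to: Dict[str, Optional[str]] = {}
    previous: Optional[str] = None
    for utt_id in utterance_ids:
        reply_to[utt_id] = previous  # chronological predecessor
        previous = utt_id
    return reply_to


# Placeholder IDs in chronological order; real IDs follow the corpus scheme.
chain = link_linear_replies(["utt_a", "utt_b", "utt_c"])
```

One consequence of this design is that `reply_to` encodes only chronological order, not who is actually responding to whom.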
Meeting metadata includes:
- city: The city where the meeting took place
- meeting_name: The specific meeting identifier
- num_speakers: Total number of unique speakers in the meeting
- total_utterances: Total number of speech segments in the meeting
- total_duration: Total duration of the meeting (in microseconds)
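As a rough sketch, the meeting-level fields could be derived from a meeting's utterances as shown below. It assumes that `total_duration` is the sum of the utterance durations, which the description does not confirm (it could equally be the span from the first to the last segment). The dictionary layout and the function name are hypothetical.

```python
from typing import Dict, List


def summarize_meeting(utterances: List[dict]) -> Dict[str, int]:
    """Aggregate per-utterance records into meeting-level metadata."""
    speakers = {u["speaker"] for u in utterances}
    return {
        "num_speakers": len(speakers),
        "total_utterances": len(utterances),
        # Assumption: total duration is the sum of segment durations (in µs).
        "total_duration": sum(u["duration"] for u in utterances),
    }


# Hypothetical records: two speakers, three segments.
sample = [
    {"speaker": "m_speaker_0", "duration": 2_000_000},
    {"speaker": "m_speaker_1", "duration": 3_500_000},
    {"speaker": "m_speaker_0", "duration": 1_000_000},
]
summary = summarize_meeting(sample)
```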
Quick stats
Number of conversations in the dataset = 1366
Number of speakers in the dataset = 12272
Number of utterances in the dataset = 1011870
=== CITY BREAKDOWN ===
Alameda: 164 transcripts
Boston: 32 transcripts
Denver: 401 transcripts
KingCounty: 132 transcripts
LongBeach: 310 transcripts
Seattle: 327 transcripts
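As a quick consistency check, the per-city counts above add up to the 1,366 conversations reported for the whole dataset. The snippet copies the figures from the breakdown; the variable names are illustrative.

```python
# Transcript counts copied from the city breakdown above.
transcripts_per_city = {
    "Alameda": 164,
    "Boston": 32,
    "Denver": 401,
    "KingCounty": 132,
    "LongBeach": 310,
    "Seattle": 327,
}

total_transcripts = sum(transcripts_per_city.values())
largest_city = max(transcripts_per_city, key=transcripts_per_city.get)
```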
Contact
Please email any questions to: [email protected]
Dataset Link
https://drive.google.com/drive/u/1/folders/15OXtWuMj2GYBAeYGo1EzJlcIzSco6Z1Q