NewsInterview Corpus
A collection of 500 two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), containing 16,396 utterances from 860 speakers. The dataset focuses on journalistic interviews between interviewers and sources, broadcast between 2000 and 2020.
Dataset curated and introduced in: NewsInterview: a Dataset and a Playground to Evaluate LLMs’ Grounding Gap via Informational Interviews (Spangher et al., ACL 2025)
Corpus translated into ConvoKit format by: Justin Lovelace
Dataset details
Speaker-level information
Speakers are identified by unique IDs with the following metadata:
- display_name: Original speaker name as it appears in the transcript
- role: Speaker type, one of:
  - HOST: Interview host/anchor (76 speakers)
  - GUEST: Interview subject/interviewee; the default type if not specified (738 speakers)
  - BYLINE: Reporter/correspondent (46 speakers)
- programs: List of programs this speaker appears in
- num_interviews: Total number of interviews participated in
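As a sketch of how these speaker fields can be used, the snippet below tallies speakers by role. The records are hypothetical stand-ins shaped like the metadata above (names and counts are illustrative, not drawn from the corpus); with the real corpus they would come from `corpus.iter_speakers()` and each speaker's `meta`.

```python
from collections import Counter

# Hypothetical speaker records mirroring the documented metadata fields;
# in ConvoKit these would be read from corpus.iter_speakers() / speaker.meta.
speakers = [
    {"display_name": "Host A", "role": "HOST", "programs": ["Morning Edition"], "num_interviews": 3},
    {"display_name": "Guest B", "role": "GUEST", "programs": ["Morning Edition"], "num_interviews": 1},
    {"display_name": "Guest C", "role": "GUEST", "programs": ["All Things Considered"], "num_interviews": 2},
    {"display_name": "Reporter D", "role": "BYLINE", "programs": ["Day to Day"], "num_interviews": 1},
]

# Count how many speakers hold each role.
role_counts = Counter(s["role"] for s in speakers)
print(role_counts)  # Counter({'GUEST': 2, 'HOST': 1, 'BYLINE': 1})
```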
Utterance-level information
Each utterance corresponds to a single speaking turn in an interview.
- id: Unique utterance identifier
- speaker: Speaker ID reference
- conversation_id: Interview ID this utterance belongs to
- reply_to: ID of previous utterance (for threading)
- timestamp: Time marker (if available)
- text: The actual utterance text
Additional metadata includes:
- interview_id: Original interview identifier
- turn_order: Position in conversation sequence
- program: NPR/CNN program name
- date: Interview broadcast date
- url: Source URL (when available)
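Since `reply_to` links each utterance to its predecessor, a conversation's turn order can be rebuilt by following the chain from the root (the utterance whose `reply_to` is empty). The sketch below assumes a linear chain, as the field description implies; the IDs and text are hypothetical, not taken from the corpus.

```python
# Hypothetical utterances from one interview, shaped like the fields above;
# ids, speakers, and text are illustrative only.
utterances = [
    {"id": "u2", "speaker": "guest-1", "conversation_id": "iv-1", "reply_to": "u1", "text": "Thanks for having me."},
    {"id": "u1", "speaker": "host-1", "conversation_id": "iv-1", "reply_to": None, "text": "Welcome to the program."},
    {"id": "u3", "speaker": "host-1", "conversation_id": "iv-1", "reply_to": "u2", "text": "Let's start with the news."},
]

# Index each utterance by the id it replies to, then walk from the root (reply_to=None).
by_parent = {u["reply_to"]: u for u in utterances}
ordered, cursor = [], None
while cursor in by_parent:
    utt = by_parent[cursor]
    ordered.append(utt)
    cursor = utt["id"]

print([u["id"] for u in ordered])  # ['u1', 'u2', 'u3']
```

With a real ConvoKit conversation, the same ordering is typically available directly via `turn_order` or `conversation.iter_utterances()`.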
Conversation-level information
Each conversation represents a complete interview with the following metadata:
- title: Interview title (when available)
- summary: Interview summary or description
- program: Source program name (63 unique programs total)
- date: Broadcast/publication date (ranging from 2000 to 2020)
- url: Original source URL
- info_items: Extracted information items from interview
- info_items_dict: Structured version of information items
- outlines: Interview objectives/outline
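These conversation-level fields support simple filtering, e.g. selecting interviews from one program within a date range. The records below are hypothetical, and the ISO `YYYY-MM-DD` date format is an assumption (this page does not specify how dates are encoded in the metadata).

```python
from datetime import date

# Hypothetical conversation metadata records shaped like the fields above;
# titles, programs, and dates are illustrative only.
conversations = [
    {"title": "Interview 1", "program": "Morning Edition", "date": "2003-05-14"},
    {"title": "Interview 2", "program": "All Things Considered", "date": "2015-09-02"},
    {"title": "Interview 3", "program": "Morning Edition", "date": "2019-11-20"},
]

# Keep Morning Edition interviews broadcast in or after 2010
# (assumes ISO-formatted date strings).
selected = [
    c for c in conversations
    if c["program"] == "Morning Edition"
    and date.fromisoformat(c["date"]).year >= 2010
]
print([c["title"] for c in selected])  # ['Interview 3']
```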
Corpus-level information
The corpus contains:
- 500 conversations (interviews)
- 16,396 utterances
- 860 unique speakers
- 63 distinct programs
Top programs by interview count:
- All Things Considered (155)
- Morning Edition (74)
- Weekend Edition Saturday (32)
- Day to Day (22)
- Weekend Edition Sunday (21)
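A ranking like the one above can be reproduced by tallying the `program` field across conversations. The program labels below mirror names from the list, but the counts are made up for illustration.

```python
from collections import Counter

# Illustrative program labels, one per interview (counts are invented);
# with the real corpus these would come from each conversation's metadata.
programs = (
    ["All Things Considered"] * 4
    + ["Morning Edition"] * 2
    + ["Day to Day"] * 1
)

# Rank programs by interview count and take the top two.
top = Counter(programs).most_common(2)
print(top)  # [('All Things Considered', 4), ('Morning Edition', 2)]
```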
Usage
To download directly with ConvoKit:
```python
from convokit import Corpus, download
corpus = Corpus(filename=download("news-interview-corpus"))
```

For some quick stats:
```python
corpus.print_summary_stats()
# Number of Speakers: 860
# Number of Utterances: 16396
# Number of Conversations: 500
```

Additional notes
Data License
The associated research paper is published under CC BY 4.0. However, the dataset repository does not specify a license for the dataset.
Dataset Access
The original dataset can be accessed from the authors' GitHub repository at: https://github.com/alex2awesome/news-interview-question-generation
The corpus is available as a zipped ConvoKit corpus along with the conversion script and an example analysis script at: https://drive.google.com/drive/folders/1OP1vHx9mDRKRMB8oox3D869Fx11F43DZ?usp=sharing
Note on Dataset Size
The original paper mentions ~40,000 interviews; however, the dataset file from the GitHub repository contains 500 conversations.
Citation
If you use this dataset, please cite:
@inproceedings{spangher-etal-2025-newsinterview,
title = "{N}ews{I}nterview: a Dataset and a Playground to Evaluate {LLM}s' Grounding Gap via Informational Interviews",
author = "Spangher, Alexander and
Lu, Michael and
Kalyan, Sriya and
Cho, Hyundong Justin and
Huang, Tenghao and
Shi, Weiyan and
May, Jonathan",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1580/",
doi = "10.18653/v1/2025.acl-long.1580",
pages = "32895--32925",
ISBN = "979-8-89176-251-0",
abstract = "Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs' strategic dialogue capabilities."
}