
[Dataset] NewsInterview Dataset #314


NewsInterview Corpus

A collection of 500 two-person informational interviews from National Public Radio (NPR) and Cable News Network (CNN), comprising 16,396 utterances from 860 speakers. The dataset focuses on journalistic interviews between interviewers and their sources, broadcast between 2000 and 2020.

Dataset curated and introduced in: NewsInterview: a Dataset and a Playground to Evaluate LLMs’ Grounding Gap via Informational Interviews (Spangher et al., ACL 2025)

Corpus translated into ConvoKit format by: Justin Lovelace

Dataset details

Speaker-level information

Speakers are identified by unique IDs with the following metadata:

  • display_name: Original speaker name as appears in transcript
  • role: Speaker type - one of:
    • HOST - Interview host/anchor (76 speakers)
    • GUEST - Interview subject/interviewee; the default role when none is specified (738 speakers)
    • BYLINE - Reporter/correspondent (46 speakers)
  • programs: List of programs this speaker appears in
  • num_interviews: Total number of interviews participated in
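As a quick sketch, a speaker record with the fields above can be pictured as a plain mapping (the field names follow this list; the values here are hypothetical, not drawn from the corpus):

```python
# Hypothetical speaker record illustrating the metadata fields above.
speaker_meta = {
    "display_name": "Jane Doe",
    "role": "HOST",  # one of HOST, GUEST, BYLINE
    "programs": ["All Things Considered", "Morning Edition"],
    "num_interviews": 12,
}

# GUEST is the default role when none is recorded:
role = speaker_meta.get("role", "GUEST")
print(role)  # HOST
```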

Utterance-level information

Each utterance corresponds to a single speaking turn in an interview.

  • id: Unique utterance identifier
  • speaker: Speaker ID reference
  • conversation_id: Interview ID this utterance belongs to
  • reply_to: ID of previous utterance (for threading)
  • timestamp: Time marker (if available)
  • text: The utterance text

Additional metadata includes:

  • interview_id: Original interview identifier
  • turn_order: Position in conversation sequence
  • program: NPR/CNN program name
  • date: Interview broadcast date
  • url: Source URL (when available)
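The reply_to links let you reconstruct a turn sequence. A minimal sketch, assuming each utterance replies to the immediately preceding turn and the opening turn has no parent (IDs here are hypothetical):

```python
# Hypothetical utterances from one interview, keyed by reply_to for threading.
utterances = [
    {"id": "u2", "conversation_id": "c1", "reply_to": "u1", "text": "Thanks for having me."},
    {"id": "u1", "conversation_id": "c1", "reply_to": None, "text": "Welcome to the show."},
]

# Walk the chain: start from the turn with no parent, then follow the links.
by_reply = {u["reply_to"]: u for u in utterances}
ordered, prev = [], None
while prev in by_reply:
    ordered.append(by_reply[prev])
    prev = ordered[-1]["id"]

print([u["id"] for u in ordered])  # ['u1', 'u2']
```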

Conversation-level information

Each conversation represents a complete interview with the following metadata:

  • title: Interview title (when available)
  • summary: Interview summary or description
  • program: Source program name (63 unique programs total)
  • date: Broadcast/publication date (ranging from 2000 to 2020)
  • url: Original source URL
  • info_items: Extracted information items from interview
  • info_items_dict: Structured version of information items
  • outlines: Interview objectives/outline
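The date field makes it easy to slice the corpus by broadcast year. A sketch, assuming ISO-formatted date strings (field names follow the list above; the records are hypothetical):

```python
# Hypothetical conversation records using the metadata fields above.
conversations = [
    {"title": "Interview A", "program": "Morning Edition", "date": "2005-03-14"},
    {"title": "Interview B", "program": "Day to Day", "date": "2018-07-02"},
]

# Keep interviews broadcast in or after 2010 (assumes "YYYY-MM-DD" dates).
recent = [c for c in conversations if int(c["date"][:4]) >= 2010]
print([c["title"] for c in recent])  # ['Interview B']
```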

Corpus-level information

The corpus contains:

  • 500 conversations (interviews)
  • 16,396 utterances
  • 860 unique speakers
  • 63 distinct programs

Top programs by interview count:

  • All Things Considered (155)
  • Morning Edition (74)
  • Weekend Edition Saturday (32)
  • Day to Day (22)
  • Weekend Edition Sunday (21)
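Counts like these can be reproduced by tallying the program field across conversations. A toy sketch (the program names are real, the sample data is not):

```python
from collections import Counter

# Toy tally of interviews per program; the real counts are listed above.
programs = [
    "All Things Considered",
    "Morning Edition",
    "All Things Considered",
]
top = Counter(programs).most_common(2)
print(top)  # [('All Things Considered', 2), ('Morning Edition', 1)]
```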

Usage

To download directly with ConvoKit:

from convokit import Corpus, download
corpus = Corpus(filename=download("news-interview-corpus"))

For some quick stats:

corpus.print_summary_stats()
# Number of Speakers: 860
# Number of Utterances: 16396
# Number of Conversations: 500

Additional notes

Data License

The associated research paper is published under CC BY 4.0. However, the dataset repository does not specify a license for the dataset.

Dataset Access

The original dataset can be accessed from the authors' GitHub repository at: https://github.com/alex2awesome/news-interview-question-generation

The corpus is available as a zipped ConvoKit corpus along with the conversion script and an example analysis script at: https://drive.google.com/drive/folders/1OP1vHx9mDRKRMB8oox3D869Fx11F43DZ?usp=sharing

Note on Dataset Size

The original paper describes a corpus of ~40,000 interviews; however, the dataset file available in the GitHub repository (and hence this ConvoKit corpus) contains 500 conversations.

Citation

If you use this dataset, please cite:

@inproceedings{spangher-etal-2025-newsinterview,
    title = "{N}ews{I}nterview: a Dataset and a Playground to Evaluate {LLM}s' Grounding Gap via Informational Interviews",
    author = "Spangher, Alexander  and
      Lu, Michael  and
      Kalyan, Sriya  and
      Cho, Hyundong Justin  and
      Huang, Tenghao  and
      Shi, Weiyan  and
      May, Jonathan",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1580/",
    doi = "10.18653/v1/2025.acl-long.1580",
    pages = "32895--32925",
    ISBN = "979-8-89176-251-0",
    abstract = "Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs' strategic dialogue capabilities."
}
