Supreme Court Corpus Extension #277

yash-chatha · 2025-05-06T01:18:40Z

Description

This pull request updates the Supreme Court Corpus to include cleaned and merged utterance data through 2023. The merging process preserves original utterance IDs from the legacy dataset while integrating non-overlapping new utterances from 2019 onward. Minor documentation updates reflecting the expanded dataset years (1955–2023) are also included.
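As a rough illustration of the merge rule described above, here is a minimal sketch. It is not the script used in this PR: it assumes utterances are plain dicts with hypothetical `id` and `case_id` keys, and it treats "non-overlapping" as "belongs to a case the legacy data does not contain". The function name `merge_utterances` is also an assumption.

```python
from typing import Dict, List

Utterance = Dict[str, str]  # hypothetical shape: {"id": ..., "case_id": ...}


def merge_utterances(old: List[Utterance], new: List[Utterance]) -> List[Utterance]:
    """Keep every legacy utterance unchanged; append new ones only from unseen cases."""
    seen_ids = {u["id"] for u in old}
    old_cases = {u["case_id"] for u in old}
    merged = list(old)  # legacy utterances keep their original IDs and order
    for utt in new:
        # Skip utterances from cases the legacy data already covers (overlap).
        if utt["case_id"] in old_cases:
            continue
        # Fail loudly on an ID clash instead of silently overwriting.
        if utt["id"] in seen_ids:
            raise ValueError(f"utterance ID collision: {utt['id']}")
        seen_ids.add(utt["id"])
        merged.append(utt)
    return merged
```

Raising on a collision, rather than renaming the new utterance, keeps the backward-compatibility guarantee explicit: an existing ID can never end up pointing at different content.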

Motivation and Context

This PR was created for the following reasons:

- Incorporate recent Supreme Court transcripts (2019–2023).
- Ensure backward compatibility by preserving old utterance IDs.
- Address inconsistencies between old and new datasets, including resolving duplicate utterances and ensuring consistent speaker metadata.

How has this been tested?

The merged dataset was validated with Python scripts that:

- Verified that all old utterances are retained.
- Confirmed that no utterance ID collisions occur.
- Ensured that only new case_ids (2019 onward) contribute new utterances.
- Checked that all speakers in new utterances exist in the old speaker metadata.
- Confirmed that no duplicate utterances (by content and ID) exist in the merged dataset.
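The automated checks listed above could be sketched roughly as follows. This is an illustrative outline, not the validation script from this PR: the dict keys (`id`, `case_id`, `speaker`, `text`) and the function name `validate_merge` are assumptions, and "duplicate content" is assumed here to mean the same case, speaker, and text, which a real check might refine with a timestamp or position.

```python
from collections import Counter
from typing import Dict, List, Set

Utterance = Dict[str, str]  # hypothetical keys: id, case_id, speaker, text


def validate_merge(
    old: List[Utterance], merged: List[Utterance], old_speakers: Set[str]
) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems: List[str] = []
    old_ids = {u["id"] for u in old}
    old_cases = {u["case_id"] for u in old}
    id_counts = Counter(u["id"] for u in merged)

    # 1. Every legacy utterance is retained.
    missing = old_ids - set(id_counts)
    if missing:
        problems.append(f"{len(missing)} old utterance(s) missing")

    # 2. No utterance ID appears more than once.
    clashes = sorted(i for i, n in id_counts.items() if n > 1)
    if clashes:
        problems.append(f"ID collision(s): {clashes}")

    # Utterances the merge added on top of the legacy data.
    added = [u for u in merged if u["id"] not in old_ids]

    # 3. Only cases absent from the legacy data contribute new utterances.
    if any(u["case_id"] in old_cases for u in added):
        problems.append("new utterance attached to a legacy case_id")

    # 4. Every speaker of an added utterance exists in the old speaker metadata.
    unknown = {u["speaker"] for u in added} - old_speakers
    if unknown:
        problems.append(f"unknown speaker(s): {sorted(unknown)}")

    # 5. No duplicated content (assumed key: case, speaker, text).
    content = Counter((u["case_id"], u["speaker"], u["text"]) for u in merged)
    if any(n > 1 for n in content.values()):
        problems.append("duplicate utterance content")

    return problems
```

Collecting problems in a list instead of asserting on the first failure lets a single run report everything that is wrong with a merge.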

We also conducted manual spot checks of sampled cases and utterances and cross-verified case_ids and conversation_ids between old, new, and merged datasets.

Other information

We updated the documentation to clarify which years the corpus now covers and explain the merging methodology.


yash-chatha and others added 18 commits September 20, 2024 00:21

Update datasets documentation and add FOMC and NPR-2P datasets

f43aab2

bring NPR and FOMC datasets to convokit, making changes to the PR

0fb8408

Updated Fora documents to now match the upstream and appropriately

20cec64

change what is needed

Merge branch 'master' of https://github.com/YoshiWanK/ConvoKit

d152557

Rename npr-2P.rst to npr-2p.rst

9af8fe9

Add conversion files to dataset-examples folder

57aa71a

Merge branch 'master' of https://github.com/YoshiWanK/ConvoKit

a7616e8

fomc, npr-2p files updated for conversion notebook

8f9f05a

Add Deli dataset to convokit

56ffd48

- Included conversion notebook and rst file for DeliData - Reworked convokit to properly load both

Update files for Delidata for correct formatting

3f60128

- Added DeliData to README.md - Fixed Usage section for deli.rst to include load corpus - Added download link for delidata in download_config

Updated classifier and classifierModel

ea044c3

tentative done 1/x

bea8693

tentative done 2/x

beccf59

fixed formatting with black

6a88757

fixed formatting (2) with black

ca974a3

Merge branch 'master' into huggingface

18787ea

Merge branch 'test-HF' into deli-branch

6dc71aa

Good merge

Update SCOTUS extensions

8de0818

yash-chatha changed the title ~~Supreme~~ Supreme Court Corpus Extension May 6, 2025
