Supreme Court Corpus Extension #277

Open: wants to merge 18 commits into base: master

yash-chatha
Contributor

Description

This pull request updates the Supreme Court Corpus to include cleaned and merged utterance data through 2023. The merging process preserves original utterance IDs from the legacy dataset while integrating non-overlapping new utterances from 2019 onward. Minor documentation updates reflecting the expanded dataset years (1955–2023) are also included.
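The merge rule described above could be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: it assumes utterances are plain dicts with hypothetical `id` and `case_id` keys, and it reads "non-overlapping" as "belonging to a case_id that the legacy data does not contain".

```python
def merge_utterances(old_utts, new_utts):
    """Merge legacy and new utterances (illustrative sketch only).

    Assumptions (not taken from the PR's code): each utterance is a dict
    with "id" and "case_id" keys, and a new utterance is "non-overlapping"
    when its case_id does not appear anywhere in the legacy data.
    """
    # Keep every legacy utterance unchanged, keyed by its original ID.
    merged = {u["id"]: u for u in old_utts}
    old_case_ids = {u["case_id"] for u in old_utts}

    for utt in new_utts:
        # Skip new utterances from cases the legacy data already covers.
        if utt["case_id"] in old_case_ids:
            continue
        # Refuse to overwrite a legacy utterance that shares the same ID.
        if utt["id"] in merged:
            raise ValueError(f"utterance ID collision: {utt['id']}")
        merged[utt["id"]] = utt

    # Dicts preserve insertion order, so legacy utterances come first.
    return list(merged.values())
```

Raising on an ID collision, rather than silently renaming the new utterance, keeps the legacy IDs authoritative and makes any clash visible before the corpus is written out.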

Motivation and Context

This PR was created for the following reasons:

  • Incorporate recent Supreme Court transcripts (2019–2023).
  • Ensure backward compatibility by preserving old utterance IDs.
  • Address inconsistencies between old and new datasets, including resolving duplicate utterances and ensuring consistent speaker metadata.

How has this been tested?

We validated the merged dataset with Python scripts that:

  • Verified that all old utterances are retained.
  • Confirmed that no utterance ID collisions occur.
  • Ensured that only new case_ids (2019 onward) contribute new utterances.
  • Checked that all speakers in new utterances exist in the old speaker metadata.
  • Confirmed that no duplicate utterances (by content and ID) exist in the merged dataset.
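For reviewers who want to reproduce checks of this kind, a rough sketch is given below. It is not the validation script from this PR. The dict keys (`id`, `case_id`, `speaker`, `text`) and the function name `validate_merge` are hypothetical, and the content-duplicate key is an assumption that may be too strict for short replies that legitimately repeat within a case.

```python
from collections import Counter


def validate_merge(old_utts, merged_utts, old_speakers):
    """Return a list of problem descriptions; an empty list means all checks pass.

    Illustrative sketch only. Utterances are assumed to be dicts with
    "id", "case_id", "speaker" and "text" keys (hypothetical field names).
    """
    problems = []
    old_ids = {u["id"] for u in old_utts}
    old_case_ids = {u["case_id"] for u in old_utts}
    merged_ids = [u["id"] for u in merged_utts]

    # 1. Every old utterance is retained.
    missing = old_ids - set(merged_ids)
    if missing:
        problems.append(f"{len(missing)} old utterances missing")

    # 2. No utterance ID appears more than once.
    if len(merged_ids) != len(set(merged_ids)):
        problems.append("utterance ID collision in merged data")

    # 3. Only case_ids absent from the old data contribute new utterances.
    added = [u for u in merged_utts if u["id"] not in old_ids]
    if any(u["case_id"] in old_case_ids for u in added):
        problems.append("new utterance attached to an old case_id")

    # 4. Speakers of added utterances exist in the old speaker metadata.
    unknown = {u["speaker"] for u in added} - set(old_speakers)
    if unknown:
        problems.append(f"unknown speakers: {sorted(unknown)}")

    # 5. No duplicate content (assumed key: case_id, speaker, text).
    content = Counter((u["case_id"], u["speaker"], u["text"]) for u in merged_utts)
    if any(count > 1 for count in content.values()):
        problems.append("duplicate utterance content in merged data")

    return problems
```

Returning a list of problems instead of asserting on the first failure lets a single run report every violated check at once.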

We also conducted manual spot checks of sampled cases and utterances and cross-verified case_ids and conversation_ids between old, new, and merged datasets.

Other information

We updated the documentation to clarify which years the corpus now covers and explain the merging methodology.

yash-chatha changed the title from "Supreme" to "Supreme Court Corpus Extension" on May 6, 2025
3 participants