Blackletter

A reference to blackletter law, this tool removes potentially copyrighted material from legal case law PDFs. This fulfills our goal of respecting any intellectual property rights others may have while making it possible to digitize and publish case law. This is essential to our mission of making case law accessible to all — a prerequisite for meaningful participation in our democracy.

Proprietary annotations removed from judicial opinions include headnotes, captions, and key cites.

Installation

pip install blackletter

Or install from source:

git clone https://github.com/freelawproject/blackletter
cd blackletter
pip install -e .

Quick Start

Command line:

blackletter process path/to/volume.pdf --reporter f3d --volume 952 --first-page 1 --output output/

Python:

from blackletter import process

process(
    "path/to/volume.pdf",
    "output/",
    reporter="f3d",
    volume="952",
    first_page=1,
)

This runs the full pipeline: OCR (if needed), YOLO detection, page number extraction, opinion splitting, and redaction — all in one pass.

How It Works

The process command runs a single-pass pipeline:

OCR (if needed): Detects image-only PDFs, downsamples pages, and adds a text layer via ocrmypdf/tesseract
Detection: Runs a YOLO model to identify proprietary elements (headnotes, captions, key cites, brackets, etc.) and structural elements (page numbers, dividers, footnotes)
Page Numbers: Extracts and validates page numbers using OCR on detected regions
Opinion Pairing: Matches case captions to key icons to identify opinion boundaries
Splitting & Redaction: Produces three output variants per opinion:
- Unredacted: Raw opinion pages extracted from the source
- Redacted: Potentially copyrighted content (headnotes, brackets, key icons) blacked out; non-opinion content whited out
- Masked: Optimized for LLM ingestion — only the opinion text is visible

Additionally produces:

A full redacted copy of the entire document
Extracted case law images (charts, photos, etc.) as PNGs
A detections.json export of all YOLO detections for review tooling

Models

Blackletter bundles three YOLO models, selected via CLI flags:

Flag	File	Classes	Description
(default)	`small.pt`	14	Bundled — fast, handles most cases
`--medium`	`medium.pt`	17	Bundled — better structural detection
`--large`	`large.pt`	21	Downloaded on first use from Hugging Face — highest accuracy, detects additional elements (editorial, judges, docket, court, citation, date)

The large model is hosted at flooie/blackletter-large and is downloaded automatically to blackletter/models/large.pt the first time --large is used.

Command Line Options

Process Command

blackletter process PDF [OPTIONS]

Positional Arguments:
  pdf                       Path to the source PDF

Options:
  --reporter STR            Reporter abbreviation (e.g. f3d, a3d)
  --volume STR              Volume number
  --first-page INT          Page number of the first page in the PDF (default: 1)
  -o, --output PATH         Base output directory (required)
  --model PATH              Path to custom YOLO model weights
  --medium                  Use the medium model (17 classes)
  --large                   Use the large model (21 classes, auto-downloaded)
  --footnotes               Extract footnotes into separate PDFs
  --unredacted              Also generate unredacted opinion PDFs
  --no-shrink               Skip downsampling (default: shrink to ~148 KB/page)
  --optimize {0,1,2,3}      ocrmypdf optimization level (default: 1)
  --bitonal                 Convert to 1-bit B&W before processing (for already-bitonal scans)
  --detect-only             Stop after detection and pairing — no PDFs written (Phase 1 only)

Validate Command

QA tool that checks a PDF's page number sequence for missing, duplicate, or misnumbered pages. Uses YOLO to locate page number regions, then PaddleOCR to read them, with Tesseract and GLM-OCR as fallbacks.

blackletter validate path/to/volume.pdf
blackletter validate path/to/volume.pdf --first-page 100 --last-page 500
blackletter validate path/to/volume.pdf --json

If the filename follows the convention reporter.volume.first.last.pdf (e.g. sct.143.1.888.pdf), the expected page range is inferred automatically.

Features:

Parallel OCR across multiple workers
Auto-correction of consistent OCR misreadings (e.g. systematic off-by-800 errors)
Detection of gaps, duplicates, backwards jumps, and page ranges (e.g. "31-32")
Structural checks for blank pages and orientation changes

Requires optional dependencies: pip install blackletter[analyze]

Draw Command

Visualize YOLO detections on a PDF — useful for debugging model output:

blackletter draw path/to/volume.pdf --output annotated.pdf
blackletter draw path/to/volume.pdf --output annotated.pdf --labels CASE_CAPTION KEY_ICON HEADNOTE
blackletter draw path/to/volume.pdf --output annotated.pdf --large

Output Structure

output/<reporter>/<volume>/<first-page>/
    <reporter>.<volume>.<first>.<last>.pdf   # OCR'd/processed source PDF
    <reporter>.<volume>.redacted.pdf         # Full redacted document

    detections.json       # All YOLO detections (label, bbox, confidence per page)
    pages_meta.json       # Column bounds and midpoints per page
    opinions.json         # Opinion pairs with outside-opinion rects
    redaction_rects.json  # Precomputed redaction rectangles (used by review UI)
    margin_rects.json     # Margin cleanup rectangles

    images/               # Extracted case law images (PNGs)
    unredacted/           # Individual opinion PDFs (raw, no redaction)
    redacted/             # Individual opinion PDFs (copyrighted content redacted)
    masked/               # Individual opinion PDFs (for LLM ingestion)

The JSON files are designed for use with a review UI — they allow manual inspection and adjustment of detections and redaction boundaries before final output is committed.

Detection Labels

Labels detected across all models (availability depends on model size):

Label	Models	Description
KEY_ICON	all	West key cite icons
DIVIDER	all	Opinion section dividers
PAGE_HEADER	all	Running headers
CASE_CAPTION	all	Opinion title/parties
FOOTNOTES	all	Footnote sections
HEADNOTE_BRACKET	all	Bracketed headnote markers
CASE_METADATA	all	Court, date, counsel info
CASE_SEQUENCE	all	Docket/case sequence numbers
PAGE_NUMBER	all	Page numbers
STATE_ABBREVIATION	all	State abbreviation markers
IMAGE	all	Photos, charts, diagrams
HEADNOTE	all	Headnote text
BACKGROUND	all	Background/procedural history region
SYLLABUS	all	Supreme Court syllabus sections
EDITORIAL	medium, large	Editorial notes
JUDGES	medium, large	Judge name blocks
TEXT_COLUMN	medium, large	Column boundaries
DOCKET	large	Docket number regions
DATE	large	Decision date regions
COURT	large	Court name regions
CITATION	large	Reporter citation regions

Margin Cleanup

After redaction, Blackletter automatically white-outs scan artifacts in page margins using the PDF text layer to find content boundaries. Pages with narrow text spans (appendices, image pages) are skipped automatically.

Requirements

Python 3.12+
Tesseract OCR (for image-only PDFs)

Install tesseract:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt install tesseract-ocr

License

GNU Affero General Public License v3

Contributing

Contributions welcome!

Name		Name	Last commit message	Last commit date
Latest commit History 50 Commits
.github		.github
blackletter		blackletter
tests		tests
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGES.md		CHANGES.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Blackletter

Installation

Quick Start

How It Works

Models

Command Line Options

Process Command

Validate Command

Draw Command

Output Structure

Detection Labels

Margin Cleanup

Requirements

License

Contributing

About

Uh oh!

Releases

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Blackletter

Installation

Quick Start

How It Works

Models

Command Line Options

Process Command

Validate Command

Draw Command

Output Structure

Detection Labels

Margin Cleanup

Requirements

License

Contributing

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages