Feature: Expand Unicode normalization (em-dashes, OCR artifacts) #11

@medelman17

Current State

The cleaner pipeline handles:

  • ✅ NFKC normalization (ligatures, compatibility chars)
  • ✅ Smart/curly quotes → straight quotes (U+201C/D, U+2018/9)

Missing

  • ❌ Em-dash (—, U+2014) → hyphen-minus (-)
  • ❌ En-dash (–, U+2013) → hyphen-minus (-)
  • ❌ Non-breaking space (U+00A0) → regular space
  • ❌ OCR artifacts: accented characters in English text (é→e, ü→u, etc.)
  • ❌ Other typographical symbols (′ prime → ', · middle dot → ., etc.)

Impact

OCR'd legal documents frequently contain these characters, causing citation patterns to fail silently. For example, an em-dash between case name and citation, or a non-breaking space in "42 U.S.C." can prevent matches.
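To make the failure mode concrete, here is a minimal sketch. The pattern and the `matchesStatute` helper are hypothetical stand-ins for illustration, not the library's actual citation regex. The pattern uses a literal space between tokens, so a non-breaking space silently defeats it. (A pattern written with `\s` would not have this particular problem in JavaScript, since `\s` matches U+00A0.)

```typescript
// Hypothetical citation pattern, for illustration only.
// It uses literal spaces, which is what makes it sensitive to U+00A0.
const STATUTE_PATTERN = /\b\d+ U\.S\.C\. § \d+/;

function matchesStatute(text: string): boolean {
  return STATUTE_PATTERN.test(text);
}

const clean = "42 U.S.C. § 1983";
const withNbsp = "42\u00A0U.S.C. § 1983"; // non-breaking space after "42"

console.log(matchesStatute(clean));    // true
console.log(matchesStatute(withNbsp)); // false: no error, just no match

// Replacing U+00A0 with a regular space restores the match.
console.log(matchesStatute(withNbsp.replace(/\u00A0/g, " "))); // true
```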

Proposed Approach

Add a normalizePunctuation cleaner to the default pipeline:

function normalizePunctuation(text: string): string {
  return text
    .replace(/[\u2013\u2014\u2015]/g, '-') // en-dash, em-dash, horizontal bar
    .replace(/\u00A0/g, ' ')               // non-breaking space
    .replace(/[\u2032\u2035]/g, "'");      // prime, reversed prime
}

OCR cleanup (stripping accents such as é→e) could be a separate opt-in cleaner, since it is more aggressive and would also alter legitimately accented names.
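One possible shape for the opt-in cleaner is sketched below. The `Cleaner` type, `stripDiacritics`, `runCleaners`, and the two cleaner lists are all hypothetical names chosen for illustration and may not match the project's real pipeline API. The sketch decomposes text with NFD and drops marks from the Combining Diacritical Marks block (U+0300 to U+036F). Letters with no canonical decomposition, such as ø, pass through unchanged.

```typescript
// Hypothetical cleaner signature; the real pipeline type may differ.
type Cleaner = (text: string) => string;

// Opt-in: decompose accented letters (NFD), drop combining marks in
// U+0300..U+036F, then recompose whatever remains (NFC).
function stripDiacritics(text: string): string {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .normalize("NFC");
}

// Apply cleaners left to right.
function runCleaners(text: string, cleaners: Cleaner[]): string {
  return cleaners.reduce((acc, clean) => clean(acc), text);
}

// Stand-in for the default pipeline; only the NBSP step is shown here.
const normalizeNbsp: Cleaner = (t) => t.replace(/\u00A0/g, " ");
const defaultCleaners: Cleaner[] = [normalizeNbsp];

// Callers who know their input is OCR'd English text opt in explicitly.
const ocrCleaners: Cleaner[] = [...defaultCleaners, stripDiacritics];

console.log(runCleaners("Décision v.\u00A0Müller", defaultCleaners)); // "Décision v. Müller"
console.log(runCleaners("Décision v.\u00A0Müller", ocrCleaners));     // "Decision v. Muller"
```

Keeping accent stripping out of the default list means a real party name like "Müller" is preserved unless the caller asks otherwise.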

Upstream Reference

Python eyecite #50

Labels: enhancement
