Current State
The cleaner pipeline handles:
- ✅ NFKC normalization (ligatures, compatibility chars)
- ✅ Smart/curly quotes → straight quotes (U+201C/D, U+2018/9)
Missing
- ❌ Em-dash (—, U+2014) → hyphen-minus (-)
- ❌ En-dash (–, U+2013) → hyphen-minus (-)
- ❌ Non-breaking space (U+00A0) → regular space
- ❌ OCR artifacts: accented characters in English text (é→e, ü→u, etc.)
- ❌ Other typographical symbols (′ prime → ', · middle dot → ., etc.)
Impact
OCR'd legal documents frequently contain these characters, causing citation patterns to fail silently. For example, an em-dash between case name and citation, or a non-breaking space in "42 U.S.C." can prevent matches.
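To make the failure mode concrete, here is a minimal sketch. The pattern `uscPattern` is a hypothetical, simplified stand-in and not the library's actual regex; it assumes the pattern matches a literal space character. (A pattern using `\s` would accept U+00A0 in JavaScript, so the problem affects patterns written with literal spaces.)

```typescript
// Hypothetical, simplified statute pattern written with literal spaces.
// This is an illustration only, not the real citation regex.
const uscPattern: RegExp = /\b(\d+) U\.S\.C\. § (\d+)/;

const clean: string = "42 U.S.C. § 1983";
const ocr: string = "42\u00A0U.S.C. § 1983"; // non-breaking space after "42"

// The clean text matches; the OCR'd text does not, and nothing is thrown.
const cleanMatch: boolean = uscPattern.test(clean);
const ocrMatch: boolean = uscPattern.test(ocr);

// Replacing the non-breaking space before matching restores the match.
const fixedMatch: boolean = uscPattern.test(ocr.replace(/\u00A0/g, " "));

console.log({ cleanMatch, ocrMatch, fixedMatch });
```

The two inputs look identical when rendered, which is why the miss is easy to overlook.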
Proposed Approach
Add a normalizePunctuation cleaner to the default pipeline:
function normalizePunctuation(text: string): string {
  return text
    .replace(/[\u2014\u2015]/g, '-') // em-dash, horizontal bar
    .replace(/[\u2013]/g, '-')       // en-dash
    .replace(/[\u00A0]/g, ' ')       // non-breaking space
    .replace(/[\u2032\u2035]/g, "'"); // prime, reversed prime
}
OCR cleanup (accent stripping such as é→e) could be a separate opt-in cleaner, since it is more aggressive and would also alter legitimately accented text.
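One possible shape for such an opt-in cleaner is sketched below. The name `stripDiacritics` is an assumption for illustration, not an existing API. It relies on the standard `String.prototype.normalize`: decompose to NFD, drop characters in the Combining Diacritical Marks block (U+0300 to U+036F), then recompose.

```typescript
// Hypothetical opt-in cleaner; the name and its place in the pipeline are
// assumptions, not part of the current codebase.
function stripDiacritics(text: string): string {
  return text
    .normalize("NFD")                // é becomes e + U+0301 (combining acute)
    .replace(/[\u0300-\u036f]/g, "") // drop combining diacritical marks
    .normalize("NFC");               // recompose whatever remains
}

console.log(stripDiacritics("résumé")); // resume
console.log(stripDiacritics("Müller")); // Muller
```

Letters with no canonical decomposition, such as ø (U+00F8), pass through unchanged, so this approach would not cover every OCR substitution. It also rewrites genuine accented names, which is the main argument for keeping it out of the default pipeline.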
Upstream Reference
Python eyecite #50