Feature: Expand Unicode normalization (em-dashes, OCR artifacts) #11

## Current State

The cleaner pipeline handles:
- ✅ NFKC normalization (ligatures, compatibility chars)
- ✅ Smart/curly quotes → straight quotes (U+201C/D, U+2018/9)

## Missing

- ❌ Em-dash (—, U+2014) → hyphen-minus (-)
- ❌ En-dash (–, U+2013) → hyphen-minus (-)
- ❌ Non-breaking space (U+00A0) → regular space
- ❌ OCR artifacts: accented characters in English text (é→e, ü→u, etc.)
- ❌ Other typographical symbols (′ prime → ', · middle dot → ., etc.)

## Impact

OCR'd legal documents frequently contain these characters, which cause citation patterns to fail silently. For example, an em-dash between a case name and its citation, or a non-breaking space inside "42 U.S.C.", can prevent a match.
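
To make the failure mode concrete, here is a minimal sketch. The regex `uscPattern` is a hypothetical stand-in written for this illustration, not one of the extractor's real patterns; it simply assumes a pattern that uses literal spaces.

```typescript
// Hypothetical pattern for illustration only; not taken from the actual extractor.
const uscPattern = /\d+ U\.S\.C\. § \d+/;

const cleanText = "42 U.S.C. § 1983";
const ocrText = "42\u00A0U.S.C. § 1983"; // non-breaking space after "42"

// A literal space in the pattern does not match U+00A0, so the OCR'd text is skipped.
const cleanMatches: boolean = uscPattern.test(cleanText);
const ocrMatches: boolean = uscPattern.test(ocrText);

// Replacing the non-breaking space first restores the match.
const normalizedMatches: boolean = uscPattern.test(ocrText.replace(/\u00A0/g, " "));

console.log(cleanMatches, ocrMatches, normalizedMatches); // true false true
```

No error is raised in the middle case; the citation is just never found, which is why the failure is easy to miss.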

## Proposed Approach

Add a `normalizePunctuation` cleaner to the default pipeline:

```typescript
function normalizePunctuation(text: string): string {
  return text
    .replace(/[\u2014\u2015]/g, '-')    // em-dash, horizontal bar
    .replace(/\u2013/g, '-')            // en-dash
    .replace(/\u00A0/g, ' ')            // non-breaking space
    .replace(/[\u2032\u2035]/g, "'");   // prime, reversed prime
}
```

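OCR cleanup (accent stripping such as é→e, ü→u) could be a separate opt-in cleaner, since it is more aggressive and would also alter names that are legitimately accented.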
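
One possible shape for that opt-in cleaner is sketched below. The name `stripDiacritics` is an assumption made for this example, not an existing function in the pipeline. It relies only on the standard `String.prototype.normalize`: decompose to NFD, drop the combining marks, then recompose.

```typescript
// Hypothetical opt-in cleaner; the name `stripDiacritics` is an assumption.
function stripDiacritics(text: string): string {
  return text
    .normalize("NFD")                 // decompose: "é" becomes "e" + U+0301
    .replace(/[\u0300-\u036f]/g, "")  // remove combining diacritical marks
    .normalize("NFC");                // recompose whatever remains
}

const stripped: string = stripDiacritics("Pérez v. Müller");
console.log(stripped); // "Perez v. Muller"
```

This only covers letters that decompose into a base letter plus a combining mark. Characters such as ß or ø have no such decomposition and would pass through unchanged, so a real implementation might need an explicit mapping table on top.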

## Upstream Reference

Python eyecite [#50](https://github.com/freelawproject/eyecite/issues/50)
