Bug: Annotation can insert markers inside HTML tags when mapping back to original

## Problem

When text is cleaned (HTML stripped) before extraction, the annotation step maps citation positions back to the original HTML. This mapping can place annotation markers inside HTML tags, producing broken markup.

Example:
```
Original HTML: '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'
After cleaning: '145, 41011 N. H. 459. 1 Bla'
Citation found: '41011 N. H. 459' (spans include text from the removed anchor tag)
```

When annotating back onto the original HTML, the opening marker (`{` in this example) can be inserted inside an HTML attribute, breaking the tag structure:
```
'145, <a id="p4{10" href="#p410">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
```

## Current Behavior

Our annotator silently skips citations whose spans don't map cleanly. This keeps the markup valid but drops the annotation. The underlying issue is that the `TransformationMap` can return positions in the middle of removed HTML content, such as inside a tag.
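The skip behavior can be illustrated with a minimal sketch. This is not the project's code: `lands_inside_tag` and `annotate_or_skip` are hypothetical names, and the regex is a simplification that assumes no `>` appears inside attribute values.

```python
import re

# Simplified tag matcher (assumption: no ">" inside attribute values).
TAG_RE = re.compile(r"<[^>]*>")


def lands_inside_tag(html: str, pos: int) -> bool:
    """Return True when inserting text at `pos` would split an HTML tag."""
    return any(m.start() < pos < m.end() for m in TAG_RE.finditer(html))


def annotate_or_skip(html: str, start: int, end: int,
                     before: str = "{", after: str = "}"):
    """Wrap html[start:end] in markers, or return None if either
    boundary falls inside a tag (the annotation is dropped)."""
    if lands_inside_tag(html, start) or lands_inside_tag(html, end):
        return None
    return html[:start] + before + html[start:end] + after + html[end:]
```

With the example above, a start offset that points between `p4` and `10` in the `id` attribute makes `annotate_or_skip` return `None`, so the markup stays intact but the citation goes unannotated.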

## Possible Solutions

1. **Guard string encoding**: Before cleaning, replace HTML tags with Unicode private-use-area characters to preserve their boundaries, then decode after annotation.
2. **Span snapping**: After mapping back to the original, snap span boundaries so they never land inside an HTML tag.
3. **Tag-aware annotation**: The annotator checks whether each insertion point is inside a tag and adjusts it if so.
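A rough sketch of span snapping (option 2) follows. The function names are hypothetical, the regex assumes no `>` inside attribute values, and the choice to move the start boundary left and the end boundary right (so the span only grows) is one possible policy, not a settled design.

```python
import re

# Simplified tag matcher (assumption: no ">" inside attribute values).
TAG_RE = re.compile(r"<[^>]*>")


def snap(html: str, pos: int, toward_start: bool) -> int:
    """If `pos` falls strictly inside a tag, move it to the tag's
    start or end; otherwise return it unchanged."""
    for m in TAG_RE.finditer(html):
        if m.start() < pos < m.end():
            return m.start() if toward_start else m.end()
    return pos


def annotate_snapped(html: str, start: int, end: int,
                     before: str = "{", after: str = "}") -> str:
    """Insert markers around html[start:end] after snapping both
    boundaries out of any tag they landed in."""
    start = snap(html, start, toward_start=True)
    end = snap(html, end, toward_start=False)
    return html[:start] + before + html[start:end] + after + html[end:]
```

Snapping keeps every tag intact, but the annotated region may then cross element boundaries, so markers that are themselves HTML elements could still produce improperly nested markup.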

## Upstream Reference

Python eyecite [#180](https://github.com/freelawproject/eyecite/issues/180): the proposed solution uses Unicode private-use-area encoding to protect HTML tag boundaries.
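The private-use-area idea could look roughly like the sketch below. It is an illustration under assumptions, not the upstream implementation: `encode_tags` and `decode_tags` are hypothetical names, the input is assumed to contain no private-use characters of its own, and the regex assumes no `>` inside attribute values.

```python
import re

# Simplified tag matcher (assumption: no ">" inside attribute values).
TAG_RE = re.compile(r"<[^>]*>")
PUA_START = 0xE000  # first code point of the BMP private-use area
PUA_END = 0xF8FF    # last code point of the BMP private-use area


def encode_tags(html: str):
    """Replace each tag with one private-use character and return the
    encoded string plus the list of original tags."""
    tags = []

    def replace(match):
        code_point = PUA_START + len(tags)
        if code_point > PUA_END:
            raise ValueError("too many tags for one-character placeholders")
        tags.append(match.group(0))
        return chr(code_point)

    return TAG_RE.sub(replace, html), tags


def decode_tags(text: str, tags) -> str:
    """Expand each placeholder character back into its original tag."""
    limit = PUA_START + len(tags)
    return "".join(
        tags[ord(ch) - PUA_START] if PUA_START <= ord(ch) < limit else ch
        for ch in text
    )
```

Because each tag is a single character in the encoded string, any offset into that string falls either before or after a tag, never inside one. Markers inserted into the encoded string therefore survive decoding without splitting a tag.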
