Problem
When text is cleaned (HTML stripped) before extraction, the annotation step maps citation positions back to the original HTML. This mapping can place annotation markers inside HTML tags, producing broken markup.
Example:
Original HTML: '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'
After cleaning: '145, 41011 N. H. 459. 1 Bla'
Citation found: '41011 N. H. 459' (the span includes text from inside the removed anchor element)
When annotating back onto the original HTML, the opening annotation marker (shown as '{' below) could be inserted inside an HTML attribute, breaking the tag structure:
'145, <a id="p4{10" href="#p410">*410</a>11 <em>N. H.</em> 459}. 1 Bla'
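A quick way to confirm that an insertion offset is unsafe is to check whether the nearest preceding '<' is still unclosed. The sketch below is in Python for illustration only; `inside_tag` is a hypothetical helper, not part of this project or of eyecite, and it assumes every '<' opens a tag and every '>' closes one (so a literal '>' inside an attribute value or comment is not handled).

```python
def inside_tag(html: str, offset: int) -> bool:
    """Return True if text inserted at `offset` would land inside an HTML tag.

    Simplifying assumption: every '<' opens a tag and every '>' closes one.
    """
    last_open = html.rfind("<", 0, offset)
    last_close = html.rfind(">", 0, offset)
    # An unclosed '<' before the offset means we are between '<' and '>'.
    return last_open > last_close


html = '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'
broken = '145, <a id="p4{10" href="#p410">*410</a>11 <em>N. H.</em> 459}. 1 Bla'

# Recover the offsets (in the original HTML) where the markers were inserted.
start = broken.index("{")
end = broken.index("}") - 1  # subtract the one character added by "{"

start_is_unsafe = inside_tag(html, start)  # opening marker sits in the id attribute
end_is_unsafe = inside_tag(html, end)      # closing marker sits in plain text
```

In this example only the opening marker is flagged; the closing marker lands after "459", outside any tag.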
Current Behavior
Our annotator silently skips citations whose spans don't map cleanly, which is safe but loses annotations. The underlying issue is that the TransformationMap can point into the middle of removed HTML content.
Possible Solutions
- Guard string encoding: Before cleaning, replace HTML tags with Unicode private-use-area characters so their boundaries survive cleaning, then decode after annotation.
- Span snapping: After mapping back to the original, move any span boundary that lands inside an HTML tag to the edge of that tag.
- Tag-aware annotation: The annotator checks whether each insertion point falls inside a tag and moves it outside before inserting.
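Span snapping could look roughly like the following. This is a minimal sketch in Python, not this project's code: `snap_span` and `TAG_RE` are hypothetical names, the regex treats anything between '<' and the next '>' as a tag, and the choice to snap outward (start to the tag's opening '<', end to just past its closing '>') is an assumption. Snapping inward would be equally valid.

```python
import re

# Simplification: a tag is '<', any run of non-'>' characters, then '>'.
TAG_RE = re.compile(r"<[^>]*>")


def snap_span(html: str, start: int, end: int) -> tuple[int, int]:
    """Move span boundaries that fall strictly inside a tag to the tag's edge.

    The span grows outward: start moves to the tag's first character,
    end moves to just past the tag's last character.
    """
    new_start, new_end = start, end
    for m in TAG_RE.finditer(html):
        if m.start() < start < m.end():
            new_start = m.start()
        if m.start() < end < m.end():
            new_end = m.end()
    return new_start, new_end


html = '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'

# Offsets 14 and 60 are where the broken example above placed its markers.
start, end = snap_span(html, 14, 60)
annotated = html[:start] + "{" + html[start:end] + "}" + html[end:]
```

Here the opening marker moves from inside the id attribute to just before the anchor tag. Snapping only guarantees that markers stay out of tags; it does not by itself guarantee that the annotated span wraps balanced elements.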
Upstream Reference
Python eyecite #180: the proposed solution uses Unicode private-use-area encoding to protect HTML tag boundaries.
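To make the private-use-area idea concrete, here is one possible reading of it as a Python sketch. It is not the upstream implementation: `encode_tags`, `decode_tags`, and the one-character-per-tag scheme are assumptions made for illustration, and the input is assumed to contain no private-use characters of its own. Because each tag collapses to a single character, an offset can fall before or after a tag but never inside it.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")      # simplification: '<' ... next '>'
PUA_START, PUA_END = 0xE000, 0xF8FF  # Basic Multilingual Plane private-use area


def encode_tags(html: str) -> tuple[str, list[str]]:
    """Replace each tag with one private-use character; return text and tag table."""
    tags: list[str] = []

    def guard(match: re.Match) -> str:
        if PUA_START + len(tags) > PUA_END:
            raise ValueError("more tags than this sketch can encode")
        tags.append(match.group(0))
        return chr(PUA_START + len(tags) - 1)

    return TAG_RE.sub(guard, html), tags


def decode_tags(text: str, tags: list[str]) -> str:
    """Expand each guard character back into the tag it stands for."""
    limit = PUA_START + len(tags)
    return "".join(
        tags[ord(ch) - PUA_START] if PUA_START <= ord(ch) < limit else ch
        for ch in text
    )


html = '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'
guarded, tags = encode_tags(html)

# Annotate in the guarded text, where every tag is a single indivisible character.
start = guarded.index("410")
end = guarded.index("459") + len("459")
marked = guarded[:start] + "{" + guarded[start:end] + "}" + guarded[end:]
restored = decode_tags(marked, tags)
```

The sketch leaves out the hard part: the extractor would still have to match citations across the guard characters (here, the one between "410" and "11"), which is why the offsets above are located by hand.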