Bug: Annotation can insert markers inside HTML tags when mapping back to original #17

@medelman17

Description

Problem

When text is cleaned (HTML stripped) before extraction, the annotation step maps citation positions back to the original HTML. This mapping can place annotation markers inside HTML tags, producing broken markup.

Example:

Original HTML: '145, <a id="p410" href="#p410">*410</a>11 <em>N. H.</em> 459. 1 Bla'
After cleaning: '145, 41011 N. H. 459. 1 Bla'
Citation found: '41011 N. H. 459' (spans include text from the removed anchor tag)

When annotating back onto the original HTML, the opening marker ('{' in the example below) can be inserted inside an HTML attribute, breaking the tag structure:

'145, <a id="p4{10" href="#p410">*410</a>11 <em>N. H.</em> 459}. 1 Bla'

Current Behavior

Our annotator silently skips citations whose spans don't map cleanly, which is safe but loses annotations. The underlying issue is that the TransformationMap can point into the middle of removed HTML content.

Possible Solutions

  1. Guard string encoding: Before cleaning, replace HTML tags with Unicode private-use-area characters to preserve their boundaries, then decode after annotation
  2. Span snapping: After mapping back to original, snap span boundaries to avoid landing inside HTML tags
  3. Tag-aware annotation: The annotator checks if insertion points are inside tags and adjusts
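
To make options 2 and 3 concrete, here is a minimal sketch of span snapping. Every name in it (Span, findTagRanges, snapOffset, snapSpan, annotate) is hypothetical and not part of the project's API. It assumes tags can be located with a simple regex, which does not hold for '>' inside attribute values, comments, or script bodies. The start offset snaps outward to just before the tag and the end offset snaps outward to just after it.

```typescript
// Hypothetical types and helpers for illustration only.
interface Span {
  start: number; // inclusive offset into the original HTML
  end: number; // exclusive offset into the original HTML
}

// Collect the [start, end) range of every tag. Assumes a naive regex is
// sufficient; it does not handle '>' inside attribute values or comments.
function findTagRanges(html: string): Span[] {
  const ranges: Span[] = [];
  const tagPattern = /<[^>]*>/g;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    ranges.push({ start: match.index, end: match.index + match[0].length });
  }
  return ranges;
}

// If the offset falls strictly inside a tag, move it to the tag's edge.
// Offsets that sit exactly on a tag boundary are already safe.
function snapOffset(
  offset: number,
  tags: Span[],
  direction: "before" | "after"
): number {
  for (const tag of tags) {
    if (offset > tag.start && offset < tag.end) {
      return direction === "before" ? tag.start : tag.end;
    }
  }
  return offset;
}

// Snap both ends outward so the span never starts or ends inside a tag.
function snapSpan(span: Span, html: string): Span {
  const tags = findTagRanges(html);
  return {
    start: snapOffset(span.start, tags, "before"),
    end: snapOffset(span.end, tags, "after"),
  };
}

// Insert the markers at the snapped positions.
function annotate(html: string, span: Span, open = "{", close = "}"): string {
  const s = snapSpan(span, html);
  return (
    html.slice(0, s.start) +
    open +
    html.slice(s.start, s.end) +
    close +
    html.slice(s.end)
  );
}
```

With the example above, a start offset that lands inside the id attribute moves to just before the anchor tag, so the marker wraps the whole tag. Snapping only keeps markers out of tags. It does not guarantee that the annotated span nests cleanly with the surrounding elements, for example when a boundary lands inside a closing tag.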

Upstream Reference

Python eyecite #180: the proposed solution uses Unicode private-use-area encoding to protect HTML tag boundaries.
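
A rough sketch of how private-use-area encoding could work follows. This is one reading of the idea, not the upstream implementation, and encodeTags and decodeTags are hypothetical names. Each tag is replaced by a single character from the BMP private-use area (U+E000 to U+F8FF), so no string offset can fall inside a tag. The sketch assumes the input contains no private-use characters of its own and has no more than 6,400 tags.

```typescript
// Hypothetical helpers for illustration only.
const PUA_START = 0xe000;
const PUA_END = 0xf8ff;

interface Encoded {
  text: string; // HTML with each tag replaced by one private-use character
  tags: string[]; // original tags, indexed by (char code - PUA_START)
}

// Replace every tag with a single placeholder character. Assumes a naive
// tag regex and that the input has no private-use characters already.
function encodeTags(html: string): Encoded {
  const tags: string[] = [];
  const text = html.replace(/<[^>]*>/g, (tag: string): string => {
    const code = PUA_START + tags.length;
    if (code > PUA_END) {
      throw new Error("too many tags for the BMP private-use area");
    }
    tags.push(tag);
    return String.fromCharCode(code);
  });
  return { text, tags };
}

// Restore the original tags. Characters without a table entry are left alone.
function decodeTags(text: string, tags: string[]): string {
  return text.replace(/[\uE000-\uF8FF]/g, (ch: string): string => {
    const tag: string | undefined = tags[ch.charCodeAt(0) - PUA_START];
    return tag === undefined ? ch : tag;
  });
}
```

Markers inserted into the encoded text can sit before or after a placeholder but never inside a tag, so decoding always yields intact tags. The cleaning and extraction steps would still need to treat the placeholder characters as removable content and map offsets back to the encoded text.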

Labels: bug
