Fix partial paragraph highlighting disappearing in {{content}} (Fixes #446) #854

Open
namcusamlc wants to merge 1 commit into obsidianmd:main from namcusamlc:fix/partial-p-highlights


@namcusamlc

This PR resolves an issue where partially selected highlights on pages with dense <p> tags (such as Gemini App and Investopedia) would fail to render or partially disappear when evaluating the template's {{content}} variable. This directly addresses GitHub Issue #446 ("BUG: {{content}} no longer adding highlights properly").

The Problem

Saved highlights are stored with precise XPaths, text offsets, and character lengths relative to the original page's DOM (e.g. fullHtml). However, the template extraction pipeline previously:

  1. Stripped Highlight Metadata: The getPageContent response payload dropped all rich highlight metadata (xpath, startOffset, endOffset, and id), leaving only a flat array of plain text strings.
  2. Discarded DOM Context: With XPaths and offsets gone, the template content extractor had to fall back on fragile, regex-like text searches against the defuddled HTML.
  3. Dense <p> Mismatches: On pages with dense <p> elements, partial selections (a few words in a paragraph rather than the full block) could not be matched reliably by plain text search, so highlights silently failed or disappeared during markdown generation.
  4. Fragile Range Wrapping: The fallback text search used range.surroundContents to inject <mark> tags. That method throws an InvalidStateError when the range only partially selects a non-text node, which happens whenever a selection starts or ends inside an inline formatting tag such as <i>, <strong>, <em>, or <a>.
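
To illustrate points 1 to 3, here is a minimal sketch of why a plain text string is not enough to locate a partial selection. Only the field names xpath, startOffset, endOffset, and id come from this PR; the StoredHighlight interface, the helper names, and the sample data are hypothetical, and offsets are simplified to positions within a single paragraph string rather than DOM text nodes.

```typescript
// Hypothetical highlight shape. Only xpath, startOffset, endOffset and id are
// named in the PR description; the rest is assumed for illustration.
interface StoredHighlight {
  id: string;
  xpath: string;
  startOffset: number;
  endOffset: number;
  text: string;
}

// Old behaviour as described above: only the plain text survives.
function flattenHighlights(highlights: StoredHighlight[]): string[] {
  return highlights.map((h) => h.text);
}

// Plain-text fallback: can only ever find the first occurrence of the text.
function locateByText(paragraph: string, text: string): number {
  return paragraph.indexOf(text);
}

// Offset-based lookup: uses the stored coordinates instead of searching.
function locateByOffset(paragraph: string, h: StoredHighlight): string {
  return paragraph.slice(h.startOffset, h.endOffset);
}

const paragraph = "The rate rose. Analysts said the rate may fall.";

// The user highlighted the SECOND "rate" (characters 33 to 37).
const highlight: StoredHighlight = {
  id: "h1",
  xpath: "/html/body/article/p[3]", // made-up path
  startOffset: 33,
  endOffset: 37,
  text: "rate",
};

const flattened = flattenHighlights([highlight]);
const textHit = locateByText(paragraph, flattened[0]);
const offsetHit = locateByOffset(paragraph, highlight);

console.log(textHit === highlight.startOffset); // false: the search landed on the first "rate"
console.log(offsetHit);                         // the text at the stored offsets
```

With only the flattened string, the search resolves to the wrong occurrence; with the stored offsets, the intended span is recovered directly.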

The Solution

  1. Preserved Highlight Payload: Updated the page extraction message signatures (in src/content.ts and src/utils/content-extractor.ts) to return the full AnyHighlightData objects containing XPaths and offsets rather than simple text strings.
  2. XPath-Preserving Pipeline: Passed fullHtml directly to processHighlights. We now parse the original document DOM first, evaluate each stored xpath to find the target element, and apply the highlight there.
  3. Robust DOM-Range Extraction: Replaced range.surroundContents with a range.extractContents() pattern. This extracts the content within the highlighted range, appends it inside a new <mark> node, and inserts that node back at the range position. Because extractContents() splits partially selected inline elements instead of throwing, selections that cross inline HTML boundaries no longer raise an exception.
  4. Offset-Based Text Wrapping: Introduced a findTextNodeAtOffset helper that traverses text nodes using a TreeWalker to pinpoint the exact starting and ending offsets for partial selections.
  5. Seamless Defuddle Extraction: After applying highlights to the raw page DOM, we pass the highlighted document directly through DefuddleClass.parse(). Defuddle then extracts the article structure with the <mark> tags intact, so the highlights carry through to the generated markdown.
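
The offset lookup in step 4 can be sketched roughly as follows. This is not the code in this PR: the helper here uses a plain recursive walk over a minimal NodeLike shape (an assumption, chosen so the sketch runs outside a browser) instead of a TreeWalker, and the TextPosition type, the text and el constructors, and the boundary rule are all hypothetical.

```typescript
// Minimal node shape (assumed). Real DOM Text and Element nodes satisfy it
// structurally, so the same function would accept them in a browser.
interface NodeLike {
  nodeType: number; // 1 = element, 3 = text, as in the DOM
  nodeValue: string | null;
  childNodes: ArrayLike<NodeLike>;
}

interface TextPosition {
  node: NodeLike; // the text node containing the offset
  offset: number; // offset local to that text node
}

const TEXT_NODE = 3;

// Walk text nodes in document order, counting characters until the target
// offset falls inside (or at the end of) the current text node.
function findTextNodeAtOffset(root: NodeLike, target: number): TextPosition | null {
  let consumed = 0;

  function visit(node: NodeLike): TextPosition | null {
    if (node.nodeType === TEXT_NODE) {
      const length = (node.nodeValue ?? "").length;
      // "<=" means an offset on a boundary resolves to the end of the
      // earlier text node; a real helper may prefer the start of the next.
      if (target <= consumed + length) {
        return { node, offset: target - consumed };
      }
      consumed += length;
      return null;
    }
    for (let i = 0; i < node.childNodes.length; i++) {
      const hit = visit(node.childNodes[i]);
      if (hit) return hit;
    }
    return null;
  }

  if (target < 0) return null;
  return visit(root);
}

// Tiny constructors for the usage example (hypothetical helpers).
const text = (value: string): NodeLike => ({ nodeType: 3, nodeValue: value, childNodes: [] });
const el = (...children: NodeLike[]): NodeLike => ({ nodeType: 1, nodeValue: null, childNodes: children });

// Models <p>Hello <em>brave</em> world</p>
const brave = text("brave");
const p = el(text("Hello "), el(brave), text(" world"));

const inside = findTextNodeAtOffset(p, 8); // lands inside "brave", local offset 2
const beyond = findTextNodeAtOffset(p, 18); // past the 17 characters of text
console.log(inside?.node.nodeValue, inside?.offset, beyond);
```

Resolving the start and end offsets this way yields a (node, offset) pair for each end of the selection, which is the form Range.setStart and Range.setEnd expect.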

Verification Plan

Manual Verification

  • Verified highlight embedding on pages with dense <p> tags (Investopedia, Gemini Web App).
  • Verified partial paragraph selections render consistently in the final template {{content}} and "Clip to Obsidian" output.
