Skip to content

Coalesce <br><br> hard-break runs into a paragraph break#125

Merged
konard merged 4 commits intomainfrom
issue-121-71566b31e0d5
May 6, 2026
Merged

Coalesce <br><br> hard-break runs into a paragraph break#125
konard merged 4 commits intomainfrom
issue-121-71566b31e0d5

Conversation

@konard
Copy link
Copy Markdown
Collaborator

@konard konard commented May 5, 2026

Fixes #121.

Summary

JS --capture api (export-html → markdown) emitted trailing two-space CommonMark hard breaks ( \n) on lines that should end paragraphs. Two adjacent hard breaks in the markdown source render as <br><br> inside a single <p> (GitHub, MkDocs, Pandoc), cramming captions against images with no vertical spacing — and the "blank" separator line actually carries trailing whitespace, polluting diffs.

Root cause

Google Docs export-html wraps every visual line break in <br>, including <br><br> at paragraph boundaries:

<p>
  Caption A:<br>
  <img ><br><br>
  Caption B:<br>
  <img >
</p>

Turndown faithfully emits one trailing-two-space-newline pair for each <br>, producing \n \n between the two captions. CommonMark says two or more adjacent hard breaks still belong to one paragraph, so renderers cram both captions into a single <p>. The Rust html2md crate already coalesces consecutive <br>s correctly, so this is a JS-only bug.

Fix

After Turndown runs, collapse runs of two or more adjacent hard breaks to \n\n:

function coalesceBrRunsToParagraphBreak(markdown) {
  return markdown.replace(/(?: {2,}\n){2,}/g, '\n\n');
}

Applied in both convertHtmlToMarkdown (used by --capture api) and convertHtmlToMarkdownEnhanced. The regex tolerates any post-processing that might leave 2+ trailing spaces on a hard break.

Tests

New regression tests with a shared fixture (paragraph-vs-line-break.html) modelled on the issue's spec — two captions separated by <br><br> inside a single <p>:

  • js/tests/unit/paragraph-vs-line-break.test.js — verifies both convertHtmlToMarkdown and convertHtmlToMarkdownEnhanced produce a truly empty separator line (no trailing whitespace) and that markdown-it CommonMark renders the two captions as separate <p>s.
  • rust/tests/integration/paragraph_vs_line_break.rs — same checks as a parity guard (Rust converter never had the bug).

Test plan

  • cd js && npm test -- tests/unit — 257/261 pass; the 4 failures are pre-existing browser.test.js failures from a missing Playwright Chromium binary in the sandbox, unrelated to this change.
  • cd js && npm test -- tests/unit/paragraph-vs-line-break.test.js tests/unit/html2md.test.js tests/unit/postprocess.test.js tests/unit/gdocs.test.js tests/unit/html2md-br-in-list-item.test.js — all pass.
  • cd rust && cargo test — 92 unit + 96 integration tests pass, 0 failed.
  • cd js && npm run lint — 0 errors.
  • cd js && npx prettier --check on changed files — all formatted.
  • cd rust && cargo clippy --all-targets -- -D warnings — clean.

Files changed

  • js/src/lib.jscoalesceBrRunsToParagraphBreak helper + two call sites.
  • js/.changeset/issue-121-paragraph-vs-line-break.md — patch changeset.
  • js/tests/fixtures/paragraph-vs-line-break.html, js/tests/unit/paragraph-vs-line-break.test.js — JS regression test (uses markdown-it to verify CommonMark rendering).
  • js/package.json, js/package-lock.json, js/yarn.lock — add markdown-it as a dev dependency.
  • rust/tests/fixtures/paragraph-vs-line-break.html, rust/tests/integration/paragraph_vs_line_break.rs, rust/tests/integration/mod.rs — Rust parity test.
  • rust/Cargo.lock — sync to 0.3.14 (main bumped Cargo.toml without lockfile update).
  • experiments/repro-br-paragraph.mjs — reproduction script that pinpointed the bug.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #121
@konard konard self-assigned this May 5, 2026
konard added 2 commits May 5, 2026 21:44
Google Docs export-html marks paragraph boundaries with `<br><br>`, which
Turndown faithfully emits as two consecutive trailing-two-space-newline
pairs (`  \n  \n`). CommonMark renderers (GitHub, MkDocs, Pandoc) then
treat that as two `<br>`s inside one `<p>`, cramming captions against
images with no vertical spacing — and the "blank" line in the markdown
source actually carries trailing whitespace, polluting diffs.

Fix: after Turndown runs, collapse runs of two or more adjacent hard
breaks to `\n\n`. Two or more adjacent hard breaks in the source HTML
always mean "paragraph break", never "stack of <br>s".

Applied in both `convertHtmlToMarkdown` (used by `--capture api`) and
`convertHtmlToMarkdownEnhanced`. The Rust converter (html2md crate)
already handled this correctly, so only JS source needed a fix; new
Rust integration test guards parity.

Also syncs `rust/Cargo.lock` to the 0.3.14 version that auto-bumped on
main without a corresponding lockfile update.

Refs #121.
@konard konard changed the title [WIP] Trailing two-space hard breaks on every line collapse paragraphs into single <p> (visible in renderers) Coalesce <br><br> hard-break runs into a paragraph break May 5, 2026
@konard konard marked this pull request as ready for review May 5, 2026 21:48
@konard
Copy link
Copy Markdown
Collaborator Author

konard commented May 5, 2026

Working session summary

Both empty so far. The original until-loop background job is running and will notify on completion. Let me wait passively.


This summary was automatically extracted from the AI working session output.

@konard
Copy link
Copy Markdown
Collaborator Author

konard commented May 5, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $8.091367

📊 Context and tokens usage:

Claude Opus 4.7: (2 sub-sessions)

  1. 116.8K / 1M (12%) input tokens, 33.4K / 128K (26%) output tokens
  2. 59.0K / 1M (6%) input tokens, 8.9K / 128K (7%) output tokens

Total: (2.5K new + 157.2K cache writes + 11.8M cache reads) input tokens, 48.7K output tokens, $8.091367 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (3475KB)


Now working session is ended, feel free to review and add any feedback on the solution draft.

@konard
Copy link
Copy Markdown
Collaborator Author

konard commented May 5, 2026

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

@konard konard merged commit a3b8d98 into main May 6, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Trailing two-space hard breaks on every line collapse paragraphs into single <p> (visible in renderers)

1 participant