Coalesce <br><br> hard-break runs into a paragraph break#125
Merged
Conversation
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #121
Google Docs export-html marks paragraph boundaries with `<br><br>`, which Turndown faithfully emits as two consecutive trailing-two-space-newline pairs (` \n \n`). CommonMark renderers (GitHub, MkDocs, Pandoc) then treat that as two `<br>`s inside one `<p>`, cramming captions against images with no vertical spacing — and the "blank" line in the markdown source actually carries trailing whitespace, polluting diffs. Fix: after Turndown runs, collapse runs of two or more adjacent hard breaks to `\n\n`. Two or more adjacent hard breaks in the source HTML always mean "paragraph break", never "stack of <br>s". Applied in both `convertHtmlToMarkdown` (used by `--capture api`) and `convertHtmlToMarkdownEnhanced`. The Rust converter (html2md crate) already handled this correctly, so only JS source needed a fix; new Rust integration test guards parity. Also syncs `rust/Cargo.lock` to the 0.3.14 version that auto-bumped on main without a corresponding lockfile update. Refs #121.
<p> (visible in renderers)
Collaborator
Author
Working session summaryBoth empty so far. The original until-loop background job is running and will notify on completion. Let me wait passively. This summary was automatically extracted from the AI working session output. |
Collaborator
Author
🤖 Solution Draft LogThis log file contains the complete execution trace of the AI solution draft process. 💰 Cost: $8.091367📊 Context and tokens usage:Claude Opus 4.7: (2 sub-sessions)
Total: (2.5K new + 157.2K cache writes + 11.8M cache reads) input tokens, 48.7K output tokens, $8.091367 cost 🤖 Models used:
📎 Log file uploaded as Gist (3475KB)Now working session is ended, feel free to review and add any feedback on the solution draft. |
Collaborator
Author
✅ Ready to mergeThis pull request is now ready to be merged:
Monitored by hive-mind with --auto-restart-until-mergeable flag |
This reverts commit 0037017.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #121.
Summary
JS
--capture api(export-html → markdown) emitted trailing two-space CommonMark hard breaks (\n) on lines that should end paragraphs. Two adjacent hard breaks in the markdown source render as<br><br>inside a single<p>(GitHub, MkDocs, Pandoc), cramming captions against images with no vertical spacing — and the "blank" separator line actually carries trailing whitespace, polluting diffs.Root cause
Google Docs export-html wraps every visual line break in
<br>, including<br><br>at paragraph boundaries:Turndown faithfully emits one trailing-two-space-newline pair for each
<br>, producing\n \nbetween the two captions. CommonMark says two or more adjacent hard breaks still belong to one paragraph, so renderers cram both captions into a single<p>. The Rusthtml2mdcrate already coalesces consecutive<br>s correctly, so this is a JS-only bug.Fix
After Turndown runs, collapse runs of two or more adjacent hard breaks to
\n\n:Applied in both
convertHtmlToMarkdown(used by--capture api) andconvertHtmlToMarkdownEnhanced. The regex tolerates any post-processing that might leave 2+ trailing spaces on a hard break.Tests
New regression tests with a shared fixture (
paragraph-vs-line-break.html) modelled on the issue's spec — two captions separated by<br><br>inside a single<p>:js/tests/unit/paragraph-vs-line-break.test.js— verifies bothconvertHtmlToMarkdownandconvertHtmlToMarkdownEnhancedproduce a truly empty separator line (no trailing whitespace) and that markdown-it CommonMark renders the two captions as separate<p>s.rust/tests/integration/paragraph_vs_line_break.rs— same checks as a parity guard (Rust converter never had the bug).Test plan
cd js && npm test -- tests/unit— 257/261 pass; the 4 failures are pre-existingbrowser.test.jsfailures from a missing Playwright Chromium binary in the sandbox, unrelated to this change.cd js && npm test -- tests/unit/paragraph-vs-line-break.test.js tests/unit/html2md.test.js tests/unit/postprocess.test.js tests/unit/gdocs.test.js tests/unit/html2md-br-in-list-item.test.js— all pass.cd rust && cargo test— 92 unit + 96 integration tests pass, 0 failed.cd js && npm run lint— 0 errors.cd js && npx prettier --checkon changed files — all formatted.cd rust && cargo clippy --all-targets -- -D warnings— clean.Files changed
js/src/lib.js—coalesceBrRunsToParagraphBreakhelper + two call sites.js/.changeset/issue-121-paragraph-vs-line-break.md— patch changeset.js/tests/fixtures/paragraph-vs-line-break.html,js/tests/unit/paragraph-vs-line-break.test.js— JS regression test (usesmarkdown-itto verify CommonMark rendering).js/package.json,js/package-lock.json,js/yarn.lock— addmarkdown-itas a dev dependency.rust/tests/fixtures/paragraph-vs-line-break.html,rust/tests/integration/paragraph_vs_line_break.rs,rust/tests/integration/mod.rs— Rust parity test.rust/Cargo.lock— sync to 0.3.14 (main bumped Cargo.toml without lockfile update).experiments/repro-br-paragraph.mjs— reproduction script that pinpointed the bug.