Commit b089f21

Merge pull request #109 from link-assistant/issue-108-a4baf412d9cb
Fix Google Docs browser list semantics
2 parents 0422bcd + ac65ed6 commit b089f21

45 files changed

Lines changed: 26812 additions & 53 deletions

Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,138 @@
1+
# Issue 108 Case Study: Google Docs Browser List Semantics
2+
3+
## Summary
4+
5+
Issue 108 reported that Google Docs browser capture rendered real ordered lists as unordered lists and misclassified an indented continuation paragraph as a blockquote. Both bugs reproduced in the JavaScript and Rust implementations because both parsers relied on the same `DOCS_modelChunk` heuristics.
6+
7+
The fix keeps the browser model as the main source for document text, inline styles, tables, and images, but augments paragraph semantics with the document's exported HTML. The export contains real `<ol>`, `<ul>`, and `<blockquote>` structure, so the parsers can align those semantic hints to model paragraphs by normalized text order instead of guessing from Google-internal list ids or indents.
8+
9+
## Timeline
10+
11+
- 2026-04-27 05:16 UTC: Issue 108 opened with a public v2 Google Docs reproducer and exact browser/API markdown differences.
12+
- 2026-04-27 06:09 UTC: Issue comment requested adding the v2 document to the live `GDOCS_INTEGRATION` matrix for JS and Rust.
13+
- 2026-04-27 06:16 UTC: Issue comment requested a repository-local case study with downloaded logs/data and online research.
14+
- 2026-04-27 06:17 UTC: PR 109 was opened as a draft from branch `issue-108-a4baf412d9cb`; initial CI on commit `4a745f662a4329eb753ca2fb097c5491769917da` passed.
15+
- 2026-04-27: Investigation downloaded issue, PR, CI, related issue/PR data, a live `DOCS_modelChunk` dump, the v2 export HTML, and local verification logs into this case-study folder.
16+
17+
## Captured Artifacts
18+
19+
- `data/issue.json`, `data/issue-comments.json`: original issue and follow-up requirements.
20+
- `data/pr-109.json`, `data/pr-109-comments.json`, `data/pr-109-review-comments.json`, `data/pr-109-reviews.json`: draft PR state and discussions.
21+
- `data/ci-runs.json`, `logs/javascript-checks-and-release-24979603248.log`, `logs/rust-checks-and-release-24979603264.log`: initial PR CI state.
22+
- `data/related-issue-100.json`, `data/related-issue-106.json`, `data/related-pr-101.json`, `data/related-pr-107.json`, `data/related-merged-prs.json`: nearby Google Docs parser work.
23+
- `data/code-search-DOCS_modelChunk.json`: related code search results.
24+
- `experiments/model-dump/model-data.json`: live browser `DOCS_modelChunk` capture from the public v2 document.
25+
- `experiments/model-dump/summary.json` and `experiments/model-analysis.json`: reduced model analysis used for the fix.
26+
- `experiments/markdown-test-document-v2-export.html`: exported HTML from the same v2 document.
27+
- `logs/model-dump.log`: model dump command log.
28+
29+
## Requirements
30+
31+
- Fix browser-mode Google Docs capture for ordered lists in both JavaScript and Rust.
32+
- Preserve unordered lists in the same Section 15 reproducer.
33+
- Prevent the continuation paragraph after `Step one` from becoming a blockquote.
34+
- Keep the existing public fixture coverage intact while adding the v2 reproducer to JS and Rust live integration tests.
35+
- Add a failing regression test before or with the fix.
36+
- Download and commit issue/PR/CI/research data under `docs/case-studies/issue-108/`.
37+
- Search for relevant online facts and document whether an upstream issue is needed.
38+
- Keep diagnostic or experiment scripts in `experiments/` for reuse.
39+
40+
## Root Causes
41+
42+
### Ordered Lists
43+
44+
`DOCS_modelChunk` exposes list ids, nesting, paragraph indents, and style records, but the captured Section 15 records did not include a stable ordered-vs-unordered marker signal. The old implementations tried to infer ordered lists from a hardcoded id allowlist plus fixture-specific item text such as "Parent item" and "ordered". That matched the previous public fixture but failed on ordinary content like `Apple`, `Banana`, and `Cherry`.
45+
46+
The v2 model dump shows the problem directly: Section 15 ordered list items use generated ids such as `kix.irei0efbjnvi`, while the unordered control list uses `kix.t2auk5oln5j6`. Both lists carry similar list style metadata, so neither the id nor the item text can reliably distinguish an ordered list from an unordered one.
47+
48+
The original public fixture also showed a second export-alignment detail: a nested HTML `<li>`'s full text includes descendant list-item text. Semantic extraction must use a list item's own text while skipping nested `<ol>`/`<ul>` descendants, otherwise parent items such as `Parent item 1` will not align to the browser-model paragraph.
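
A minimal sketch of the own-text rule, assuming a hypothetical plain-object tree (`{ tag, children }`) in place of the real `cheerio`/`scraper` DOM. The helper names `ownText` and `normalize` are illustrative and not taken from the repository.

```javascript
// Hypothetical DOM-like tree: children are strings (text nodes) or nested
// element objects. This stands in for a parsed export HTML fragment.
function ownText(node) {
  let text = '';
  for (const child of node.children) {
    if (typeof child === 'string') {
      text += child;
    } else if (child.tag !== 'ol' && child.tag !== 'ul') {
      // Descend into inline wrappers such as <span>, never into nested lists.
      text += ownText(child);
    }
  }
  return text;
}

// Collapse whitespace runs so export text and model text compare equal.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const parentItem = {
  tag: 'li',
  children: [
    { tag: 'span', children: ['Parent item 1'] },
    {
      tag: 'ol',
      children: [
        { tag: 'li', children: [{ tag: 'span', children: ['Child item'] }] },
      ],
    },
  ],
};

console.log(normalize(ownText(parentItem))); // "Parent item 1"
```

Collecting the full descendant text instead would yield the parent and child text concatenated, which no single browser-model paragraph contains.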
49+
50+
### Continuation Paragraph
51+
52+
The continuation paragraph has no list record, but its paragraph indent resembles that of quoted content. The previous blockquote heuristic classified a paragraph with equal left and first-line indents as a blockquote. In the export HTML, that same text is a plain `<p>`, so the browser model's indent alone was insufficient evidence.
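
The decision can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the field names `indentStart` and `indentFirstLine`, the hint shape `{ kind }`, and the non-zero indent check are all hypothetical.

```javascript
// Legacy-style heuristic as described above: equal left and first-line
// indents read as a quote. The "> 0" guard is an assumption of this sketch.
function looksLikeQuoteByIndent(paragraph) {
  return (
    paragraph.indentStart > 0 &&
    paragraph.indentStart === paragraph.indentFirstLine
  );
}

// Prefer the export HTML hint when one is available; only fall back to the
// indent heuristic when no hint could be aligned to this paragraph.
function classifyParagraph(paragraph, hint) {
  if (hint) {
    return hint.kind === 'blockquote' ? 'quote' : 'paragraph';
  }
  return looksLikeQuoteByIndent(paragraph) ? 'quote' : 'paragraph';
}

const continuation = { indentStart: 36, indentFirstLine: 36 };
console.log(classifyParagraph(continuation, { kind: 'p' })); // "paragraph"
console.log(classifyParagraph(continuation, null)); // "quote"
```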
53+
54+
### HTML Entity Decoding
55+
56+
The v2 document contains literal explanatory text with escaped HTML tags, including examples such as `&lt;ol&gt;` and `&lt;blockquote&gt;`. The API/export path decoded all HTML payloads before parsing, which could turn escaped text into real elements and corrupt markdown conversion. HTML format responses must be parsed as HTML, not globally entity-decoded first.
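
The failure mode is easy to reproduce in isolation. The sketch below uses a deliberately tiny, hypothetical entity decoder and a crude tag counter; neither is the repository's implementation, and a real HTML parser would replace `countTags`.

```javascript
// Tiny decoder covering only the entities needed for this illustration.
// "&amp;" is decoded last so "&amp;lt;" is not decoded twice.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// Crude stand-in for "how many real <tag> elements would a parser see".
function countTags(html, tag) {
  return (html.match(new RegExp(`<${tag}[\\s>]`, 'g')) || []).length;
}

const exported = '<p>Numbered lists export as &lt;ol&gt; elements.</p>';

console.log(countTags(exported, 'ol')); // 0: the text only mentions <ol>
console.log(countTags(decodeEntities(exported), 'ol')); // 1: a phantom element
```

Parsing the payload as HTML first and decoding entities only inside text nodes keeps the literal `&lt;ol&gt;` as text.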
57+
58+
### Inherited Inline Styles
59+
60+
The model dump also exposed inherited italic ranges over Section 15 (`ts_it: true` with `ts_it_i: true`). Treating inherited true flags as explicit styles italicized content that was not italic in the document. The parser now applies bold/italic/strike only when the value is true and the inherited flag is not true.
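
A minimal sketch of that rule. Only `ts_it` and `ts_it_i` appear in the captured data described above; the assumption that other style keys follow the same `<key>_i` naming, and the helper name `isExplicitlySet`, are illustrative.

```javascript
// A style applies only when its value is true AND it is not marked inherited.
// Assumes the inherited flag is stored under `${key}_i`, as with ts_it/ts_it_i.
function isExplicitlySet(style, key) {
  return style[key] === true && style[`${key}_i`] !== true;
}

console.log(isExplicitlySet({ ts_it: true, ts_it_i: true }, 'ts_it')); // false
console.log(isExplicitlySet({ ts_it: true, ts_it_i: false }, 'ts_it')); // true
console.log(isExplicitlySet({ ts_it: true }, 'ts_it')); // true
```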
61+
62+
## Online Research
63+
64+
- Google documents can be exported as byte content using Drive export mechanisms, and Google documents support web page/HTML export formats. See Google Drive API docs for [download/export behavior](https://developers.google.com/workspace/drive/api/guides/manage-downloads), [`files.export`](https://developers.google.com/workspace/drive/api/reference/rest/v3/files/export), and [export MIME formats](https://developers.google.com/workspace/drive/api/guides/ref-export-formats).
65+
- Google Docs export HTML represents ordered-list markers through normal HTML list structure and CSS counters. MDN documents that ordered lists have an implicit `list-item` counter and that `counter()`/`counters()` render counter values in generated content: [Using CSS counters](https://developer.mozilla.org/en-US/docs/Web/CSS/Guides/Counter_styles/Using_counters).
66+
- The existing browser capture flow instruments pages before Google Docs scripts load. Chrome DevTools Protocol documents `Page.addScriptToEvaluateOnNewDocument` as running scripts in frames before frame scripts execute: [CDP Page domain](https://chromedevtools.github.io/devtools-protocol/1-3/Page/#method-addScriptToEvaluateOnNewDocument).
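
For reference, the documented Drive `files.export` request from the first bullet can be sketched as a URL builder. This only illustrates the public REST endpoint; the helper names are hypothetical, and the repository may fetch export HTML through a different URL on the public document page.

```javascript
// Pull the document id out of a docs.google.com document URL.
function extractDocumentId(docUrl) {
  const match = new URL(docUrl).pathname.match(/^\/document\/d\/([^/]+)/);
  return match ? match[1] : null;
}

// Build a Drive API v3 files.export URL; mimeType is a required query parameter.
// Real requests to this endpoint also need authorization, which is omitted here.
function buildDriveExportUrl(fileId, mimeType = 'text/html') {
  const url = new URL(
    `https://www.googleapis.com/drive/v3/files/${encodeURIComponent(fileId)}/export`
  );
  url.searchParams.set('mimeType', mimeType);
  return url.toString();
}

const docId = extractDocumentId(
  'https://docs.google.com/document/d/1Rvaod_u2wgkAUNdXG-e29yQem-P36KAQ/edit'
);
console.log(docId);
console.log(buildDriveExportUrl(docId));
```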
67+
68+
No upstream issue was filed. The failure was in this repository's interpretation of captured Google Docs data, not a demonstrated defect in Google Docs, Chrome DevTools Protocol, `cheerio`, or `scraper`.
69+
70+
## Solutions Considered
71+
72+
The rejected path was to expand the list-id allowlist or text regex. That would remain unstable because Google list ids are document-local and item text is arbitrary.
73+
74+
The implemented path extracts semantic hints from export HTML:
75+
76+
- Parse export HTML with `cheerio` in JS and `scraper` in Rust.
77+
- Walk paragraph/list item/blockquote text in document order.
78+
- For `<li>` hints, use only the list item's own text and skip nested lists.
79+
- Normalize whitespace for robust matching.
80+
- Align hints to model paragraphs using a forward-only cursor, so repeated nearby text does not match earlier content out of order.
81+
- Use matching hints to set `ordered`, `quote`, or plain paragraph semantics.
82+
- Keep browser-model extraction as the source of content and use export HTML only for semantics the model does not expose reliably.
83+
- Fall back to existing behavior when export HTML is unavailable, while removing the hardcoded ordered-list allowlist.
84+
85+
This approach reuses already-present dependencies and standard HTML parsers. It avoids implementing CSS counter evaluation because the DOM already distinguishes `<ol>` from `<ul>` before marker rendering.
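
The forward-only alignment step can be sketched as below. The data shapes (`{ kind, text }` hints, `{ text }` paragraphs) and the function names are assumptions made for illustration; the real parsers operate on richer model records.

```javascript
// Collapse whitespace runs so export text and model text compare equal.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// hints: [{ kind, text }] in export HTML document order.
// paragraphs: [{ text }] in browser-model order.
function alignHints(paragraphs, hints) {
  let cursor = 0;
  return paragraphs.map((paragraph) => {
    const wanted = normalize(paragraph.text);
    for (let i = cursor; i < hints.length; i += 1) {
      if (normalize(hints[i].text) === wanted) {
        cursor = i + 1; // never match at or before this hint again
        return { ...paragraph, hint: hints[i].kind };
      }
    }
    return { ...paragraph, hint: null }; // no match: caller falls back
  });
}

const hints = [
  { kind: 'ol', text: 'Apple' },
  { kind: 'ol', text: 'Banana' },
  { kind: 'p', text: 'Continuation paragraph that is not a list item.' },
  { kind: 'ul', text: 'Red' },
  { kind: 'ul', text: 'Apple' },
];
const paragraphs = [
  { text: 'Apple' },
  { text: '  Banana ' },
  { text: 'Continuation paragraph that is\n not a list item.' },
  { text: 'Red' },
  { text: 'Apple' },
];

const aligned = alignHints(paragraphs, hints).map((p) => p.hint);
console.log(aligned); // [ 'ol', 'ol', 'p', 'ul', 'ul' ]
```

Because the cursor only moves forward, the second `Apple` picks up the later `ul` hint rather than re-matching the first `ol` item. In this simplified version a paragraph that matches a distant later hint skips everything in between, so a production implementation may want to bound the look-ahead.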
86+
87+
## Implemented Changes
88+
89+
- JavaScript:
90+
- `captureGoogleDocWithBrowser` fetches export HTML from the same public document page and passes it to `parseGoogleDocsModelChunks`.
91+
- `parseGoogleDocsModelChunks` applies export semantic hints before rendering.
92+
- HTML-format fetches are no longer globally entity-decoded.
93+
- Inherited true inline-style flags are ignored for bold/italic/strike.
94+
- v2 public doc coverage was added for API and browser live integration tests.
95+
96+
- Rust:
97+
- `fetch_google_doc_from_model` fetches public export HTML and applies semantic hints during model parsing.
98+
- `parse_model_chunks_with_export_html` was added for deterministic unit/integration tests.
99+
- HTML-format fetches are no longer globally entity-decoded.
100+
- Inherited true inline-style flags are ignored for bold/italic/strike.
101+
- v2 public doc coverage was added for API and browser live integration tests.
102+
103+
- Tests:
104+
- JS and Rust regression tests reproduce Section 15's ambiguous ordered list, continuation paragraph, unordered list, and inherited italic records.
105+
- Existing ordered-list and nested-list model tests now supply export HTML hints instead of relying on hardcoded ids.
106+
- Live integration tests include the v2 public document in both JS and Rust when `GDOCS_INTEGRATION` is enabled.
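
The markdown-level checks requested in the issue comments (ordered items numbered, continuation paragraph not quoted, control list still bulleted) could look roughly like this. The helper name `findIssue108Regressions` is hypothetical and the actual test files may assert differently.

```javascript
// Return a list of human-readable failures; an empty list means no regression.
function findIssue108Regressions(markdown) {
  const failures = [];
  for (const line of ['1. Apple', '2. Banana', '3. Cherry']) {
    if (!markdown.includes(line)) {
      failures.push(`missing ordered item: ${line}`);
    }
  }
  if (markdown.includes('> Continuation paragraph that is not a list item.')) {
    failures.push('continuation paragraph rendered as blockquote');
  }
  for (const line of ['- Red', '- Green', '- Blue']) {
    if (!markdown.includes(line)) {
      failures.push(`missing unordered item: ${line}`);
    }
  }
  return failures;
}

const fixedOutput = [
  '1. Apple',
  '2. Banana',
  '3. Cherry',
  '',
  'Continuation paragraph that is not a list item.',
  '',
  '- Red',
  '- Green',
  '- Blue',
].join('\n');

console.log(findIssue108Regressions(fixedOutput)); // []
```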
107+
108+
## Verification
109+
110+
Local checks were run with logs saved under `ci-logs/` and copied into this case-study `logs/` directory before finalizing.
111+
112+
- `npm test -- --runTestsByPath tests/unit/gdocs.test.js --runInBand`
113+
- `npm test -- --runTestsByPath tests/integration/gdocs-public-doc.test.js --runInBand`
114+
- `npm test -- --testPathIgnorePatterns="docker.test.js"`
115+
- `GDOCS_INTEGRATION=true BROWSER_ENGINE=puppeteer npm test -- --runTestsByPath tests/integration/gdocs-public-doc.test.js --runInBand --testNamePattern "issue #108 v2"`
116+
- `GDOCS_INTEGRATION=true npm test -- --testPathPattern="gdocs-public-doc" --testTimeout=120000`
117+
- `cargo test --test integration gdocs:: -- --nocapture`
118+
- `cargo test --test integration gdocs_public_doc:: -- --nocapture`
119+
- `GDOCS_INTEGRATION=1 cargo test --test integration issue_108_v2 -- --nocapture`
120+
- `GDOCS_INTEGRATION=1 cargo test --test integration gdocs_public_doc::live -- --nocapture`
121+
- `cargo test --all-features --verbose`
122+
- `npm run format:check`
123+
- `npm run lint`
124+
- `npm run check:duplication`
125+
- `node scripts/validate-changeset.mjs`
126+
- `cargo fmt --all -- --check`
127+
- `cargo clippy --all-targets --all-features -- -D warnings`
128+
129+
Local notes:
130+
131+
- `npm ci` completed with a Node engine warning because this machine has Node 20.20.2 and the package expects Node >=22 <23.
132+
- The first JS live browser attempt used the default Playwright engine and failed because the local Playwright browser binary is not installed. The integration helper now honors `GDOCS_BROWSER_ENGINE`/`BROWSER_ENGINE`; the v2 live browser test passed locally with `BROWSER_ENGINE=puppeteer`. CI can continue using its default configured engine.
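
Engine selection in the integration helper might be sketched as follows. The precedence between the two variables and the lowercase `playwright` default value are assumptions of this sketch; only the variable names and the Playwright default come from the notes above.

```javascript
// Hypothetical resolver: prefer the Google Docs specific variable, then the
// generic one, then fall back to the default engine.
function resolveBrowserEngine(env) {
  return env.GDOCS_BROWSER_ENGINE || env.BROWSER_ENGINE || 'playwright';
}

console.log(resolveBrowserEngine({ BROWSER_ENGINE: 'puppeteer' })); // "puppeteer"
console.log(resolveBrowserEngine({})); // "playwright"
```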
133+
134+
## Follow-Up Risks
135+
136+
- Export HTML is fetched for public browser captures. Private/authenticated Google Docs browser capture may not always have a public export URL available from the unauthenticated HTTP client. The current implementation logs and falls back if export HTML cannot be fetched.
137+
- Alignment is text-order based. It is intentionally conservative, but repeated identical paragraphs could still be ambiguous. The forward cursor makes this less likely than global text matching.
138+
- Future Google Docs model schema changes may expose better list marker metadata. If that happens, the semantic-hint layer can become a fallback instead of the primary ordered/unordered signal.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
[{"conclusion":"success","createdAt":"2026-04-27T06:17:26Z","databaseId":24979603264,"headSha":"4a745f662a4329eb753ca2fb097c5491769917da","status":"completed","workflowName":"Rust Checks and Release"},{"conclusion":"success","createdAt":"2026-04-27T06:17:26Z","databaseId":24979603248,"headSha":"4a745f662a4329eb753ca2fb097c5491769917da","status":"completed","workflowName":"JavaScript Checks and Release"}]

docs/case-studies/issue-108/data/code-search-DOCS_modelChunk.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
[{"url":"https://api.github.com/repos/link-assistant/web-capture/issues/comments/4324548439","html_url":"https://github.com/link-assistant/web-capture/issues/108#issuecomment-4324548439","issue_url":"https://api.github.com/repos/link-assistant/web-capture/issues/108","id":4324548439,"node_id":"IC_kwDOOlq2Js8AAAABAcNfVw","user":{"login":"konard","id":1431904,"node_id":"MDQ6VXNlcjE0MzE5MDQ=","avatar_url":"https://avatars.githubusercontent.com/u/1431904?v=4","gravatar_id":"","url":"https://api.github.com/users/konard","html_url":"https://github.com/konard","followers_url":"https://api.github.com/users/konard/followers","following_url":"https://api.github.com/users/konard/following{/other_user}","gists_url":"https://api.github.com/users/konard/gists{/gist_id}","starred_url":"https://api.github.com/users/konard/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/konard/subscriptions","organizations_url":"https://api.github.com/users/konard/orgs","repos_url":"https://api.github.com/users/konard/repos","events_url":"https://api.github.com/users/konard/events{/privacy}","received_events_url":"https://api.github.com/users/konard/received_events","type":"User","user_view_type":"public","site_admin":false},"created_at":"2026-04-27T06:09:15Z","updated_at":"2026-04-27T06:09:15Z","body":"**Follow-up: please add this reproducer doc to the CI matrix.**\n\nThe reproducer doc has been updated (renamed to **markdown-test-document-v2** and the trailing line moved to the actual end of the document — previously it sat between Section 14 and the new Section 15, which was a side-effect of how I added Section 15 incrementally):\n\n- **Doc URL:** https://docs.google.com/document/d/1Rvaod_u2wgkAUNdXG-e29yQem-P36KAQ/edit\n- **Sharing:** Anyone with the link, viewer (publicly accessible without auth)\n- **Final-line check:** the document now ends with the line \"End of Markdown Feature Test Document\" *after* Section 15, so any \"ends with the sentinel line\" assertion 
will see the new content as part of the doc body and not after the sentinel.\n\n### Suggested CI integration\n\nAdd this doc as a second `GDOCS_INTEGRATION` fixture alongside the existing `1f5zI2xOFpKa90v0GjamO_t7lqSdzMlaM`:\n\n- `js/tests/integration/gdocs-public-doc.test.js` — add a second test case with the v2 doc URL.\n- `rust/tests/integration/gdocs_public_doc.rs` — add a second `gdocs_public_doc::live` variant for the v2 doc.\n\nSuggested assertions to catch the regressions covered in this issue:\n\n1. **Bug 1 (numbered list → `<ul>`):** the markdown output must contain `1. Apple`, `2. Banana`, `3. Cherry` (not `- Apple` / `- Banana` / `- Cherry`). Asserting on the literal `1. Apple` substring is sufficient.\n2. **Bug 2 (continuation paragraph → `<blockquote>`):** the markdown output must NOT contain `> Continuation paragraph that is not a list item.` (with leading `> `). It should appear as a plain or indented paragraph instead.\n3. **Control case (`<ul>` stays `<ul>`):** Red/Green/Blue should remain bulleted (`- Red` / `- Green` / `- Blue`).\n\nBoth `--capture browser` AND `--capture api` should be tested; today only `--capture api` would pass on the v2 doc, and that's the right baseline.\n\nThe existing fixture (`1f5zI2xOFpKa90v0GjamO_t7lqSdzMlaM`) should remain in the matrix — it's still valuable as a backwards-compatibility check — but it's not sufficient on its own because, as noted in the issue body, the heuristics in `infer_ordered_list` were over-fitted to that fixture's exact content and list 
IDs.","author_association":"MEMBER","pin":null,"reactions":{"url":"https://api.github.com/repos/link-assistant/web-capture/issues/comments/4324548439/reactions","total_count":0,"+1":0,"-1":0,"laugh":0,"hooray":0,"confused":0,"heart":0,"rocket":0,"eyes":0},"performed_via_github_app":null},{"url":"https://api.github.com/repos/link-assistant/web-capture/issues/comments/4324577681","html_url":"https://github.com/link-assistant/web-capture/issues/108#issuecomment-4324577681","issue_url":"https://api.github.com/repos/link-assistant/web-capture/issues/108","id":4324577681,"node_id":"IC_kwDOOlq2Js8AAAABAcPRkQ","user":{"login":"konard","id":1431904,"node_id":"MDQ6VXNlcjE0MzE5MDQ=","avatar_url":"https://avatars.githubusercontent.com/u/1431904?v=4","gravatar_id":"","url":"https://api.github.com/users/konard","html_url":"https://github.com/konard","followers_url":"https://api.github.com/users/konard/followers","following_url":"https://api.github.com/users/konard/following{/other_user}","gists_url":"https://api.github.com/users/konard/gists{/gist_id}","starred_url":"https://api.github.com/users/konard/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/konard/subscriptions","organizations_url":"https://api.github.com/users/konard/orgs","repos_url":"https://api.github.com/users/konard/repos","events_url":"https://api.github.com/users/konard/events{/privacy}","received_events_url":"https://api.github.com/users/konard/received_events","type":"User","user_view_type":"public","site_admin":false},"created_at":"2026-04-27T06:16:28Z","updated_at":"2026-04-27T06:16:28Z","body":"We need to download all logs and data related about the issue to this repository, make sure we compile that data to `./docs/case-studies/issue-{id}` folder, and use it to do deep case study analysis (also make sure to search online for additional facts and data), in which we will reconstruct timeline/sequence of events, list of each and all requirements from the issue, find root causes of 
the each problem, and propose possible solutions and solution plans for each requirement (we should also check known existing components/libraries, that solve similar problem or can help in solutions).\n\nIf there is not enough data to find actual root cause, add debug output and verbose mode if not present, that will allow us to find root cause on next iteration.\n\nIf issue related to any other repository/project, where we can report issues on GitHub, please do so. Each issue must contain reproducible examples, workarounds and suggestions for fix the issue in code.","author_association":"MEMBER","pin":null,"reactions":{"url":"https://api.github.com/repos/link-assistant/web-capture/issues/comments/4324577681/reactions","total_count":0,"+1":0,"-1":0,"laugh":0,"hooray":0,"confused":0,"heart":0,"rocket":0,"eyes":0},"performed_via_github_app":null}]
