Skip to content

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639

Draft
ochafik wants to merge 4 commits intomainfrom
feat/pdf-lazy-form-extraction
Draft

perf(pdf-server): lazy form extraction via range transport + incremental viewer scans#639
ochafik wants to merge 4 commits intomainfrom
feat/pdf-lazy-form-extraction

Conversation

@ochafik
Copy link
Copy Markdown
Contributor

@ochafik ochafik commented Apr 24, 2026

Summary

display_pdf previously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental.

Server (server.ts):

  • New PdfCacheRangeTransport lets getDocument() fetch only the byte ranges it needs
  • display_pdf calls getFieldObjects() first; if empty, skips the per-page form/annotation walk entirely → form-free PDFs are probed with ~10–25% of bytes instead of 100%
  • extractFormSchema now handles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-empty type instead of always [0]
  • Dead viewFieldInfo Map removed

Viewer (mcp-app.ts):

  • disableAutoFetch / disableStream on getDocument so pdfjs doesn't background-prefetch the whole file
  • Baseline annotation scan and field-name map run lazily per rendered page instead of walking all pages after load

Regression e2e (tests/e2e/pdf-incremental-load.spec.ts + tests/helpers/range-counting-server.ts):

  • Self-signed HTTPS fixture serves two pdf-lib-generated PDFs (form-free 20pg ~520KB; 5-field form ~3KB), records every range request, supports stallAfterBytes
  • Asserts: form fields returned in initial response, <30% bytes on form-free display_pdf, page 1 renders while later ranges stalled, byte-range overlap stays bounded

Measurements

Before After
display_pdf form-free PDF (520KB) 100% downloaded 24.2% (3 range requests)
display_pdf W-9 (form, 140KB) p50 latency, stateless 0.555s 0.231s (2.4×)
Viewer first paint blocked on full file yes no — page 1 renders at ~48% served

Out of scope

  • startPreloading() (search-index builder) still walks all pages after first paint — separate UX tradeoff
  • Server↔viewer xref overlap (~25%) — would need server to pass fetched bytes as initialData

Test plan

  • 183 unit tests pass (4 new field-tree-handling tests covering pdf-lib, multi-widget, W-9/XFA, no-form)
  • tsc --noEmit clean
  • Server-side validated via direct MCP + range-counting fixture
  • First-page-under-stall validated via headless browser
  • CI runs the new e2e spec

ochafik added 2 commits April 24, 2026 17:15
…tal viewer scans

Server: display_pdf now opens the document via PDFDataRangeTransport
(disableAutoFetch) and only runs the per-page form/annotation walk when
getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25%
of bytes instead of a full download. The unused viewFieldInfo Map is
removed.

Viewer: getDocument sets disableAutoFetch/disableStream; baseline
annotation scan and field-name mapping run lazily per rendered page
instead of walking every page after load, so first paint no longer
schedules a full-file pull.

E2E: new range-counting HTTPS fixture (W-9 for forms, generated
text+image PDF for no-forms) with stallAfterBytes control, and four
regression tests asserting form fields are returned, <30% served on
no-forms display_pdf, first page renders while later ranges are stalled,
and overlap stays bounded.
…hema

pdfjs getFieldObjects() returns the full field-tree array. For PDFs with
a separated structure (pdf-lib, some authoring tools) the typed widget
sits at fields[1+] behind a typeless container at fields[0]; the previous
code only inspected fields[0] and skipped them all. Pick the first entry
with a non-empty type instead.

Makes the e2e forms.pdf fixture fully generated (no checked-in third-party
asset on the hot path); fw9.pdf stays as a unit-test fixture for the
hierarchical/XFA case.
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Apr 24, 2026

Open in StackBlitz

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/@modelcontextprotocol/ext-apps@639

@modelcontextprotocol/server-basic-preact

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-preact@639

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-react@639

@modelcontextprotocol/server-basic-solid

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-solid@639

@modelcontextprotocol/server-basic-svelte

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-svelte@639

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vanillajs@639

@modelcontextprotocol/server-basic-vue

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vue@639

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/@modelcontextprotocol/server-budget-allocator@639

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/@modelcontextprotocol/server-cohort-heatmap@639

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/@modelcontextprotocol/server-customer-segmentation@639

@modelcontextprotocol/server-debug

npm i https://pkg.pr.new/@modelcontextprotocol/server-debug@639

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/@modelcontextprotocol/server-map@639

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/@modelcontextprotocol/server-pdf@639

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/@modelcontextprotocol/server-scenario-modeler@639

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/@modelcontextprotocol/server-shadertoy@639

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/@modelcontextprotocol/server-sheet-music@639

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/@modelcontextprotocol/server-system-monitor@639

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/@modelcontextprotocol/server-threejs@639

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/@modelcontextprotocol/server-transcript@639

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/@modelcontextprotocol/server-video-resource@639

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/@modelcontextprotocol/server-wiki-explorer@639

commit: ce4600f

ochafik added 2 commits April 24, 2026 19:43
…e loss

PdfCacheRangeTransport:
- abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs
  calls *on* the transport, not an upstream error channel). Expose a
  .failed promise that rejects on the first fetch error and race every
  pdfjs await against it in display_pdf, so transient network errors
  surface into the existing catch instead of hanging the tool call.
- pdfjs coalesces adjacent missing chunks into one unbounded
  requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES.
  Loop and deliver in slices so every requested chunk is marked loaded.

Viewer (mcp-app.ts):
- The lazy per-page baseline scan left pdfBaselineAnnotations incomplete,
  so persistAnnotations and getAnnotatedPdfBytes silently dropped
  restoredRemovedIds tombstones for unvisited pages. Union those ids into
  the computed diff and removedRefs.

Test fixture: release stalled handlers before resetStats/close so a
failing stalled test doesn't hang afterAll; fail fast if started with
NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full
per-process scoping needs a validateUrl localhost allow, tracked
separately).
…opback HTTP

PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the
original begin and removed after one delivery, so accumulate slices and
call onDataRange once with the full buffer (the previous multi-call
approach threw inside pdfjs). Covered by a new integration test that
drives getDocument()/getPage(1) on a >1MB PDF through a clamping
in-memory readPdfRange.

validateUrl: allow http://127.0.0.1|localhost|[::1] only when
PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made
to probe its own ports. Covered by env-on/off unit tests.

Fixture switched to plain HTTP (no openssl, no
NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two
e2e tests: page 2 renders after stall release (>512KB object path), and
display_pdf returns gracefully on mid-load 500. Existing <30% test now
samples stats before the iframe loads.

New unit tests: display_pdf returns (not hangs) on mid-load fetch
failure via in-memory MCP client; computeDiff/serializeDiff contract
tests pinning the restoredRemovedIds tombstone-preservation behaviour.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant