perf(pdf-server): lazy form extraction via range transport + incremental viewer scans by ochafik · Pull Request #639 · modelcontextprotocol/ext-apps

ochafik · 2026-04-24T18:26:29Z

Summary

display_pdf previously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental.

Server (server.ts):

New PdfCacheRangeTransport lets getDocument() fetch only the byte ranges it needs
display_pdf calls getFieldObjects() first; if empty, skips the per-page form/annotation walk entirely → form-free PDFs are probed with ~10–25% of bytes instead of 100%
extractFormSchema now handles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-empty type instead of always [0]
Dead viewFieldInfo Map removed
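
The getFieldObjects()-first gate described above might look roughly like the sketch below. It is not the PR's code: DocLike, PageLike, AnnotationLike and collectWidgets are hypothetical names, declared locally so the snippet compiles without pdfjs-dist. They mirror only the pdf.js members used here (getFieldObjects, numPages, getPage, getAnnotations).

```typescript
// Structural stand-ins for the pdf.js proxies; only the members used below.
interface AnnotationLike {
  subtype?: string;
  fieldName?: string;
}
interface PageLike {
  getAnnotations(): Promise<AnnotationLike[]>;
}
interface DocLike {
  numPages: number;
  getFieldObjects(): Promise<Record<string, unknown[]> | null>;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical helper: returns widget annotations, touching pages only when
// the document-level field tree is non-empty.
async function collectWidgets(doc: DocLike): Promise<AnnotationLike[]> {
  const fields = await doc.getFieldObjects();
  if (!fields || Object.keys(fields).length === 0) {
    return []; // form-free PDF: no per-page walk, so no page data is requested
  }
  const widgets: AnnotationLike[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const annotations = await page.getAnnotations();
    widgets.push(...annotations.filter((a) => a.subtype === "Widget"));
  }
  return widgets;
}
```

With a range transport behind the document, the early return is what keeps the page objects from being fetched at all for form-free files.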

Viewer (mcp-app.ts):

disableAutoFetch / disableStream on getDocument so pdfjs doesn't background-prefetch the whole file
Baseline annotation scan and field-name map run lazily per rendered page instead of walking all pages after load
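
A minimal sketch of the lazy per-page scan idea, assuming the viewer calls some hook when a page is rendered. LazyPageScans and ensureScanned are hypothetical names, not identifiers from mcp-app.ts; the point is only that each page is scanned at most once, on demand, instead of in an eager loop over all pages.

```typescript
// Memoizes one scan per page number; nothing runs until a page is requested.
class LazyPageScans<T> {
  private readonly results = new Map<number, Promise<T>>();
  private readonly scanPage: (pageNumber: number) => Promise<T>;

  constructor(scanPage: (pageNumber: number) => Promise<T>) {
    this.scanPage = scanPage;
  }

  // Call from the render path; concurrent callers share the same promise.
  ensureScanned(pageNumber: number): Promise<T> {
    let pending = this.results.get(pageNumber);
    if (!pending) {
      pending = this.scanPage(pageNumber);
      this.results.set(pageNumber, pending);
    }
    return pending;
  }

  // Pages scanned so far, in ascending order. Unvisited pages are absent,
  // so any consumer of the baseline must tolerate an incomplete map.
  scannedPages(): number[] {
    return Array.from(this.results.keys()).sort((a, b) => a - b);
  }
}
```

Note that this sketch caches a rejected scan as well; a real implementation would probably evict failed entries so a later render can retry.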

Regression e2e (tests/e2e/pdf-incremental-load.spec.ts + tests/helpers/range-counting-server.ts):

Self-signed HTTPS fixture serves two pdf-lib-generated PDFs (form-free 20pg ~520KB; 5-field form ~3KB), records every range request, supports stallAfterBytes
Asserts: form fields returned in initial response, <30% bytes on form-free display_pdf, page 1 renders while later ranges stalled, byte-range overlap stays bounded
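
For the byte-fraction and overlap assertions, the fixture's recorded ranges have to be reduced to a few numbers. The sketch below shows one way to do that; summarizeRanges and the half-open [begin, end) representation are assumptions for illustration (HTTP Range headers use an inclusive end), and the real helper in tests/helpers/range-counting-server.ts may compute these differently.

```typescript
interface ByteRange {
  begin: number; // inclusive
  end: number;   // exclusive
}

interface RangeStats {
  requestedBytes: number; // sum of all range lengths, double-counting overlaps
  uniqueBytes: number;    // size of the union of all ranges
  overlapBytes: number;   // requestedBytes - uniqueBytes
  fractionServed: number; // uniqueBytes / fileSize
}

function summarizeRanges(ranges: ByteRange[], fileSize: number): RangeStats {
  const sorted = [...ranges].sort((a, b) => a.begin - b.begin);
  let requestedBytes = 0;
  let uniqueBytes = 0;
  let coveredUntil = 0; // every byte below this offset is already counted
  for (const { begin, end } of sorted) {
    requestedBytes += end - begin;
    const start = Math.max(begin, coveredUntil);
    if (end > start) {
      uniqueBytes += end - start;
      coveredUntil = end;
    }
  }
  return {
    requestedBytes,
    uniqueBytes,
    overlapBytes: requestedBytes - uniqueBytes,
    fractionServed: uniqueBytes / fileSize,
  };
}
```

A test in the spirit of the "<30% bytes" assertion would then check something like summarizeRanges(recorded, fileSize).fractionServed < 0.3.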

Measurements

| | Before | After |
|---|---|---|
| `display_pdf` form-free PDF (520KB) | 100% downloaded | 24.2% (3 range requests) |
| `display_pdf` W-9 (form, 140KB) p50 latency, stateless | 0.555s | 0.231s (2.4×) |
| Viewer first paint blocked on full file | yes | no; page 1 renders at ~48% served |

Out of scope

startPreloading() (search-index builder) still walks all pages after first paint — separate UX tradeoff
Server↔viewer xref overlap (~25%) — would need server to pass fetched bytes as initialData

Test plan

183 unit tests pass (4 new field-tree-handling tests covering pdf-lib, multi-widget, W-9/XFA, no-form)
tsc --noEmit clean
Server-side validated via direct MCP + range-counting fixture
First-page-under-stall validated via headless browser
CI runs the new e2e spec

…tal viewer scans

Server: display_pdf now opens the document via PDFDataRangeTransport (disableAutoFetch) and only runs the per-page form/annotation walk when getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25% of bytes instead of a full download. The unused viewFieldInfo Map is removed.

Viewer: getDocument sets disableAutoFetch/disableStream; baseline annotation scan and field-name mapping run lazily per rendered page instead of walking every page after load, so first paint no longer schedules a full-file pull.

E2E: new range-counting HTTPS fixture (W-9 for forms, generated text+image PDF for no-forms) with stallAfterBytes control, and four regression tests asserting form fields are returned, <30% served on no-forms display_pdf, first page renders while later ranges are stalled, and overlap stays bounded.
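
The two viewer flags named above are documented pdf.js getDocument() parameters; the pdf.js docs note that disabling pre-fetching only works correctly when streaming is disabled too, which is presumably why both are set. A minimal sketch of the option object follows. IncrementalLoadParams and incrementalLoadParams are hypothetical names, and passing a plain url is an assumption; the actual viewer may supply its data source differently.

```typescript
// Local subset of pdf.js DocumentInitParameters, declared here so the sketch
// compiles without pdfjs-dist installed.
interface IncrementalLoadParams {
  url: string;
  disableAutoFetch: boolean;
  disableStream: boolean;
}

function incrementalLoadParams(url: string): IncrementalLoadParams {
  return {
    url,
    disableAutoFetch: true, // do not keep fetching data the current page does not need
    disableStream: true, // do not stream the file body; rely on range requests only
  };
}

// In the viewer this would be used along the lines of:
//   const doc = await pdfjsLib.getDocument(incrementalLoadParams(url)).promise;
```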

…hema

pdfjs getFieldObjects() returns the full field-tree array. For PDFs with a separated structure (pdf-lib, some authoring tools) the typed widget sits at fields[1+] behind a typeless container at fields[0]; the previous code only inspected fields[0] and skipped them all. Pick the first entry with a non-empty type instead.

Makes the e2e forms.pdf fixture fully generated (no checked-in third-party asset on the hot path); fw9.pdf stays as a unit-test fixture for the hierarchical/XFA case.
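
The selection rule described in this commit can be sketched as below. FieldEntry and pickTypedEntry are hypothetical names, and the entry shape is reduced to the fields relevant here; the real extractFormSchema presumably does more than this.

```typescript
// Reduced shape of one entry in a getFieldObjects() array.
interface FieldEntry {
  type?: string;
  name?: string;
  value?: unknown;
}

// Returns the first entry that carries a non-empty type, or undefined when
// the array holds only typeless containers.
function pickTypedEntry(entries: FieldEntry[]): FieldEntry | undefined {
  return entries.find((e) => typeof e.type === "string" && e.type.length > 0);
}

// Separated tree (container first, typed widget second), as described above.
const separated: FieldEntry[] = [{ name: "email" }, { name: "email", type: "text" }];
// Flat tree where the typed entry is already first.
const flat: FieldEntry[] = [{ name: "agree", type: "checkbox" }];
```

With the old "always [0]" rule, the separated array would yield the typeless container and the field would be skipped; with pickTypedEntry both arrays yield a typed entry.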

pkg-pr-new · 2026-04-24T18:29:41Z

@modelcontextprotocol/ext-apps

npm i https://pkg.pr.new/@modelcontextprotocol/ext-apps@639

@modelcontextprotocol/server-basic-preact

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-preact@639

@modelcontextprotocol/server-basic-react

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-react@639

@modelcontextprotocol/server-basic-solid

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-solid@639

@modelcontextprotocol/server-basic-svelte

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-svelte@639

@modelcontextprotocol/server-basic-vanillajs

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vanillajs@639

@modelcontextprotocol/server-basic-vue

npm i https://pkg.pr.new/@modelcontextprotocol/server-basic-vue@639

@modelcontextprotocol/server-budget-allocator

npm i https://pkg.pr.new/@modelcontextprotocol/server-budget-allocator@639

@modelcontextprotocol/server-cohort-heatmap

npm i https://pkg.pr.new/@modelcontextprotocol/server-cohort-heatmap@639

@modelcontextprotocol/server-customer-segmentation

npm i https://pkg.pr.new/@modelcontextprotocol/server-customer-segmentation@639

@modelcontextprotocol/server-debug

npm i https://pkg.pr.new/@modelcontextprotocol/server-debug@639

@modelcontextprotocol/server-map

npm i https://pkg.pr.new/@modelcontextprotocol/server-map@639

@modelcontextprotocol/server-pdf

npm i https://pkg.pr.new/@modelcontextprotocol/server-pdf@639

@modelcontextprotocol/server-scenario-modeler

npm i https://pkg.pr.new/@modelcontextprotocol/server-scenario-modeler@639

@modelcontextprotocol/server-shadertoy

npm i https://pkg.pr.new/@modelcontextprotocol/server-shadertoy@639

@modelcontextprotocol/server-sheet-music

npm i https://pkg.pr.new/@modelcontextprotocol/server-sheet-music@639

@modelcontextprotocol/server-system-monitor

npm i https://pkg.pr.new/@modelcontextprotocol/server-system-monitor@639

@modelcontextprotocol/server-threejs

npm i https://pkg.pr.new/@modelcontextprotocol/server-threejs@639

@modelcontextprotocol/server-transcript

npm i https://pkg.pr.new/@modelcontextprotocol/server-transcript@639

@modelcontextprotocol/server-video-resource

npm i https://pkg.pr.new/@modelcontextprotocol/server-video-resource@639

@modelcontextprotocol/server-wiki-explorer

npm i https://pkg.pr.new/@modelcontextprotocol/server-wiki-explorer@639

commit: ce4600f

…e loss

PdfCacheRangeTransport:
- abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs calls *on* the transport, not an upstream error channel). Expose a .failed promise that rejects on the first fetch error and race every pdfjs await against it in display_pdf, so transient network errors surface into the existing catch instead of hanging the tool call.
- pdfjs coalesces adjacent missing chunks into one unbounded requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES. Loop and deliver in slices so every requested chunk is marked loaded.

Viewer (mcp-app.ts):
- The lazy per-page baseline scan left pdfBaselineAnnotations incomplete, so persistAnnotations and getAnnotatedPdfBytes silently dropped restoredRemovedIds tombstones for unvisited pages. Union those ids into the computed diff and removedRefs.

Test fixture: release stalled handlers before resetStats/close so a failing stalled test doesn't hang afterAll; fail fast if started with NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full per-process scoping needs a validateUrl localhost allow, tracked separately).
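
The .failed race described above can be sketched as follows. FailureSignal and raceFailure are hypothetical names, not the PR's; the sketch only shows how a promise that never resolves on its own can be made to reject when a separate error channel fires.

```typescript
// Holds a promise that stays pending until fail() is called, then rejects.
class FailureSignal {
  readonly failed: Promise<never>;
  private rejectFailed!: (err: Error) => void;
  private hasFailed = false;

  constructor() {
    this.failed = new Promise<never>((_resolve, reject) => {
      this.rejectFailed = reject;
    });
    // Swallow the rejection on this side branch so a failure that nobody is
    // currently racing against does not surface as an unhandled rejection.
    this.failed.catch(() => {});
  }

  // Only the first error is reported; later calls are ignored.
  fail(err: Error): void {
    if (!this.hasFailed) {
      this.hasFailed = true;
      this.rejectFailed(err);
    }
  }
}

// Settles with `work`, unless `failed` rejects first. Behaves like
// Promise.race([work, failed]) but keeps the return type as Promise<T>.
function raceFailure<T>(work: Promise<T>, failed: Promise<never>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    work.then(resolve, reject);
    failed.catch(reject);
  });
}
```

In display_pdf terms, each awaited pdfjs call would be wrapped, for example await raceFailure(loadingTask.promise, transport.failed), so a fetch error lands in the surrounding catch; loadingTask and transport here are illustrative names.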

…opback HTTP

PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the original begin and removed after one delivery, so accumulate slices and call onDataRange once with the full buffer (the previous multi-call approach threw inside pdfjs). Covered by a new integration test that drives getDocument()/getPage(1) on a >1MB PDF through a clamping in-memory readPdfRange.

validateUrl: allow http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made to probe its own ports. Covered by env-on/off unit tests.

Fixture switched to plain HTTP (no openssl, no NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two e2e tests: page 2 renders after stall release (>512KB object path), and display_pdf returns gracefully on mid-load 500. Existing <30% test now samples stats before the iframe loads.

New unit tests: display_pdf returns (not hangs) on mid-load fetch failure via in-memory MCP client; computeDiff/serializeDiff contract tests pinning the restoredRemovedIds tombstone-preservation behaviour.
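
The accumulate-then-deliver approach from the first paragraph of this commit can be sketched as below. ReadRange and readFullRange are hypothetical names standing in for the clamping readPdfRange and the body of deliver(); the sketch assumes a reader that may return fewer bytes than requested but never zero for a non-empty range.

```typescript
// A reader that may return fewer bytes than asked for (e.g. clamped to a
// maximum chunk size). Ranges are half-open: [begin, end).
type ReadRange = (begin: number, end: number) => Promise<Uint8Array>;

// Loops over the clamped reader until the whole [begin, end) span is filled,
// then returns a single buffer suitable for one onDataRange(begin, buffer) call.
async function readFullRange(
  begin: number,
  end: number,
  readClamped: ReadRange,
): Promise<Uint8Array> {
  const out = new Uint8Array(end - begin);
  let offset = begin;
  while (offset < end) {
    const slice = await readClamped(offset, end);
    if (slice.length === 0) {
      throw new Error(`empty read at offset ${offset}`); // avoid spinning forever
    }
    const usable = Math.min(slice.length, end - offset);
    out.set(slice.subarray(0, usable), offset - begin);
    offset += usable;
  }
  return out;
}
```

The transport would then hand the result to pdf.js exactly once per request, which matches the one-delivery-per-begin behaviour the commit message describes.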

ochafik added 2 commits April 24, 2026 17:15

ochafik added 2 commits April 24, 2026 19:43

ochafik wants to merge 4 commits into main from feat/pdf-lazy-form-extraction
