perf(pdf-server): lazy form extraction via range transport + incremental viewer scans #639
Draft
…tal viewer scans

Server: display_pdf now opens the document via PDFDataRangeTransport (disableAutoFetch) and only runs the per-page form/annotation walk when getFieldObjects() is non-empty, so form-free PDFs are probed with ~10-25% of bytes instead of a full download. The unused viewFieldInfo Map is removed.

Viewer: getDocument sets disableAutoFetch/disableStream; the baseline annotation scan and field-name mapping run lazily per rendered page instead of walking every page after load, so first paint no longer schedules a full-file pull.

E2E: new range-counting HTTPS fixture (W-9 for forms, generated text+image PDF for no-forms) with stallAfterBytes control, and four regression tests asserting that form fields are returned, <30% of bytes are served on no-forms display_pdf, the first page renders while later ranges are stalled, and overlap stays bounded.
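The lazy per-page scan described for the viewer can be pictured roughly as follows. This is a minimal sketch, not the code in mcp-app.ts: `LazyPageScanner`, `onPageRendered`, and the synchronous `scanPage` callback are hypothetical names chosen for illustration (a real annotation scan against pdfjs would be asynchronous).

```typescript
// Hypothetical helper: runs a per-page scan at most once, and only for
// pages that have actually been rendered.
class LazyPageScanner<T> {
  private readonly results = new Map<number, T>();
  private readonly scanPage: (pageNumber: number) => T;

  constructor(scanPage: (pageNumber: number) => T) {
    this.scanPage = scanPage;
  }

  // Call from the render path; unvisited pages are never scanned.
  onPageRendered(pageNumber: number): T {
    if (this.results.has(pageNumber)) {
      return this.results.get(pageNumber) as T;
    }
    const result = this.scanPage(pageNumber);
    this.results.set(pageNumber, result);
    return result;
  }

  // Pages scanned so far, in ascending order.
  scannedPages(): number[] {
    return Array.from(this.results.keys()).sort((a, b) => a - b);
  }
}
```

Because only rendered pages are scanned, any state derived from the scan is incomplete until every page has been visited, which callers need to account for.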
…hema

pdfjs getFieldObjects() returns the full field-tree array. For PDFs with a separated structure (pdf-lib, some authoring tools) the typed widget sits at fields[1+] behind a typeless container at fields[0]; the previous code only inspected fields[0] and skipped them all. Pick the first entry with a non-empty type instead.

Makes the e2e forms.pdf fixture fully generated (no checked-in third-party asset on the hot path); fw9.pdf stays as a unit-test fixture for the hierarchical/XFA case.
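A minimal sketch of the "first entry with a non-empty type" selection, assuming each entry in a field's array is an object with an optional string `type`. The `FieldEntry` interface and `pickTypedEntry` name are illustrative and not taken from the PR.

```typescript
// Assumed shape of one entry in a getFieldObjects() array (illustrative only).
interface FieldEntry {
  id?: string;
  type?: string;
}

// Return the first entry that carries a non-empty type, skipping typeless
// containers that may sit at index 0 in separated field/widget trees.
function pickTypedEntry(entries: readonly FieldEntry[]): FieldEntry | undefined {
  return entries.find((e) => typeof e.type === "string" && e.type.length > 0);
}

// Separated structure: typeless container first, typed widget second.
const separated: FieldEntry[] = [{ id: "container" }, { id: "widget", type: "text" }];
const picked = pickTypedEntry(separated);
```

With the older `[0]`-only inspection, `separated` would have yielded the typeless container and the field would have been skipped.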
@modelcontextprotocol/ext-apps
@modelcontextprotocol/server-basic-preact
@modelcontextprotocol/server-basic-react
@modelcontextprotocol/server-basic-solid
@modelcontextprotocol/server-basic-svelte
@modelcontextprotocol/server-basic-vanillajs
@modelcontextprotocol/server-basic-vue
@modelcontextprotocol/server-budget-allocator
@modelcontextprotocol/server-cohort-heatmap
@modelcontextprotocol/server-customer-segmentation
@modelcontextprotocol/server-debug
@modelcontextprotocol/server-map
@modelcontextprotocol/server-pdf
@modelcontextprotocol/server-scenario-modeler
@modelcontextprotocol/server-shadertoy
@modelcontextprotocol/server-sheet-music
@modelcontextprotocol/server-system-monitor
@modelcontextprotocol/server-threejs
@modelcontextprotocol/server-transcript
@modelcontextprotocol/server-video-resource
@modelcontextprotocol/server-wiki-explorer
…e loss

PdfCacheRangeTransport:
- abort() is a no-op stub on PDFDataRangeTransport (it's the hook pdfjs calls *on* the transport, not an upstream error channel). Expose a .failed promise that rejects on the first fetch error and race every pdfjs await against it in display_pdf, so transient network errors surface into the existing catch instead of hanging the tool call.
- pdfjs coalesces adjacent missing chunks into one unbounded requestDataRange; readPdfRange clamps each call to MAX_CHUNK_BYTES. Loop and deliver in slices so every requested chunk is marked loaded.

Viewer (mcp-app.ts):
- The lazy per-page baseline scan left pdfBaselineAnnotations incomplete, so persistAnnotations and getAnnotatedPdfBytes silently dropped restoredRemovedIds tombstones for unvisited pages. Union those ids into the computed diff and removedRefs.

Test fixture: release stalled handlers before resetStats/close so a failing stalled test doesn't hang afterAll; fail fast if started with NODE_ENV=production. NODE_TLS_REJECT_UNAUTHORIZED scope documented (full per-process scoping needs a validateUrl localhost allow, tracked separately).
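The tombstone fix in the viewer amounts to a set union. The sketch below assumes removed annotations are tracked as string ids; `mergeTombstones` is a hypothetical helper, not a function from mcp-app.ts, and the real diff computation is not shown.

```typescript
// Hypothetical helper: union previously restored "removed" ids into the
// removals computed from the (possibly incomplete) baseline scan, so that
// tombstones for unvisited pages are not dropped.
function mergeTombstones(
  computedRemoved: readonly string[],
  restoredRemovedIds: readonly string[],
): string[] {
  const merged = new Set<string>(computedRemoved);
  restoredRemovedIds.forEach((id) => merged.add(id));
  return Array.from(merged);
}

// "b" appears in both lists and is kept once; "c" comes from a page the
// lazy scan never visited and survives only because of the union.
const removed = mergeTombstones(["a", "b"], ["b", "c"]);
```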
…opback HTTP

PdfCacheRangeTransport.deliver(): pdf.js's reader is keyed by the original begin and removed after one delivery, so accumulate slices and call onDataRange once with the full buffer (the previous multi-call approach threw inside pdfjs). Covered by a new integration test that drives getDocument()/getPage(1) on a >1MB PDF through a clamping in-memory readPdfRange.

validateUrl: allow http://127.0.0.1|localhost|[::1] only when PDF_SERVER_ALLOW_LOOPBACK_HTTP is set, so a remote deploy can't be made to probe its own ports. Covered by env-on/off unit tests.

Fixture switched to plain HTTP (no openssl, no NODE_TLS_REJECT_UNAUTHORIZED). Adds /error.pdf (500s after 50KB) and two e2e tests: page 2 renders after stall release (>512KB object path), and display_pdf returns gracefully on mid-load 500. Existing <30% test now samples stats before the iframe loads.

New unit tests: display_pdf returns (not hangs) on mid-load fetch failure via in-memory MCP client; computeDiff/serializeDiff contract tests pinning the restoredRemovedIds tombstone-preservation behaviour.
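The accumulate-then-deliver-once approach can be sketched as below. It is a simplified illustration under stated assumptions: `planSlices`, `concatChunks`, and `deliverRange` are hypothetical names, the reader callback is synchronous (the real readPdfRange presumably is not), and `end` is treated as exclusive. The `onDataRange(begin, chunk)` callback mirrors the method of that name on pdfjs's PDFDataRangeTransport.

```typescript
// Split [begin, end) into sub-ranges no larger than maxChunkBytes.
function planSlices(begin: number, end: number, maxChunkBytes: number): Array<[number, number]> {
  if (maxChunkBytes <= 0) throw new RangeError("maxChunkBytes must be positive");
  const slices: Array<[number, number]> = [];
  for (let start = begin; start < end; start += maxChunkBytes) {
    slices.push([start, Math.min(start + maxChunkBytes, end)]);
  }
  return slices;
}

// Concatenate the fetched slices into one contiguous buffer.
function concatChunks(chunks: readonly Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  chunks.forEach((c) => {
    out.set(c, offset);
    offset += c.byteLength;
  });
  return out;
}

// Read every clamped slice, then hand the whole range over in a single call
// keyed by the original begin offset.
function deliverRange(
  begin: number,
  end: number,
  maxChunkBytes: number,
  readRange: (start: number, end: number) => Uint8Array, // assumption: sync reader
  onDataRange: (begin: number, chunk: Uint8Array) => void,
): void {
  const chunks = planSlices(begin, end, maxChunkBytes).map(([s, e]) => readRange(s, e));
  onDataRange(begin, concatChunks(chunks));
}
```

Delivering once matters here because, as the commit message notes, the reader is removed after its first delivery, so a second call for the same request has nothing to attach to.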
Summary

`display_pdf` previously downloaded and parsed the entire PDF server-side to extract form metadata, defeating the viewer's chunked streaming. This PR makes both server and viewer truly incremental.

Server (`server.ts`):
- `PdfCacheRangeTransport` lets `getDocument()` fetch only the byte ranges it needs
- `display_pdf` calls `getFieldObjects()` first; if empty, skips the per-page form/annotation walk entirely → form-free PDFs are probed with ~10–25% of bytes instead of 100%
- `extractFormSchema` now handles separated field/widget trees (pdf-lib, some authoring tools) by picking the first array entry with a non-empty `type` instead of always `[0]`
- `viewFieldInfo` Map removed

Viewer (`mcp-app.ts`):
- `disableAutoFetch`/`disableStream` on `getDocument` so pdfjs doesn't background-prefetch the whole file

Regression e2e (`tests/e2e/pdf-incremental-load.spec.ts` + `tests/helpers/range-counting-server.ts`):
- Range-counting fixture server with `stallAfterBytes` control
- Four regression tests: form fields are returned, <30% of bytes served on no-forms `display_pdf`, page 1 renders while later ranges are stalled, byte-range overlap stays bounded

Measurements
- `display_pdf` form-free PDF (520KB)
- `display_pdf` W-9 (form, 140KB) p50 latency, stateless

Out of scope
- `startPreloading()` (search-index builder) still walks all pages after first paint; that is a separate UX tradeoff
- `initialData`

Test plan
- `tsc --noEmit` clean