Upsizing PDF pages to match `nemotron-parse` training data #1271

jamesbraza · 2026-01-22T05:29:28Z

On DOI 10.1016/j.neuron.2011.12.023 "Encoding of luminance and contrast by linear and nonlinear synapses in the retina"

Without this PR, at a temperature of 0, we fail to get a valid bounding box (e.g. the negative xmin below, on page 11) and the parse blows up with:

paperqa_nemotron.api.NemotronBBoxError: nemotron-parse response [[{"bbox": {"xmin": -0.0008105461393596986, "ymin": 0.0484, "xmax": 0.8993305712492153, "ymax": 0.0953}, "text": "**Neuron** # Encoding Luminance and Contrast in Retina Synapses", "type": "Page-header"}, {"bbox": {"xmin": 0.1164012554927809, "ymin": 0.8758, "xmax": 0.49627922159447585, "ymax": 0.9187}, "text": "was measured at contrasts varying between 10% and 100% (5 Hz square wave; Figure S6A). Each stimulus was applied from a steady background, which was varied over 4 log units,", "type": "Text"}, ... ]] has invalid bounding box.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

...

tenacity.RetryError: RetryError[<Future at 0x1469df950 state=finished raised NemotronBBoxError>]

This PR changes the nemotron-parse driver to resize each page image to 2048-px height x 1648-px width, matching nemotron-parse's training regime. After this PR, we can successfully read in the PDF.
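As a rough illustration of the resize described above, here is a minimal sketch of the geometry only. It assumes the page is scaled uniformly and centered on a fixed 1648 x 2048 canvas; the function name `fit_geometry` and the constant names are hypothetical, and the PR's actual helper may compute its canvas differently.

```python
# Target size quoted in the PR description: 2048 px high, 1648 px wide.
# These constant names are illustrative, not the ones used in the PR.
TARGET_WIDTH = 1648
TARGET_HEIGHT = 2048


def fit_geometry(width: int, height: int) -> tuple[float, int, int]:
    """Return (scale, x_offset, y_offset) for centering a page on the target canvas.

    The page is scaled uniformly so it fits inside the target canvas without
    distortion, then centered; the offsets are the padding on the left and top.
    """
    # Uniform scale: the tighter of the two axes decides.
    scale = min(TARGET_WIDTH / width, TARGET_HEIGHT / height)
    scaled_width = round(width * scale)
    scaled_height = round(height * scale)
    # Center the scaled page; integer division keeps the offsets in whole pixels.
    x_offset = (TARGET_WIDTH - scaled_width) // 2
    y_offset = (TARGET_HEIGHT - scaled_height) // 2
    return scale, x_offset, y_offset


# A square 1000 x 1000 page is width-limited, so it gets padding above and below.
scale, x_offset, y_offset = fit_geometry(1000, 1000)
```

The scale and offsets returned here are exactly what is needed later to map model bounding boxes back onto the original page.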

Note

Aligns input preprocessing with Nemotron-Parse training and corrects bbox -> original image coordinate mapping.

- Add `NEMOTRON_PARSE_TARGET_WIDTH`/`HEIGHT` and `fit_image_to_target_aspect_ratio()` to scale/center pages onto a minimal canvas with the model’s aspect ratio
- `parse_pdf_to_pages`: new `optimize_aspect_ratio` flag (default `True`); apply aspect-ratio fit before border padding for both bbox and no-bbox flows
- Fix bbox back-projection by subtracting aspect/border offsets and dividing by scale in both detection fallback and media extraction
- Extend tests: import new constants/util, and add thorough tests for `fit_image_to_target_aspect_ratio()`

Written by Cursor Bugbot for commit 3ebfbd6.
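The back-projection step in the note above (subtract offsets, then divide by scale) can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `back_project_bbox` is a hypothetical name, the bbox is taken to be normalized to the padded canvas (as in the `xmin`/`ymin`/`xmax`/`ymax` values in the error message), and the aspect-ratio and border offsets are folded into a single `x_offset`/`y_offset` pair.

```python
def back_project_bbox(
    bbox: dict[str, float],
    canvas_width: int,
    canvas_height: int,
    scale: float,
    x_offset: float,
    y_offset: float,
) -> dict[str, float]:
    """Map a normalized bbox on the padded canvas to pixels on the original page.

    Hypothetical helper: assumes the page was scaled by `scale` and then
    shifted by (x_offset, y_offset) pixels when it was placed on the canvas.
    """
    return {
        # Normalized -> canvas pixels, undo the padding shift, undo the scaling.
        "xmin": (bbox["xmin"] * canvas_width - x_offset) / scale,
        "ymin": (bbox["ymin"] * canvas_height - y_offset) / scale,
        "xmax": (bbox["xmax"] * canvas_width - x_offset) / scale,
        "ymax": (bbox["ymax"] * canvas_height - y_offset) / scale,
    }


# Example: a 1000 x 1000 page scaled by 1.648 and centered on a 1648 x 2048
# canvas, which leaves 200 px of padding above and below the page.
original = back_project_bbox(
    {"xmin": 0.5, "ymin": 0.5, "xmax": 1.0, "ymax": 0.75},
    canvas_width=1648,
    canvas_height=2048,
    scale=1.648,
    x_offset=0,
    y_offset=200,
)
```

Skipping the offset subtraction would shift every box by the padding amount, which is the kind of mapping error the note says this PR corrects.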

Copilot

Pull request overview

This PR updates the nemotron-based PDF reader to resize rendered pages to Nemotron-Parse’s native aspect ratio/dimensions before calling the API, and to correctly map bounding boxes back into original PDF coordinates. This is intended to avoid invalid bounding boxes (e.g., negative coordinates) and improve robustness when parsing PDFs like the cited Neuron article.

Changes:

- Introduced `NEMOTRON_PARSE_TARGET_WIDTH`/`HEIGHT` and a new helper `fit_image_to_target_aspect_ratio` to scale and center page images onto a canvas matching Nemotron-Parse’s training aspect ratio.
- Extended `parse_pdf_to_pages` to optionally apply this aspect-ratio optimization (enabled by default) for both media and non-media parsing paths, and adjusted bounding box coordinate transformations to account for aspect and border padding.
- Added unit tests for `fit_image_to_target_aspect_ratio` covering scaling, aspect-ratio preservation, and mode preservation, and wired these constants/functions into existing tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `packages/paper-qa-nemotron/src/paperqa_nemotron/reader.py` | Adds Nemotron-Parse target dimensions, aspect-ratio fitting helper, and integrates aspect-ratio optimization and updated bbox back-projection into the PDF parsing pipeline. |
| `packages/paper-qa-nemotron/tests/test_paperqa_nemotron.py` | Imports the new constants/helper and adds focused tests validating `fit_image_to_target_aspect_ratio` behavior across image sizes and modes. |


MicPie

lgtm, nice fix based on the training


jamesbraza requested review from MicPie, mskarlin, sidnarayanan and whitead January 22, 2026 05:29

jamesbraza self-assigned this Jan 22, 2026

Copilot AI review requested due to automatic review settings January 22, 2026 05:29

jamesbraza added the enhancement New feature or request label Jan 22, 2026

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 22, 2026

Copilot started reviewing on behalf of jamesbraza January 22, 2026 05:29

dosubot bot added the bug Something isn't working label Jan 22, 2026

Copilot AI reviewed Jan 22, 2026


MicPie approved these changes Jan 23, 2026


jamesbraza force-pushed the nemotron-parse-aspect-ratio branch from d6cadbd to ddfa4d0 January 27, 2026 21:52

jamesbraza added 2 commits January 27, 2026 14:36

Added image resizing for nemotron-parse to match training data, with …

13c09b6

…tests

Allowing asterisk around 'Paper Search' text

472da8c

jamesbraza force-pushed the nemotron-parse-aspect-ratio branch from ddfa4d0 to 472da8c January 27, 2026 22:36
