SEC: Improve handling of partially broken PDF files #3594

stefan6419846 · 2026-01-09T10:56:54Z

This change does indeed contain multiple fixes, but as all of them are about partially broken PDF files and possibly security-related, I decided to put them into one changeset.

Besides renaming variables to make them more readable, this includes the following changes:

1. When searching through a PDF file which does not define a `/Root` entry in the trailer but declares a large `/Size` value there, we would try to access each object number until the limit defined by `/Size` had been reached. This behavior can now be controlled by a new parameter to `PdfReader`, which has a more sensible default.
2. When a broken `startxref` table is discovered, we try to rebuild it from scratch. This used a regex-based approach, which turned out to be problematic for files consisting of lots of whitespace characters. By replacing the regex-based approach with a manual search based upon `string.find()`, we were able to drastically improve the performance in such cases.
3. When flattening the pages of a PDF file, having one of the `/Kids` of the `/Pages` catalog entry reference the `/Pages` entry again would recurse until Python itself raised a recursion error. This has been changed to explicitly check for such cyclic references.
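
For readers who want to see the shape of these three guards, here is a minimal sketch. It is not the code from this changeset: it works on a toy `dict` of objects rather than on pypdf's internals, and every name in it (`find_root`, `find_all`, `flatten_pages`, the `limit` argument) is hypothetical. In particular, the name of the new `PdfReader` parameter is not stated above, so `limit` only stands in for it.

```python
from typing import Any, Dict, Iterator, List, Optional, Set

# Toy model: object number -> dictionary. This is NOT pypdf's internal
# representation; all names below are hypothetical.
Objects = Dict[int, Dict[str, Any]]


def find_root(objects: Objects, size: int, limit: int) -> Optional[int]:
    """Scan object numbers for a /Catalog, probing at most `limit` of them."""
    for number in range(min(size, limit)):
        candidate = objects.get(number)
        if candidate is not None and candidate.get("/Type") == "/Catalog":
            return number
    return None


def find_all(data: bytes, needle: bytes) -> Iterator[int]:
    """Yield every offset of `needle` using bytes.find() instead of a regex."""
    position = data.find(needle)
    while position != -1:
        yield position
        position = data.find(needle, position + 1)


def flatten_pages(objects: Objects, pages_id: int) -> List[int]:
    """Collect leaf page ids in order, failing fast on cyclic /Kids."""
    leaves: List[int] = []
    on_path: Set[int] = set()  # /Pages nodes between the root and the current node

    def visit(node_id: int) -> None:
        if node_id in on_path:
            raise ValueError(f"cyclic /Kids reference to object {node_id}")
        node = objects[node_id]
        if node.get("/Type") == "/Page":
            leaves.append(node_id)
            return
        on_path.add(node_id)
        for kid_id in node.get("/Kids", []):
            visit(kid_id)
        on_path.remove(node_id)

    visit(pages_id)
    return leaves
```

With a trailer claiming `/Size` of one billion, `find_root(objects, 10**9, limit=1000)` probes at most 1000 object numbers. A `/Pages` node that lists itself among its `/Kids` makes `flatten_pages` raise a `ValueError` on the second visit instead of recursing until the interpreter gives up. The sketch tracks only the nodes on the current path, so it rejects true cycles; whether the actual fix uses this or a stricter "seen before" check is not described above.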


stefan6419846 · 2026-01-09T10:58:09Z

Test file: root_object_recovery_limit.pdf

codecov · 2026-01-09T11:09:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.35%. Comparing base (7126880) to head (5177c1e).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3594      +/-   ##
==========================================
+ Coverage   97.31%   97.35%   +0.03%     
==========================================
  Files          55       55              
  Lines        9769     9815      +46     
  Branches     1780     1791      +11     
==========================================
+ Hits         9507     9555      +48     
+ Misses        155      153       -2     
  Partials      107      107

@stefan6419846

## What's new

### Security (SEC)
- Improve handling of partially broken PDF files (#3594) by @stefan6419846

### Deprecations (DEP)
- Block common page content modifications when assigned to reader (#3582) by @stefan6419846

### New Features (ENH)
- Embellishments to generated text appearance streams (#3571) by @PJBrs

### Bug Fixes (BUG)
- Do not consider multi-byte BOM-like sequences as BOMs (#3589) by @stefan6419846

### Robustness (ROB)
- Avoid empty FlateDecode outputs without warning (#3579) by @stefan6419846

### Documentation (DOC)
- Add outlines documentation and link it in User Guide (#3511) by @mainuddin-md

### Developer Experience (DEV)
- Add PyPy 3.11 to test matrix and benchmarks (#3574) by @rassie

### Maintenance (MAINT)
- Fix compatibility with Pillow >= 12.1.0 (#3590) by @stefan6419846

[Full Changelog](6.5.0...6.6.0)

stefan6419846 added 3 commits January 6, 2026 18:13

only determine the reference of /Pages once, not for each page

79bdd84

Merge branch 'main' into reader-performance

5b5e80e

add missing test file URL

5177c1e

stefan6419846 merged commit 2941657 into py-pdf:main Jan 9, 2026
18 checks passed

stefan6419846 deleted the reader-performance branch January 9, 2026 11:12
