@stefan6419846
Collaborator

This change does indeed contain multiple fixes, but as all of them concern partially broken PDF files and are possibly security-related, I decided to put them into one changeset.

Besides renaming variables to make them more readable, this includes the following changes:

  1. When searching through a PDF file that does not define a `/Root` entry in the trailer but declares a large `/Size` value there, we would try to access every object number until the limit defined by `/Size` had been reached. This behavior can now be controlled by a new parameter to `PdfReader`, which defaults to a more sensible value.

  2. When a broken `startxref` table is discovered, we try to rebuild it from scratch. This used a regex-based approach, which turned out to be slow for files consisting of lots of whitespace characters. By replacing the regex-based approach with a manual search based upon `string.find()`, we were able to drastically improve the performance in such cases.

  3. When flattening the pages of a PDF file, having one of the `/Kids` of the `/Pages` catalog entry reference the `/Pages` entry again would recurse until Python itself raised a recursion error. This is now detected by an explicit check for such cyclic references.
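
For readers who want a feel for the three ideas, here is a minimal, self-contained sketch. It is not the code from this changeset: the function names, the dictionary-based object model, and the default `limit` value are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

WHITESPACE = b" \t\r\n\x0c\x00"


def recover_root(objects: Dict[int, dict], size: int, limit: int = 10_000) -> Optional[int]:
    """Probe object numbers for a catalog, but never more than `limit` of them.

    `limit` and its default are hypothetical; they only illustrate capping the
    work that an oversized /Size value could otherwise trigger.
    """
    for number in range(min(size, limit)):
        candidate = objects.get(number)
        if candidate is not None and candidate.get("Type") == "Catalog":
            return number
    return None


def _read_int_backwards(data: bytes, end: int) -> Tuple[Optional[int], int]:
    """Read a decimal integer ending before `end`, skipping trailing whitespace."""
    while end > 0 and data[end - 1] in WHITESPACE:
        end -= 1
    start = end
    while start > 0 and data[start - 1:start].isdigit():
        start -= 1
    if start == end:
        return None, end
    return int(data[start:end]), start


def find_object_offsets(data: bytes) -> Dict[int, int]:
    """Locate "N G obj" headers with bytes.find() instead of a regex."""
    offsets: Dict[int, int] = {}
    position = data.find(b"obj")
    while position != -1:
        generation, generation_start = _read_int_backwards(data, position)
        if generation is not None:
            number, number_start = _read_int_backwards(data, generation_start)
            if number is not None:
                offsets[number] = number_start
        position = data.find(b"obj", position + 3)
    return offsets


def flatten_pages(objects: Dict[int, dict], root_id: int) -> List[int]:
    """Collect leaf pages depth-first, failing early on cyclic /Kids references."""
    pages: List[int] = []

    def visit(node_id: int, ancestors: frozenset) -> None:
        if node_id in ancestors:
            raise ValueError(f"Cyclic page tree reference to object {node_id}")
        node = objects[node_id]
        if node.get("Type") == "Page":
            pages.append(node_id)
            return
        for kid in node.get("Kids", []):
            visit(kid, ancestors | {node_id})

    visit(root_id, frozenset())
    return pages
```

The scan only walks backwards from each `obj` marker, so long whitespace runs are skipped once per marker rather than being re-examined by a backtracking pattern. Tracking the ancestors of the current node, rather than all visited nodes, means only true cycles are rejected.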

@stefan6419846
Collaborator Author

Test file: root_object_recovery_limit.pdf

@codecov

codecov bot commented Jan 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.35%. Comparing base (7126880) to head (5177c1e).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3594      +/-   ##
==========================================
+ Coverage   97.31%   97.35%   +0.03%     
==========================================
  Files          55       55              
  Lines        9769     9815      +46     
  Branches     1780     1791      +11     
==========================================
+ Hits         9507     9555      +48     
+ Misses        155      153       -2     
  Partials      107      107              


@stefan6419846 stefan6419846 merged commit 2941657 into py-pdf:main Jan 9, 2026
18 checks passed
@stefan6419846 stefan6419846 deleted the reader-performance branch January 9, 2026 11:12
stefan6419846 added a commit that referenced this pull request Jan 9, 2026
## What's new

### Security (SEC)
- Improve handling of partially broken PDF files (#3594) by @stefan6419846

### Deprecations (DEP)
- Block common page content modifications when assigned to reader (#3582) by @stefan6419846

### New Features (ENH)
- Embellishments to generated text appearance streams (#3571) by @PJBrs

### Bug Fixes (BUG)
- Do not consider multi-byte BOM-like sequences as BOMs (#3589) by @stefan6419846

### Robustness (ROB)
- Avoid empty FlateDecode outputs without warning (#3579) by @stefan6419846

### Documentation (DOC)
- Add outlines documentation and link it in User Guide (#3511) by @mainuddin-md

### Developer Experience (DEV)
- Add PyPy 3.11 to test matrix and benchmarks (#3574) by @rassie

### Maintenance (MAINT)
- Fix compatibility with Pillow >= 12.1.0 (#3590) by @stefan6419846

[Full Changelog](6.5.0...6.6.0)