
ROB: Handle /Pages node without /Kids during flattening #3825

Open
gaoflow wants to merge 1 commit into py-pdf:main from gaoflow:fix-3811-pages-without-kids
gaoflow commented Jun 2, 2026

Summary

Reading a PDF whose page tree contains a node with /Type /Pages but no /Kids entry (e.g. a malformed document advertising /Count 0 and no children) raises a bare KeyError: '/Kids' instead of being handled gracefully. This is the issue reported in #3811.

Cause

In _flatten, the node type is decided like this:

if PagesAttributes.TYPE in pages:
    t = cast(str, pages[PagesAttributes.TYPE])
elif PagesAttributes.KIDS not in pages:   # only reclassifies when /Type is absent
    t = "/Page"
else:
    t = "/Pages"

The existing fallback only treats a node as a single page when /Type is missing. When /Type is explicitly /Pages but /Kids is absent, the code still enters the /Pages branch and iterates pages[PagesAttributes.KIDS], which raises KeyError.
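
The failure mode can be reproduced without pypdf. The following is a minimal sketch that mirrors the branch above using plain dicts; `classify`, `count_pages` and the dict-based node are stand-ins for illustration, not pypdf's code.

```python
def classify(pages: dict) -> str:
    """Mirror the type decision quoted above, with plain dicts as stand-in nodes."""
    if "/Type" in pages:
        return pages["/Type"]
    elif "/Kids" not in pages:  # only reached when /Type is absent
        return "/Page"
    else:
        return "/Pages"


def count_pages(pages: dict) -> int:
    """Toy traversal that looks up /Kids unconditionally, like the unpatched branch."""
    if classify(pages) == "/Pages":
        # Raises KeyError when the node is typed /Pages but has no /Kids.
        return sum(count_pages(kid) for kid in pages["/Kids"])
    return 1


malformed = {"/Type": "/Pages", "/Count": 0}  # typed /Pages, no /Kids
try:
    count_pages(malformed)
    error = None
except KeyError as exc:
    error = repr(exc)  # "KeyError('/Kids')"
```

Because the explicit /Type wins in the first branch, the `"/Kids" not in pages` fallback is never consulted for this node.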

Fix

Treat a missing /Kids as an empty array, so a /Pages container with no children simply contributes no pages. len(reader.pages) then returns 0 for such a document instead of crashing. Documents that do have /Kids are unaffected (the key is still looked up exactly as before, preserving indirect-reference resolution).
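
As a rough illustration of the intended behaviour, here is a toy page-tree walk over plain dicts. It is a sketch under the assumption that nodes are ordinary dicts; the function name `count_pages` is hypothetical and the real change lives in `_flatten`.

```python
def count_pages(node: dict) -> int:
    """Toy page-tree walk over plain dicts (stand-ins for pypdf objects)."""
    if node.get("/Type") == "/Pages":
        # A missing /Kids is treated as an empty array. When the key exists
        # it is still read with node["/Kids"], exactly as before.
        kids = node["/Kids"] if "/Kids" in node else []
        return sum(count_pages(kid) for kid in kids)
    return 1  # anything else counts as a single page in this toy model


empty_tree = {"/Type": "/Pages", "/Count": 0}  # no /Kids at all
normal_tree = {
    "/Type": "/Pages",
    "/Count": 2,
    "/Kids": [{"/Type": "/Page"}, {"/Type": "/Page"}],
}

empty_count = count_pages(empty_tree)
normal_count = count_pages(normal_tree)
```

In this model the childless container contributes zero pages, while a tree that does carry /Kids is counted as before.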

Reproduction

from pypdf import PdfReader
# /Pages object: << /Type /Pages /Count 0 >>  (no /Kids)
reader = PdfReader("file.pdf")
print(len(reader.pages))   # before: KeyError: '/Kids'; after: 0
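
The snippet above needs an input file. One possible way to produce one, sketched with the standard library only, is to assemble a two-object PDF by hand. The helper name `build_pdf_without_kids` is hypothetical, and this exact file is not the one used in the PR or in #3811.

```python
def build_pdf_without_kids() -> bytes:
    """Assemble a minimal PDF whose /Pages node has /Count 0 and no /Kids."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Count 0 >>",  # deliberately no /Kids
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # every xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


data = build_pdf_without_kids()
```

Writing `data` to `file.pdf` (for example with `pathlib.Path.write_bytes`) should give the reproduction an input that exercises the same code path.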

Tests

Added test_flatten__pages_without_kids, which removes /Kids from a real document's /Pages node, sets /Count 0, and asserts len(reader.pages) == 0. It fails with KeyError: '/Kids' on main and passes with this change. Existing multi-page documents still report the correct page counts.

Closes #3811

A page tree node typed as /Pages but missing the /Kids entry (for
example a malformed document advertising /Count 0 with no children)
caused _flatten to raise a bare KeyError: '/Kids' while iterating the
kids. Treat a missing /Kids as an empty array so such files report 0
pages instead of crashing.

Closes py-pdf#3811

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.73%. Comparing base (52545c5) to head (5c04709).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3825   +/-   ##
=======================================
  Coverage   97.73%   97.73%           
=======================================
  Files          55       55           
  Lines       10417    10418    +1     
  Branches     1931     1931           
=======================================
+ Hits        10181    10182    +1     
  Misses        130      130           
  Partials      106      106           

Comment thread pypdf/_doc_common.py
inherit[attr] = pages[attr]
pages_reference = getattr(pages, "indirect_reference", object())
for page in cast(ArrayObject, pages[PagesAttributes.KIDS]):
# A malformed /Pages node may be missing /Kids (for example a page
Collaborator

I do not think that this comment contributes enough to be useful here.

Comment thread pypdf/_doc_common.py
# A malformed /Pages node may be missing /Kids (for example a page
# tree advertising "/Count 0" without any children). Treat it as
# having no kids instead of raising a bare KeyError here (#3811).
kids = (
Collaborator

This can be written in a simpler way:

Suggested change
- kids = (
+ kids = pages.get(PagesAttributes.KIDS, ArrayObject())
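
One point that may be worth checking before adopting the shorter form: in plain Python, `dict.get` does not go through an overridden `__getitem__`. The PR description relies on the subscript lookup to preserve indirect-reference resolution, so if pypdf's dictionary type resolves references only in `__getitem__` (an assumption, not confirmed in this thread), `get` would hand back the unresolved reference. A minimal sketch with hypothetical stand-in classes `Ref` and `ResolvingDict`:

```python
class Ref:
    """Hypothetical stand-in for an indirect reference."""

    def __init__(self, target):
        self.target = target

    def get_object(self):
        return self.target


class ResolvingDict(dict):
    """Hypothetical dict subclass that resolves references on subscript access only."""

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)
        return value.get_object() if isinstance(value, Ref) else value


node = ResolvingDict({"/Kids": Ref(["page-1", "page-2"])})

via_index = node["/Kids"]        # resolved list, via the overridden __getitem__
via_get = node.get("/Kids", [])  # raw Ref: dict.get bypasses __getitem__
```

If the real class behaves like this stand-in, the one-line form would need an explicit resolution step after `get`.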

Comment thread pypdf/_doc_common.py
if PagesAttributes.KIDS in pages
else ArrayObject()
)
for page in cast(ArrayObject, kids):
Collaborator

While we are at it (and although not required here), I would recommend replacing this cast and increasing the resilience here.

What I mean is that we should use an empty ArrayObject if the kids are a NullObject and raise a proper exception if we see anything different from an ArrayObject for the iteration.
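
A rough sketch of what that could look like, assuming stand-in types throughout: `ArrayObject` and `NullObject` below are local placeholders for the pypdf classes of the same name, while `MalformedPageTreeError` and `resolve_kids` are hypothetical names chosen for this example.

```python
class MalformedPageTreeError(Exception):
    """Hypothetical error type; pypdf would raise one of its own exceptions."""


class NullObject:
    """Local stand-in for pypdf's NullObject."""


class ArrayObject(list):
    """Local stand-in for pypdf's ArrayObject."""


def resolve_kids(pages: dict) -> ArrayObject:
    """Return the kids to iterate, validating the type instead of casting."""
    kids = pages.get("/Kids")
    if kids is None or isinstance(kids, NullObject):
        # Missing or null /Kids: the container contributes no pages.
        return ArrayObject()
    if not isinstance(kids, ArrayObject):
        # Anything else is reported explicitly rather than failing mid-iteration.
        raise MalformedPageTreeError(
            f"/Kids must be an array, got {type(kids).__name__}"
        )
    return kids
```

Compared with a `cast`, which only informs the type checker, the `isinstance` checks run at runtime and turn an unexpected /Kids value into a descriptive error.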

Development

Successfully merging this pull request may close these issues.

Properly deal with page count of 0
