SEC: Improve handling of partially broken PDF files #3594
Merged
+236
−38
This change does indeed contain multiple fixes, but as all of them are about partially broken PDF files and possibly security-related, I decided to put them into one changeset.
Besides renaming variables to make them more readable, this includes the following changes:
- When searching through a PDF file which does not define a `/Root` entry in the trailer while declaring a large `/Size` value there, we would try to access every object number until the limit defined by `/Size` had been reached. This behavior can now be controlled by a new parameter to `PdfReader`, which defaults to a more sensible value.
- When a broken `startxref` table is discovered, we try to rebuild it from scratch. This used a regex-based approach, which turned out to be problematic with files consisting of lots of whitespace characters. By replacing the regex-based approach with a manual search based upon `string.find()`, we were able to drastically improve the performance in such cases.
- When flattening the pages of a PDF file, having one of the `/Kids` of the `/Pages` catalog entry reference the `/Pages` entry again would recurse until Python itself raised a recursion error. This has been changed to explicitly check for such cyclic references.
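For illustration only, the last two ideas can be sketched in a few lines of Python. This is not the code from this changeset: `find_obj_keywords`, `flatten_pages`, and the dict-based object table are hypothetical names and simplifications chosen for the example, and the real parser has to handle far more than this.

```python
from typing import Any, Dict, List, Optional, Set


def find_obj_keywords(data: bytes) -> List[int]:
    """Return offsets of "obj" keywords using bytes.find() instead of a regex.

    Hypothetical, simplified scan: matches that belong to "endobj" are skipped.
    """
    offsets: List[int] = []
    pos = data.find(b"obj")
    while pos != -1:
        # Only count the match if it is not the tail of "endobj".
        if data[max(pos - 3, 0):pos] != b"end":
            offsets.append(pos)
        pos = data.find(b"obj", pos + 3)
    return offsets


def flatten_pages(
    node_id: int,
    objects: Dict[int, Dict[str, Any]],
    ancestors: Optional[Set[int]] = None,
) -> List[int]:
    """Collect page object numbers, failing fast on cyclic /Kids references.

    `objects` is an assumed stand-in for the object table: it maps an object
    number to a plain dict with "/Type" and, for /Pages nodes, "/Kids".
    """
    ancestors = set() if ancestors is None else ancestors
    if node_id in ancestors:
        # The node is its own ancestor: report it instead of recursing forever.
        raise ValueError(f"Cyclic /Kids reference to object {node_id}")
    node = objects[node_id]
    if node.get("/Type") == "/Page":
        return [node_id]
    ancestors.add(node_id)
    pages: List[int] = []
    for kid in node.get("/Kids", []):
        pages.extend(flatten_pages(kid, objects, ancestors))
    ancestors.remove(node_id)
    return pages
```

The scan only ever moves forward through the buffer, so long runs of whitespace cost one linear pass. The page walk tracks the current ancestor chain rather than every node seen, so only a true cycle is rejected.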