Length-0 streams are read incorrectly, which breaks some PDFs #3052

Open
@benrg

DictionaryObject.read_from_stream contains this code:

            if length is None:  # if the PDF is damaged
                length = -1
            pstart = stream.tell()
            if length > 0:
                data["__streamdata__"] = stream.read(length)
            else:
                data["__streamdata__"] = read_until_regex(
                    stream, re.compile(b"endstream")
                )

Since the length > 0 test sends length-0 streams to the else branch, and read_until_regex doesn't strip the end-of-line marker that precedes endstream, this will read almost all length-0 streams as b"\n" or b"\r\n" instead of b"".
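
To make the misread concrete, here is a minimal standalone sketch that mirrors the branch logic quoted above. It does not import pypdf: read_until is a hypothetical stand-in for read_until_regex (assumed to return everything before the first match), and read_stream_data is a name invented for this illustration.

```python
import io
import re


def read_until(stream, pattern):
    """Hypothetical stand-in for read_until_regex: return the bytes before
    the first match and leave the stream positioned at the match."""
    start = stream.tell()
    rest = stream.read()
    match = pattern.search(rest)
    end = match.start() if match else len(rest)
    stream.seek(start + end)
    return rest[:end]


def read_stream_data(stream, length):
    """Same branching as the quoted code (illustrative name)."""
    if length is None:  # if the PDF is damaged
        length = -1
    if length > 0:
        return stream.read(length)
    return read_until(stream, re.compile(b"endstream"))


# Both buffers start right after the "stream" keyword and its end-of-line
# marker, i.e. at the first byte of the (empty) stream data.
empty_lf = io.BytesIO(b"\nendstream\nendobj\n")
empty_crlf = io.BytesIO(b"\r\nendstream\r\nendobj\r\n")

result_lf = read_stream_data(empty_lf, 0)
result_crlf = read_stream_data(empty_crlf, 0)
print(result_lf, result_crlf)
```

With /Length 0 the else branch runs, so the end-of-line marker in front of endstream ends up in the stream data.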

I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.

Suggested fix:

  • If there exist broken PDFs in the wild with /Length 0 followed by a stream of nonzero length that pypdf needs to support, check for stream\r?\n\r?\n?endstream as a special case first before falling back to read_until_regex, to ensure that valid PDFs with length-0 streams are always read correctly.
  • Or, if there are no such PDFs, and length > 0 was just meant to catch the -1 case, change the test to length >= 0.
  • In the read_until_regex case, if the endstream keyword is preceded by \r, strip it; if it's preceded by \r\n, strip the \n, and also strip the \r if and only if the stream keyword was followed by \r. That isn't guaranteed to work, but it's probably the best one can do.
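
The second and third suggestions could be combined roughly as follows. This is only a sketch under stated assumptions, not pypdf's actual code: the function names and the stream_eol parameter (the end-of-line bytes that followed the stream keyword) are hypothetical, the regex search is replaced by a plain bytes.find, and stripping a lone \n before endstream is an assumption of the sketch that the list above does not spell out.

```python
import io


def strip_eol_before_endstream(data, stream_eol):
    """Drop the end-of-line marker that precedes endstream (heuristic)."""
    if data.endswith(b"\r\n"):
        data = data[:-1]  # always drop the \n
        if stream_eol.startswith(b"\r"):
            # The stream keyword was followed by \r, so the \r here is
            # most likely part of the end-of-line marker too.
            data = data[:-1]
        return data
    if data.endswith((b"\r", b"\n")):
        # Lone \r as suggested; lone \n is an assumption of this sketch.
        return data[:-1]
    return data


def read_stream_data(stream, length, stream_eol):
    """Trust /Length whenever it is present, including 0."""
    if length is not None and length >= 0:
        return stream.read(length)
    # Damaged PDF without a usable /Length: scan for endstream.
    start = stream.tell()
    rest = stream.read()
    end = rest.find(b"endstream")
    if end < 0:
        end = len(rest)
    stream.seek(start + end)
    return strip_eol_before_endstream(rest[:end], stream_eol)


print(read_stream_data(io.BytesIO(b"\nendstream"), 0, b"\n"))
print(read_stream_data(io.BytesIO(b"abc\r\nendstream"), None, b"\n"))
```

The heuristic can still be wrong when the stream data itself ends in \r or \n, which matches the caveat that this isn't guaranteed to work.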

Labels: generic (the generic submodule is affected); is-robustness-issue (from a user's perspective, this is about robustness)
