Description
DictionaryObject.read_from_stream contains this code:
    if length is None:  # if the PDF is damaged
        length = -1
    pstart = stream.tell()
    if length > 0:
        data["__streamdata__"] = stream.read(length)
    else:
        data["__streamdata__"] = read_until_regex(
            stream, re.compile(b"endstream")
        )
Since read_until_regex doesn't strip the trailing newline, this will read almost all length-0 streams as b"\n" or b"\r\n" instead of b"".
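The effect can be reproduced without pypdf. The sketch below uses a hypothetical stand-in, read_until_marker, that only mimics the behaviour described above (return everything before the marker, unstripped). It is not pypdf's actual read_until_regex, and the sample bytes are made up for illustration.

```python
import io
import re


def read_until_marker(stream, pattern):
    """Hypothetical stand-in: return all bytes before the first match.

    Assumption: nothing is stripped, matching the behaviour described above.
    """
    start = stream.tell()
    rest = stream.read()
    match = pattern.search(rest)
    end = match.start() if match else len(rest)
    # Leave the stream positioned at the marker.
    stream.seek(start + end)
    return rest[:end]


def read_empty_stream(eol):
    # The parser has already consumed "stream" and its end-of-line, so for a
    # zero-length body only the end-of-line before "endstream" remains.
    stream = io.BytesIO(eol + b"endstream\nendobj\n")
    return read_until_marker(stream, re.compile(b"endstream"))


lf_result = read_empty_stream(b"\n")
crlf_result = read_empty_stream(b"\r\n")
print(lf_result, crlf_result)
```

In both cases the result is the end-of-line bytes rather than an empty byte string, which is what ends up being written back as the stream body.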
I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.
Suggested fix:
- If there exist broken PDFs in the wild with /Length 0 followed by a stream of nonzero length that pypdf needs to support, check for stream\r?\n\r?\n?endstream as a special case first before falling back to read_until_regex, to ensure that valid PDFs with length-0 streams are always read correctly.
- Or, if there are no such PDFs, and length > 0 was just meant to catch the -1 case, change the test to length >= 0.
- In the read_until_regex case, if endstream is preceded by \r then strip it, or if it's preceded by \r\n then strip the \n, and strip the \r also iff stream was followed by \r. That isn't guaranteed to work, but it's probably the best one can do.
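A rough sketch of how the first and third suggestions could fit together is shown here. read_stream_data and its keyword_eol parameter are hypothetical names introduced for illustration, not pypdf code, and the stripping rule is one possible reading of the third suggestion (it also strips a lone \n).

```python
import io

# An empty body leaves at most one end-of-line before "endstream".
EMPTY_BODIES = (b"", b"\n", b"\r", b"\r\n")


def read_stream_data(stream, length, keyword_eol=b"\n"):
    """Hypothetical helper: read a stream body given its /Length.

    keyword_eol is the end-of-line that followed the "stream" keyword.
    """
    if length is not None and length > 0:
        return stream.read(length)

    start = stream.tell()
    rest = stream.read()
    end = rest.find(b"endstream")
    if end < 0:
        end = len(rest)
    stream.seek(start + end)  # leave the position at the keyword
    data = rest[:end]

    # First suggestion: /Length 0 with only an optional end-of-line before
    # "endstream" is a valid empty stream.
    if length == 0 and data in EMPTY_BODIES:
        return b""

    # Third suggestion: strip the end-of-line that precedes "endstream".
    if data.endswith(b"\r\n"):
        # Drop the \r as well only if "stream" was also followed by \r.
        return data[:-2] if keyword_eol.startswith(b"\r") else data[:-1]
    if data.endswith((b"\r", b"\n")):
        return data[:-1]
    return data


print(read_stream_data(io.BytesIO(b"\r\nendstream"), 0, b"\r\n"))
print(read_stream_data(io.BytesIO(b"xyz\nendstream"), 0))
```

The second suggestion would be simpler still: change the first test to length >= 0 so that a declared length of 0 reads zero bytes, and keep the fallback only for a missing length.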