Length-0 streams are read incorrectly, which breaks some PDFs

`DictionaryObject.read_from_stream` contains this code:

```python
            if length is None:  # if the PDF is damaged
                length = -1
            pstart = stream.tell()
            if length > 0:
                data["__streamdata__"] = stream.read(length)
            else:
                data["__streamdata__"] = read_until_regex(
                    stream, re.compile(b"endstream")
                )
```

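Because the test is `length > 0`, a stream with `/Length 0` falls through to the `read_until_regex` branch. Since `read_until_regex` doesn't strip the end-of-line marker that precedes `endstream`, almost all length-0 streams are read as `b"\n"` or `b"\r\n"` instead of `b""`.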
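
A minimal, self-contained sketch of the effect follows. It does not import pypdf: `read_until` is a simplified stand-in I wrote for `read_until_regex`, and `read_stream_data` is a hypothetical wrapper that only mirrors the branch quoted above. The input bytes are assumed to be what follows the `stream` keyword and its end-of-line marker in an empty stream.

```python
import io
import re


def read_until(stream, pattern):
    """Simplified stand-in for read_until_regex (an assumption, not pypdf code).

    Returns every byte before the first match of `pattern` and leaves the
    stream positioned at the start of that match.
    """
    start = stream.tell()
    rest = stream.read()
    match = pattern.search(rest)
    if match is None:
        return rest
    stream.seek(start + match.start())
    return rest[: match.start()]


def read_stream_data(stream, length):
    """Hypothetical wrapper mirroring the branch quoted from read_from_stream."""
    if length is None:  # damaged PDF, as in the quoted code
        length = -1
    if length > 0:
        return stream.read(length)
    # /Length 0 ends up here, together with the "unknown length" case.
    return read_until(stream, re.compile(b"endstream"))


# Bytes following "stream" + EOL for an empty stream, with LF and CRLF endings.
lf_body = io.BytesIO(b"\nendstream\nendobj\n")
crlf_body = io.BytesIO(b"\r\nendstream\r\nendobj\r\n")

print(read_stream_data(lf_body, 0))    # b'\n'
print(read_stream_data(crlf_body, 0))  # b'\r\n'
```

In both cases the end-of-line marker before `endstream` ends up in the stream data, so the stream is no longer empty when written back out.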

I have some PDFs with creator `PFU ScanSnap Manager 5.1.30 #S1500` that contain JBIG2-encoded pages with `/JBIG2Globals` pointing to an empty stream object. After loading and saving them with pypdf, the `/JBIG2Globals` stream is invalid, and some (not all) PDF viewers fail to render the pages.

Suggested fix:

* If there exist broken PDFs in the wild with `/Length 0` followed by a stream of nonzero length that pypdf needs to support, check for `stream\r?\n\r?\n?endstream` as a special case first before falling back to `read_until_regex`, to ensure that valid PDFs with length-0 streams are always read correctly.
* Or, if there are no such PDFs, and `length > 0` was just meant to catch the `-1` case, change the test to `length >= 0`.
* In the `read_until_regex` case, if `endstream` is preceded by `\r` then strip it, or if it's preceded by `\r\n` then strip the `\n`, and strip the `\r` also iff `stream` was followed by `\r`. That isn't guaranteed to work, but it's probably the best one can do.
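
The suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions, not a patch against pypdf: `read_stream_body` and its parameters are hypothetical names, it works on the bytes that follow the `stream` keyword and its end-of-line marker rather than on a file object, and the end-of-line stripping is my reading of the third bullet (one trailing `\n` or `\r` is removed, and the `\r` of a trailing `\r\n` is removed only if `stream` itself was followed by `\r\n`).

```python
import re

ENDSTREAM = re.compile(rb"endstream")
# An empty body: at most one EOL marker between the stream start and endstream.
EMPTY_BODY = re.compile(rb"\r?\n?endstream")


def read_stream_body(rest, length, keyword_eol=b"\n"):
    """Hypothetical helper (not pypdf API) returning the stream data.

    rest        -- bytes following the `stream` keyword and its EOL marker
    length      -- declared /Length, or None if missing or unreadable
    keyword_eol -- the EOL marker that followed `stream` (b"\n" or b"\r\n")
    """
    # Special case first: a valid length-0 stream must always yield b"".
    if length == 0 and EMPTY_BODY.match(rest):
        return b""
    # Normal case: trust a positive declared length.
    if length is not None and length > 0:
        return rest[:length]
    # Fallback for damaged files: scan for endstream, then strip one EOL.
    match = ENDSTREAM.search(rest)
    data = rest if match is None else rest[: match.start()]
    if data.endswith(b"\r\n"):
        # Drop the \n; drop the \r too only if the file uses CRLF after `stream`.
        data = data[:-2] if keyword_eol == b"\r\n" else data[:-1]
    elif data.endswith((b"\n", b"\r")):
        data = data[:-1]
    return data


print(read_stream_body(b"\nendstream\nendobj\n", 0))             # b''
print(read_stream_body(b"\r\nendstream\r\nendobj", 0, b"\r\n"))  # b''
print(read_stream_body(b"abc\nendstream", 0))                    # b'abc'
```

As the issue notes, the fallback cannot be made fully reliable, because binary stream data may legitimately end in `\r` or `\n`; the special case for empty bodies is the part that makes valid length-0 streams round-trip.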

Length-0 streams are read incorrectly, which breaks some PDFs #3052
