Cannot extract text from a certain page in a document, due to an unexpectedly low number of operands in a cm operator. #3262

Open
@nihohit

Description

Chrome and macOS Preview both open the PDF without any issue.

pdf-online validator's output:


File | example.pdf
-- | --
Compliance | pdf1.7
Result | Document does not conform to PDF/A.
Details | Validating file "example.pdf" for conformance level pdf1.7 The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 8226 in font 'STSong-Light' is missing. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 169 in font 'STSong-Light' is missing. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 9899 in font 'STSong-Light' is missing. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 8226 in font 'STSong-Light' is missing. The encoding for character code 183 in font 'STSong-Light' is missing. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 8226 in font 'STSong-Light' is missing. The encoding for character code 9899 in font 'STSong-Light' is missing. The name F1 of a font resource is unknown. The name FormXob.a31602a3f14463f4d5d3143608a8d452 of a xobject resource is unknown. The encoding for character code 8226 in font 'STSong-Light' is missing.

Environment

$ python -m platform
macOS-15.3.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.0, crypt_provider=('cryptography', '43.0.0'), PIL=none

Also reproduced on Ubuntu 22.04 and in a Jupyter notebook.

Code + PDF

import pdb
import sys
import traceback

from pypdf import PdfReader

reader = PdfReader("example.pdf")
# page_number is 0-based, so index 49 is the 50th page of the document.
page = reader.pages[49]

try:
    text = page.extract_text()
except Exception:
    traceback.print_exc()  # Optional: print the full traceback
    # Open the debugger in the frame that raised the exception.
    pdb.post_mortem(sys.exc_info()[2])

I cannot share the PDF, since it contains proprietary information, nor do I know how it was encoded.
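Since only one page fails, a caller can keep the rest of the document usable by extracting page by page and recording failures. The following is a minimal sketch of that workaround. `extract_text_tolerant` is a hypothetical helper name, not part of pypdf. It only assumes that each page object exposes `extract_text()`, as in the snippet above.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def extract_text_tolerant(
    pages: Iterable[Any],
) -> Tuple[Dict[int, Optional[str]], Dict[int, Exception]]:
    """Extract text page by page, recording failures instead of aborting.

    Hypothetical helper: `pages` is any iterable of objects that have an
    extract_text() method (for example, PdfReader(...).pages).
    """
    texts: Dict[int, Optional[str]] = {}
    errors: Dict[int, Exception] = {}
    for index, page in enumerate(pages):
        try:
            texts[index] = page.extract_text()
        except Exception as exc:  # deliberately broad: any page-level failure
            texts[index] = None
            errors[index] = exc
    return texts, errors


# Intended usage with pypdf (not executed here):
#   texts, errors = extract_text_tolerant(PdfReader("example.pdf").pages)
#   for index, exc in errors.items():
#       print(f"page {index}: {type(exc).__name__}: {exc}")
```

With the file from this report, the expectation is that `errors` would contain a single entry for index 49. That cannot be confirmed here, since the PDF is not available.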

Traceback

WARNING:pypdf._page:Impossible to decode XFormObject /FormXob.a31602a3f14463f4d5d3143608a8d452: '/XObject'
Traceback (most recent call last):
  File "<ipython-input-4-2a61f328ceb7>", line 6, in <cell line: 0>
    text = page.extract_text()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pypdf/_page.py", line 2378, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pypdf/_page.py", line 2148, in _extract_text
    process_operation(operator, operands)
  File "/usr/local/lib/python3.11/dist-packages/pypdf/_page.py", line 1961, in process_operation
    cm_matrix = mult(
                ^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pypdf/_text_extraction/__init__.py", line 72, in mult
    m[2] * n[0] + m[3] * n[2],
                  ~^^^
IndexError: list index out of range
> /usr/local/lib/python3.11/dist-packages/pypdf/_text_extraction/__init__.py(72)mult()
     70         m[0] * n[0] + m[1] * n[2],
     71         m[0] * n[1] + m[1] * n[3],
---> 72         m[2] * n[0] + m[3] * n[2],
     73         m[2] * n[1] + m[3] * n[3],
     74         m[4] * n[0] + m[5] * n[2] + n[4],

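In the debugger, m is [0.70278, 65.3, 163.36]. It holds only three values, while mult indexes m[0] through m[5], so the lookup of m[3] fails.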
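To make the failure concrete, here is a standalone sketch that reproduces the `IndexError` with the three values above and shows one possible guard. It is not pypdf's code: `mult` is re-created from the lines visible in the debugger listing (the sixth entry is filled in from the standard affine-matrix product), and `checked_cm_operands` is a hypothetical helper. Falling back to leaving the matrix unchanged when the operand count is wrong is an assumption about a reasonable policy, not a confirmed fix.

```python
from typing import List, Optional, Sequence

# Identity transform in the six-number form [a, b, c, d, e, f].
IDENTITY: List[float] = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def mult(m: Sequence[float], n: Sequence[float]) -> List[float]:
    """Multiply two affine matrices given as six numbers each."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def checked_cm_operands(operands: Sequence[float]) -> Optional[List[float]]:
    """Hypothetical guard: return six floats, or None if the count is wrong."""
    if len(operands) != 6:
        return None
    return [float(value) for value in operands]


malformed = [0.70278, 65.3, 163.36]  # the value of m reported in this issue

# Unguarded: fails on m[3], as in the traceback.
try:
    mult(malformed, IDENTITY)
except IndexError as exc:
    print("unguarded call failed:", exc)

# Guarded: skip the malformed operator and keep the current matrix.
current = IDENTITY
operands = checked_cm_operands(malformed)
if operands is not None:
    current = mult(operands, current)
print("matrix after guarded cm:", current)
```

Skipping the operator keeps text extraction running at the cost of possibly wrong positioning for the affected content. Chrome and Preview open the file without complaint, which suggests that viewers tolerate this input in some way, but how they interpret the short operand list is not known from this report.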

Labels

is-robustness-issue: From a users perspective, this is about robustness
needs-pdf: The issue needs a PDF file to show the problem
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
