
Wrong characters during extract_text with /Differences for font /TJQCZS+FzBookMaker2DlFont #2605

Open
@zailushang2006

Description


I need to extract text from a PDF document with page.extract_text(), but all of the extracted Chinese characters are garbled. I suspect the cause is that the document uses special Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. Stepping through the pypdf source in a debugger, I found that the /Font->/Encoding->/Differences table maps character codes to non-standard glyph names:

{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
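To make the table above easier to read, here is a minimal sketch that expands a /Differences array into a code-to-glyph-name dictionary. The helper name `parse_differences` is hypothetical and not part of pypdf. It relies only on the layout of the array: an integer sets the current character code, and each glyph name that follows takes the next consecutive code.

```python
from typing import Dict, List, Optional, Union


def parse_differences(differences: List[Union[int, str]]) -> Dict[int, str]:
    """Expand a PDF /Differences array into {character code: glyph name}.

    Hypothetical helper for illustration; not part of pypdf.
    """
    mapping: Dict[int, str] = {}
    code: Optional[int] = None
    for item in differences:
        if isinstance(item, int):
            # An integer resets the running character code.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = item
            # Consecutive names are assigned consecutive codes.
            code += 1
    return mapping


# The array reported in this issue.
differences = [
    35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
    40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
    45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31',
]
mapping = parse_differences(differences)

# Check whether every glyph name is just "G" plus the code in hexadecimal.
names_are_hex_codes = all(
    name == f"/G{code:02X}" for code, name in mapping.items()
)
print(len(mapping), names_are_hex_codes)
```

In this array each glyph name is simply the character code written in hexadecimal (35 is 0x23, hence /G23), so the names themselves do not appear to carry any information about which Chinese character a code stands for.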

I also decoded the embedded font file at /Font->/FontDescriptor->/FontFile3 using its declared /Filter (/FlateDecode), but the decoded font data still looks garbled.
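A /FontFile3 stream holds a binary font program (in the PDF specification its /Subtype is Type1C, CIDFontType0C or OpenType), so the decoded bytes are not expected to be readable text. The sketch below is a rough heuristic for telling a plausible font program from a failed decode. The function name `describe_font_program` is hypothetical, and the header check is only a sanity test, not a validator.

```python
import zlib


def describe_font_program(data: bytes) -> str:
    """Guess the container type of an already-decoded /FontFile3 stream.

    Hypothetical helper for illustration; the checks are heuristics only.
    """
    if data[:4] == b"OTTO":
        # OpenType fonts with CFF outlines start with the tag "OTTO".
        return "OpenType (CFF outlines)"
    if len(data) >= 4 and data[0] == 1 and 1 <= data[3] <= 4:
        # A bare CFF header is: major version (1), minor version,
        # header size, and an offset size between 1 and 4.
        return "bare CFF"
    return "unrecognized"


# Synthetic stand-in for a compressed font stream (not a real font).
fake_stream = zlib.compress(bytes([1, 0, 4, 1]) + b"\x00" * 16)

# /FlateDecode is zlib/deflate compression, so zlib.decompress undoes it.
decoded = zlib.decompress(fake_stream)
print(describe_font_program(decoded))
```

If the data was read through pypdf, `get_data()` on the stream object already returns the decoded bytes, so the `zlib.decompress` step would not be needed there.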

Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

# Path to the sample file attached below.
pdf_path = "GB+15322.2-2019.pdf"
reader = PdfReader(pdf_path)

number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")

# Only the page at index 3 (0-based) is needed to reproduce the problem.
page = reader.pages[3]
text = page.extract_text()
print(text[:5000])

Share here the PDF file(s) that cause the issue.
GB+15322.2-2019.pdf
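One thing that may be worth checking in the attached file is whether the affected fonts carry a /ToUnicode entry, since text extraction generally needs some mapping from character codes to Unicode. The sketch below summarizes a page's font resources. The function names are hypothetical, `summarize_fonts` works on any dict-like font dictionary, and the pypdf wrapper assumes that /Resources is stored directly on the page rather than inherited from a parent node.

```python
from typing import Any, Dict, List, Mapping


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def summarize_fonts(fonts: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Report encoding-related entries for each font in a /Font dictionary.

    Hypothetical helper for illustration; not part of pypdf.
    """
    rows = []
    for name, ref in fonts.items():
        font = _resolve(ref)
        encoding = _resolve(font.get("/Encoding"))
        rows.append({
            "resource": name,
            "base_font": str(font.get("/BaseFont")),
            "has_to_unicode": "/ToUnicode" in font,
            # /Encoding may be a name (e.g. /WinAnsiEncoding) or a dictionary.
            "has_differences": hasattr(encoding, "get")
            and "/Differences" in encoding,
        })
    return rows


def summarize_page_fonts(pdf_path: str, page_index: int) -> List[Dict[str, Any]]:
    """Run summarize_fonts on one page of a PDF (requires pypdf)."""
    from pypdf import PdfReader  # imported lazily so the demo runs without it

    page = PdfReader(pdf_path).pages[page_index]
    resources = _resolve(page.get("/Resources", {}))
    return summarize_fonts(_resolve(resources.get("/Font", {})))


# Plain-dict stand-in modeled on the values quoted in this issue.
# The resource name "/F1" is made up for the demo.
demo_fonts = {
    "/F1": {
        "/BaseFont": "/TJQCZS+FzBookMaker2DlFont",
        "/Encoding": {"/Type": "/Encoding", "/Differences": [35, "/G23"]},
    }
}
rows = summarize_fonts(demo_fonts)
print(rows)
```

With the sample file, `summarize_page_fonts("GB+15322.2-2019.pdf", 3)` would give the same kind of report for the page used in the reproduction above.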

Traceback

This is the complete traceback I see:

Page 3 (0-based index):

[screenshot: 84971221-CBF2-46dc-B435-6ADF2271A1D4]

Printed result:

[screenshot: 686E886A-E4B7-4bb5-9BAC-05A609334090]


Labels

is-cjk-issue: Issue related to CJK (Chinese-Japanese-Korean)
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
