extract_text() returns garbled characters #2330

Open
@ChanghaoLau

Description

I get garbled characters when extracting text from a PDF file. The file I used is linked from the original issue. Could this be an encoding issue?

Environment

$ python -m platform
Linux-4.18.0-147.5.1.6.h841.eulerosv2r9.x86_64-x86_64-with-glibc2.17

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

file_path = '20120812.pdf'
page_idx = 0

reader = PdfReader(file_path)
page = reader.pages[page_idx]
text = page.extract_text()
print(text)

The PDF file can be obtained from the URL linked in the original issue.
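Since the question is whether encoding is involved, one thing worth checking is how the page's fonts are declared. PDF text extraction generally relies on each font's /ToUnicode CMap or /Encoding entry to map character codes to Unicode, so a font lacking both is a plausible (not confirmed) suspect. The sketch below is only a diagnostic aid, not part of the original report: `summarize_font` and `page_font_report` are hypothetical helper names, and the code assumes the page carries its own /Resources and /Font entries rather than inheriting them from a parent node.

```python
from typing import Any, Mapping


def summarize_font(font: Mapping[str, Any]) -> dict:
    """Summarize the entries of a font dictionary that affect text decoding."""
    return {
        "base_font": str(font.get("/BaseFont", "")),
        "subtype": str(font.get("/Subtype", "")),
        "has_encoding": "/Encoding" in font,
        "has_to_unicode": "/ToUnicode" in font,
    }


def page_font_report(pdf_path: str, page_index: int = 0) -> dict:
    """Return a summary for every font listed in the page's resources.

    Assumption: the page has its own /Resources -> /Font dictionary.
    """
    from pypdf import PdfReader  # imported lazily so summarize_font works without pypdf

    page = PdfReader(pdf_path).pages[page_index]
    fonts = page["/Resources"]["/Font"]
    # Indexing with fonts[name] resolves indirect references to the font dictionary.
    return {name: summarize_font(fonts[name]) for name in fonts}
```

Calling `page_font_report('20120812.pdf', 0)` would list the fonts used on the first page. Fonts reported with `has_to_unicode` set to `False` are the first place to look, although a /ToUnicode map that is present but incorrect can produce garbled output as well.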

The output is:

2012୍8ᄅ ACTA AUTOMATICA SINICA August, 2012
م
ᇛ ਟ1ࡹ1ྷ ೦2ᅦ ม1
ᅋေم, ྛऊো ,ۋ, ০Ⴈ
......
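The output above mixes characters from many unrelated scripts, which is a typical symptom of character codes being mapped to the wrong Unicode code points. As a rough way to flag such output automatically, the following standard-library sketch counts how many distinct scripts appear among the non-ASCII letters and marks. The names `script_histogram` and `looks_garbled` and the threshold of three scripts are illustrative assumptions, and the script is approximated by the first word of each character's Unicode name, so this is a heuristic only.

```python
import unicodedata
from collections import Counter


def script_histogram(text: str) -> Counter:
    """Count non-ASCII letters and marks by the first word of their Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        # Skip ASCII and anything that is not a letter (L*) or a mark (M*).
        if ch.isascii() or unicodedata.category(ch)[0] not in "LM":
            continue
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def looks_garbled(text: str, max_scripts: int = 3) -> bool:
    """Heuristic: True when more distinct scripts appear than max_scripts."""
    return len(script_histogram(text)) > max_scripts
```

For example, `looks_garbled(page.extract_text())` could be run over each page to find the ones that need a closer look. Legitimately multilingual text can exceed the threshold, so a positive result is a hint rather than proof.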


Labels: Has MCVE, workflow-text-extraction
