Description
I get garbled characters when extracting text from a PDF file. The file I use is this. Could this be an encoding issue?
Environment
$ python -m platform
Linux-4.18.0-147.5.1.6.h841.eulerosv2r9.x86_64-x86_64-with-glibc2.17
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
file_path = '20120812.pdf'
page_idx = 0
reader = PdfReader(file_path)
page = reader.pages[page_idx]
text = page.extract_text()
print(text)
The PDF file can be obtained from this URL.
The output is:
2012୍8ᄅ ACTA AUTOMATICA SINICA August, 2012
م
ᇛ ਟ1ࡹ1ྷ ೦2ᅦ ม1
ᅋေم, ྛऊো ,ۋ, ০Ⴈ
......
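To help tell mis-mapped code points apart from a terminal or font rendering problem, the extracted text can be grouped by Unicode script. The following is only a diagnostic sketch, not part of pypdf: `script_histogram` is a hypothetical helper, and `sample` is an illustrative string written with explicit escapes rather than the exact output above. If a page that should contain only Latin and Chinese text reports scripts such as ORIYA, HANGUL, or ARABIC, the characters themselves are probably wrong, which would point to the font's character mapping rather than to the display.

```python
import unicodedata
from collections import Counter


def script_histogram(text: str) -> Counter:
    """Count non-whitespace characters by the first word of their Unicode name.

    The first word of the name (e.g. LATIN, CJK, ORIYA, HANGUL) is used as a
    rough stand-in for the script; this is a heuristic, not a real script lookup.
    """
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        # Code points without a name (e.g. control characters) fall back to "UNKNOWN".
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    return counts


# Illustrative sample only (assumption): an Oriya sign and a Hangul jamo
# mixed into otherwise Latin text, similar in kind to the output above.
sample = "2012\u0b4d8\u1105 ACTA AUTOMATICA SINICA August, 2012"
print(script_histogram(sample).most_common())
```

In the reproduction script, the same helper could be called as `script_histogram(text)` on the result of `page.extract_text()`.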