Description
I need to extract text from a PDF document using the page.extract_text function, but all of the extracted Chinese characters are garbled. I suspect the cause is that the document uses special embedded Chinese fonts such as /TJQCZS+FzBookMaker2DlFont. I stepped through the pypdf source code in a debugger, and in the /Font->/Encoding->/Differences table the character codes are mapped to non-standard glyph names, as follows:
{'/Differences': [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27', 40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C', 45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31'], '/Type': '/Encoding'}
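To make the table above easier to read, here is a minimal sketch that expands a /Differences array into a code-to-glyph-name map. It assumes the usual PDF layout of the array (an integer start code followed by one or more glyph names, each name taking the next code). The helper name parse_differences is hypothetical and not part of pypdf.

```python
def parse_differences(differences):
    """Expand a /Differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer resets the running character code.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name found before any start code")
            # Each glyph name is assigned the current code, then the code advances.
            mapping[code] = str(item)
            code += 1
    return mapping


# The array reported in this issue.
differences = [35, '/G23', 36, '/G24', 37, '/G25', 38, '/G26', 39, '/G27',
               40, '/G28', 41, '/G29', 42, '/G2A', 43, '/G2B', 44, '/G2C',
               45, '/G2D', 46, '/G2E', 47, '/G2F', 48, '/G30', 49, '/G31']

mapping = parse_differences(differences)

# Check whether every glyph name is just "G" plus the code in hexadecimal.
names_are_hex_codes = all(
    name == f"/G{code:02X}" for code, name in mapping.items()
)
print(names_are_hex_codes)
```

In this array every glyph name is simply the character code written in hexadecimal (35 is 0x23, hence /G23), so the names themselves do not appear to say which Chinese character each glyph represents.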
The font file under /Font->/FontDescriptor->/FontFile3 is decoded using the specified /Filter: /FlateDecode, but the decoded font file also appears garbled.
Since Adobe Acrobat can display the text correctly, there must be another way to handle this. I am not very familiar with the structure and protocols of PDF documents, so I am unsure how to resolve this issue.
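One way to narrow this down is to check whether each font carries a /ToUnicode CMap. Text extraction generally relies on it when the encoding uses non-standard glyph names, whereas a viewer only needs the embedded glyph outlines to draw the page. Below is a minimal sketch, assuming the font dictionaries have already been obtained as mappings (with pypdf they can usually be reached through page["/Resources"]["/Font"], resolving indirect references with get_object()). The helper font_report, the resource name /F1, and the sample dictionary are hypothetical illustrations, not data read from the attached PDF.

```python
from collections.abc import Mapping


def font_report(fonts):
    """Summarise which fonts have /ToUnicode and /Differences entries.

    `fonts` maps resource names (e.g. "/F1") to font dictionaries.
    """
    report = {}
    for name, font in fonts.items():
        encoding = font["/Encoding"] if "/Encoding" in font else None
        report[name] = {
            "base_font": str(font["/BaseFont"]) if "/BaseFont" in font else None,
            "has_to_unicode": "/ToUnicode" in font,
            # /Encoding may be a plain name (e.g. /WinAnsiEncoding) or a dictionary.
            "has_differences": isinstance(encoding, Mapping)
            and "/Differences" in encoding,
        }
    return report


# Illustrative data only: shaped like the font described in this issue,
# with the ASSUMPTION that it has no /ToUnicode entry.
sample_fonts = {
    "/F1": {
        "/BaseFont": "/TJQCZS+FzBookMaker2DlFont",
        "/Encoding": {"/Type": "/Encoding", "/Differences": [35, "/G23"]},
    }
}

for name, info in font_report(sample_fonts).items():
    print(name, info)
```

If a font reports has_differences as true and has_to_unicode as false, the PDF itself may not contain enough information to recover the original characters, and the garbled output would not necessarily be a bug in the extraction code.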
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Windows-10-10.0.19044-SP0
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.2'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
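from pypdf import PdfReader

# Path to the attached PDF file
pdf_path = "GB+15322.2-2019.pdf"

reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)
print(f"Number of pages: {number_of_pages}")
for i in range(number_of_pages):
    # Only inspect the fourth page (index 3), where the garbled text appears
    if i != 3:
        continue
    page = reader.pages[i]
    text = page.extract_text()
    print(text[:5000])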
Share here the PDF file(s) that cause the issue.
GB+15322.2-2019.pdf
Traceback
This is the complete traceback I see: