Open
Description
Hi there, we're trying to utilize this cool library to extract text for some processing, but it seems it failed on the attached PDF. It contains some Traditional Chinese characters but the output looks like some random characters.
Looks like this PDF is utilizing CFF based CIDFontType0C as subtype, wondering if that's not currently supported by pypdf? Let us know if there's anything we can help as well. Not super familiar but happy to help out.
Environment
Which environment were you using when you encountered the problem?
$ python3 -m platform
macOS-13.5-x86_64-i386-64bit
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=none
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader("caibao.pdf")
number_of_pages = len(reader.pages)
for i in range(number_of_pages):
page = reader.pages[i]
text = page.extract_text()
print(text)
Output:
2ࣨ
˜
ɚཧɚɧϋ
ʬ˜ɧɤ˚ɚཧɚɚϋ
ʬ˜ɧɤ˚ Νˢᜊਗ
ৰ̮�
ϗɝ 299,194 269,505 11%
ˣл 139,022 114,941 21%
л 80,729 67,284 20%
л 53,417 42,963 24%
л 52,009 42,032 24%
ɛ͏࿆ʩ�
Ñਿ͉ 5.486 4.407 24%
Ñᛅᑛ 5.334 4.320 23%
л 98,511 73,205 35%
л 70,086 53,684 31%
ɛ͏࿆ʩ�
Ñਿ͉ 7.393 5.628 31%
Ñᛅᑛ 7.236 5.516 31%
PDF:
caibao.pdf