PyPDF2 Font Read Issue #1429

Smythium · 2022-11-10T15:30:41Z

I'm writing a script to automate extracting data from PDFs I receive. I'm using PyPDF2 to read the PDFs and extract the text for interpretation. I've tested PDFs in two different formats. The script works perfectly for the first format, but with the second format I get an IndexError (traceback below). After troubleshooting, I've found that the issue is tied to the font used in the second format: it uses Roboto, while the first, successful format uses Arial.

I've attached stripped-down versions of the PDFs that are causing issues: one in Roboto and one that I manually changed to Arial.

test_pdf_arial.pdf
test_pdf_roboto.pdf

This is the snippet where I run into the issue:

import PyPDF2

pdf_roboto = r"C:\Users\Robert.Smyth\Python\test_pdf_roboto.pdf"
pdf_arial = r"C:\Users\Robert.Smyth\Python\test_pdf_arial.pdf"

reader = PyPDF2.PdfFileReader(pdf_roboto)
pageObj = reader.pages[0]
pages_text = pageObj.extractText()

The indexing error I'm getting is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
C:\Users\ROBERT~1.SMY\AppData\Local\Temp/ipykernel_22076/669450932.py in <module>
      1 reader = PyPDF2.PdfFileReader(pdf_roboto)
      2 pageObj = reader.pages[0]
----> 3 pages_text = pageObj.extractText()

~\Anaconda3\lib\site-packages\PyPDF2\_page.py in extractText(self, Tj_sep, TJ_sep)
   1539         """
   1540         deprecate_with_replacement("extractText", "extract_text")
-> 1541         return self.extract_text()
   1542 
   1543     def _get_fonts(self) -> Tuple[Set[str], Set[str]]:

~\Anaconda3\lib\site-packages\PyPDF2\_page.py in extract_text(self, Tj_sep, TJ_sep, orientations, space_width, *args)
   1511             orientations = (orientations,)
   1512 
-> 1513         return self._extract_text(
   1514             self, self.pdf, orientations, space_width, PG.CONTENTS
   1515         )

~\Anaconda3\lib\site-packages\PyPDF2\_page.py in _extract_text(self, obj, pdf, orientations, space_width, content_key)
   1144         if "/Font" in resources_dict:
   1145             for f in cast(DictionaryObject, resources_dict["/Font"]):
-> 1146                 cmaps[f] = build_char_map(f, space_width, obj)
   1147         cmap: Tuple[Union[str, Dict[int, str]], Dict[str, str], str] = (
   1148             "charmap",

~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in build_char_map(font_name, space_width, obj)
     20     space_code = 32
     21     encoding, space_code = parse_encoding(ft, space_code)
---> 22     map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
     23 
     24     # encoding can be either a string for decode (on 1,2 or a variable number of bytes) of a char table (for 1 byte only for me)

~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in parse_to_unicode(ft, space_code)
    187     cm = prepare_cm(ft)
    188     for l in cm.split(b"\n"):
--> 189         process_rg, process_char = process_cm_line(
    190             l.strip(b" "), process_rg, process_char, map_dict, int_entry
    191         )

~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in process_cm_line(l, process_rg, process_char, map_dict, int_entry)
    247         process_char = False
    248     elif process_rg:
--> 249         parse_bfrange(l, map_dict, int_entry)
    250     elif process_char:
    251         parse_bfchar(l, map_dict, int_entry)

~\Anaconda3\lib\site-packages\PyPDF2\_cmap.py in parse_bfrange(l, map_dict, int_entry)
    256     lst = [x for x in l.split(b" ") if x]
    257     a = int(lst[0], 16)
--> 258     b = int(lst[1], 16)
    259     nbi = len(lst[0])
    260     map_dict[-1] = nbi // 2

IndexError: list index out of range
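The last frame of the traceback can be reproduced in isolation. The sketch below copies the three statements shown in that frame into a hypothetical helper, `parse_range_bounds`, and assumes the line has already been reduced to space-separated hex tokens, which is what the `int(..., 16)` calls imply. The idea that the Roboto file's character map places a range entry's tokens on separate lines is an assumption, not something confirmed in this thread.

```python
from typing import Tuple


def parse_range_bounds(line: bytes) -> Tuple[int, int]:
    """Hypothetical stand-in for the first statements of parse_bfrange in the traceback."""
    tokens = [tok for tok in line.split(b" ") if tok]
    first = int(tokens[0], 16)
    last = int(tokens[1], 16)  # raises IndexError if the line holds a single token
    return first, last


def raises_index_error(line: bytes) -> bool:
    """Return True when the line triggers the same IndexError as in the traceback."""
    try:
        parse_range_bounds(line)
    except IndexError:
        return True
    return False


# A complete entry on one line: start code, end code, target (hex tokens).
print(parse_range_bounds(b"0003 0003 0020"))
# An entry whose remaining tokens sit on a following line (assumed cause).
print(raises_index_error(b"0003"))
```

Any line that reaches this code with fewer than two tokens fails the same way, so the error points at how the font's character map is laid out in the file rather than at the glyphs of the font itself.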

I've found that if I take the exact same PDF and change only the font from Roboto to Arial, PyPDF2 has no problem extracting the text. I've searched online and in the PyPDF2 documentation, but I can't find a way to get it to extract text set in Roboto, or to add the Roboto font to PyPDF2's font library.

I'd really appreciate it if anyone could provide some advice on how to solve this issue.

Note: manually changing the font from Roboto to Arial isn't a viable option, as I receive hundreds of these invoices monthly.

MartinThoma · 2023-11-14T10:51:16Z

Maintainer

This issue is no longer reproducible:

from pypdf import PdfReader

pdf_roboto = r"test_pdf_roboto.pdf"
pdf_arial = r"test_pdf_arial.pdf"

reader = PdfReader(pdf_arial)
page = reader.pages[0]
pages_text = page.extract_text()
print(pages_text)
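The snippet above opens only `pdf_arial`, so confirming the fix for the file that originally failed means running the Roboto file as well. A minimal sketch that checks both attachments and records any exception type instead of stopping at the first failure is shown here. The helper names `first_page_text` and `check_files` are hypothetical; only `PdfReader`, `reader.pages`, and `extract_text()` come from pypdf, and the file names are the ones attached to the original post.

```python
from typing import Callable, Dict, Iterable


def first_page_text(path: str) -> str:
    """Extract text from the first page using pypdf (imported lazily)."""
    from pypdf import PdfReader  # requires the pypdf package to be installed

    reader = PdfReader(path)
    return reader.pages[0].extract_text()


def check_files(
    paths: Iterable[str],
    extract: Callable[[str], str] = first_page_text,
) -> Dict[str, str]:
    """Map each path to "ok" or to the name of the exception it raised."""
    results = {}
    for path in paths:
        try:
            extract(path)
            results[path] = "ok"
        except Exception as exc:  # broad on purpose: report every kind of failure
            results[path] = type(exc).__name__
    return results


if __name__ == "__main__":
    print(check_files(["test_pdf_roboto.pdf", "test_pdf_arial.pdf"]))
```

The same loop could be pointed at a whole folder of invoices, so that a single unreadable file is reported rather than aborting the batch.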
