Replies: 1 comment
This issue is no longer reproducible:
I'm writing a script to automate extracting data from PDFs I receive. I'm using PyPDF2 to read the PDFs and extract the text for parsing. I've tested PDFs in two different formats. The script works perfectly for the first format, but with the second format I get an indexing error (below). After troubleshooting, I've found that the issue is caused by the font used in the second format: it uses Roboto, while the first, successful format uses Arial.
I've attached stripped-down versions of the PDFs involved: one in Roboto, which triggers the error, and one that I manually changed to Arial.
test_pdf_arial.pdf
test_pdf_roboto.pdf
The snippet of code here is where I'm running into the issue:
The indexing error I'm getting is:
I've found that if I take the exact same PDF and change nothing but the font, from Roboto to Arial, PyPDF2 has no problem extracting the text. I've searched online and in the PyPDF2 documentation, but I can't find any way to get it to extract text set in Roboto, or to add the Roboto font to the PyPDF2 font library.
I'd really appreciate it if anyone could provide some advice on how to solve this issue.
Note: manually changing the font from Roboto to Arial isn't a desirable option as I receive hundreds of these invoices monthly.
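Since the invoices arrive in bulk, one stopgap is to catch the failure per page so that a single Roboto file does not abort the whole run. The following is only a minimal sketch, not the original snippet: the helper names `extract_text_per_page` and `extract_pdf` are hypothetical, and it assumes the "indexing error" is a Python `IndexError` raised from the page's `extract_text()` method.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_per_page(pages):
    """Extract text from each page, recording failures instead of raising.

    `pages` is any iterable of objects with an `extract_text()` method,
    such as the pages of a PyPDF2 `PdfReader`.
    Returns (texts, failed): one string per page, plus the indexes of the
    pages whose extraction raised IndexError.
    """
    texts, failed = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text() or "")
        except IndexError as exc:  # assumption: this is the error type seen
            logger.warning("page %d: text extraction failed: %r", index, exc)
            texts.append("")
            failed.append(index)
    return texts, failed


def extract_pdf(path):
    """Hypothetical convenience wrapper around a real PDF file."""
    # Imported lazily so the helper above can be used without PyPDF2.
    from PyPDF2 import PdfReader

    return extract_text_per_page(PdfReader(path).pages)
```

Calling `extract_pdf("test_pdf_roboto.pdf")` would then return empty strings for the pages that fail, along with their indexes, so the affected invoices can be set aside for manual review or for retrying with a newer release of the library.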