Text extraction tool for client-specific PDF documents. Written purely in Python 3 with no external dependencies.
Parsing and extracting text from a PDF can be treated as a 5-step process, carried out in the following order once the PDF file has been decompressed:
- Traverse the PDF logical tree to find all Page objects, in a way that guarantees the correct page ordering. Then, for each Page:
- Retrieve the font information and the ToUnicode table.
- Retrieve the contents.
- Decode the contents.
- Position the text in the right order.
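The first step, finding the Page objects in order, can be sketched as a depth-first walk of the page tree that follows each `Kids` array in the order it is listed. This is only an illustration, not the code in `main.py`: the `objects` dictionary, its key names, and the `iter_pages` helper are assumptions standing in for whatever structure the script builds from the decompressed file.

```python
def iter_pages(objects, node_id):
    """Yield Page object ids in document order (depth-first, in Kids order)."""
    node = objects[node_id]
    if node["Type"] == "Page":
        # Leaf of the page tree: an actual page.
        yield node_id
    elif node["Type"] == "Pages":
        # Intermediate node: visit children in the order listed in Kids.
        for kid_id in node["Kids"]:
            yield from iter_pages(objects, kid_id)


# Hypothetical, already-parsed objects keyed by object number.
objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 6], "Count": 3},
    3: {"Type": "Pages", "Kids": [4, 5], "Count": 2},
    4: {"Type": "Page", "Parent": 3},
    5: {"Type": "Page", "Parent": 3},
    6: {"Type": "Page", "Parent": 2},
}

root_pages = objects[1]["Pages"]
page_ids = list(iter_pages(objects, root_pages))
print(page_ids)  # [4, 5, 6]
```

Walking `Kids` in listed order, rather than scanning the file for every object of type Page, is what keeps the pages in reading order, since objects may appear in the file in any order.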
Run the script on the decompressed PDF file and redirect the output to a text file:
python3 main.py [decompressed_pdf_file_name].txt > [out_file_name].txt
- The Python script extracts text from a document; it does not recognize text inside images.
- The implementation here is optimized for parsing PDFs with CID ("Type0") fonts, where the fonts are explicitly referenced by the Page object and the ToUnicode tables are embedded within the PDF.
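To illustrate how an embedded ToUnicode table is used with a Type0 font, here is a minimal sketch that builds a code-to-Unicode map from the `bfchar` and `bfrange` sections of a CMap and applies it to a hex string taken from a content stream. The function names, the sample CMap, and the fixed 2-byte code width are assumptions made for the example, not the script's actual implementation. The sketch only handles the simple `bfrange` form with a single destination code point and ignores the array form.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {character code: unicode string} map from bfchar/bfrange sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            # Destinations are UTF-16BE encoded code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            start = int(dst, 16)
            # Consecutive codes map to consecutive code points.
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a hex string operand, assuming fixed-width character codes."""
    raw = bytes.fromhex(hex_string)
    codes = (
        int.from_bytes(raw[i:i + code_bytes], "big")
        for i in range(0, len(raw), code_bytes)
    )
    # Unknown codes become U+FFFD instead of raising an error.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Hypothetical ToUnicode CMap fragment.
sample_cmap = """
2 beginbfchar
<0003> <0020>
<0012> <002D>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
"""

mapping = parse_tounicode(sample_cmap)
print(decode_hex_string("0024002500030026", mapping))  # AB C
```

Without the ToUnicode table, the character codes in the content stream are only glyph identifiers, so a PDF that omits the table would need a different decoding strategy.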