PDF Parser

A text extraction tool for client-specific PDF documents, written purely in Python 3 with no external dependencies.

Parsing and extracting text from a PDF can be treated as a five-step process, carried out in the following order once the PDF file has been decompressed:

1. Traverse the PDF logical tree to find all Page objects, in a way that guarantees the correct page ordering. Then, for each Page:
2. Retrieve the font information and the ToUnicode table.
3. Retrieve the contents.
4. Decode the contents.
5. Position the text in the correct order.
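
As an illustration of the first step, the following is a minimal sketch of an ordered page-tree walk. It assumes the PDF objects have already been parsed into a plain dictionary keyed by object number. The `objects` layout and the `collect_pages` helper are hypothetical and are not taken from `main.py` or `pdf_parser_classes.py`. The idea is that `/Kids` is an ordered array, so a depth-first walk in array order visits pages in document order.

```python
def collect_pages(objects, node_id):
    """Return the ids of all Page objects under node_id, in document order."""
    node = objects[node_id]
    if node["Type"] == "Page":
        return [node_id]
    pages = []
    # Intermediate /Pages nodes list their children in an ordered /Kids array,
    # so visiting the children in array order preserves the page order.
    for kid_id in node.get("Kids", []):
        pages.extend(collect_pages(objects, kid_id))
    return pages


# Hypothetical, already-parsed object table: Catalog -> Pages -> (Pages, Page)
objects = {
    1: {"Type": "Catalog", "Pages": 2},
    2: {"Type": "Pages", "Kids": [3, 6]},
    3: {"Type": "Pages", "Kids": [4, 5]},
    4: {"Type": "Page"},
    5: {"Type": "Page"},
    6: {"Type": "Page"},
}

page_ids = collect_pages(objects, objects[1]["Pages"])
print(page_ids)  # [4, 5, 6]
```

Each id in `page_ids` would then be handed to the per-page steps (fonts, contents, decoding, positioning).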

Usage:

python3 main.py [decompressed_pdf_file_name].txt > [out_file_name].txt
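
The command above expects an input file that has already been decompressed, and this README does not say how that file is produced. For readers unfamiliar with what decompression means here, the sketch below inflates Flate-compressed stream bodies using the standard `zlib` module. The `inflate_streams` helper is hypothetical, it assumes every compressed stream uses `/FlateDecode`, and its output is not claimed to be in the format `main.py` reads.

```python
import re
import zlib

# A stream body sits between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.S)


def inflate_streams(pdf_bytes):
    """Return the inflated bodies of all zlib-compressed streams found."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (for example an uncompressed stream): skip it.
            continue
    return inflated


# Tiny hand-built example, not a complete PDF file.
sample = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"BT (Hello) Tj ET")
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(sample))
```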

Note:

The Python script extracts text from a document; it does not recognize text in images.
The implementation is optimized for parsing PDFs with CID ("Type0") fonts, where the fonts are explicitly referenced by the Page object and the ToUnicode tables are embedded within the PDF.
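
To make the role of the ToUnicode table concrete, here is a minimal sketch of how character codes in a content stream can be mapped back to text. It assumes 2-byte character codes and handles only `beginbfchar` sections (`beginbfrange` is left out for brevity). The `parse_bfchar` and `decode_hex_string` names are hypothetical and do not come from this repository.

```python
import re


def parse_bfchar(cmap_text):
    """Build a {character code: unicode string} map from bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # Destination values are stored as UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_string, mapping, code_bytes=2):
    """Decode a hex string operand such as <00010002> via the mapping."""
    digits = hex_string.strip("<>")
    step = code_bytes * 2  # two hex digits per byte
    codes = [int(digits[i:i + step], 16) for i in range(0, len(digits), step)]
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Hand-written ToUnicode fragment for illustration only.
cmap = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

table = parse_bfchar(cmap)
print(decode_hex_string("<00010002>", table))  # Hi
```

Without an embedded ToUnicode table there is nothing to build `table` from, which is one reason the note above restricts the tool to PDFs that embed it.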
