Detecting and Cleaning Table Data from PDFs Using Deep Learning and PyTesseract

This project focuses on:

Training a deep learning model to detect tabular data in PDFs.
Detecting and extracting the tables in a specific PDF file with complex tables.
Cleaning the extracted data.

The detector is fine-tuned from the Detectron2 configuration "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml". I built the training dataset myself; it contains 25 images in which the table columns are annotated. PyTesseract is used to recognize the text.

Note

PyTesseract sometimes misreads words, so the OCR output may need manual verification before serious use.
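One way to narrow down what needs manual checking is to look at the per-word confidence that Tesseract reports. The sketch below is not part of this project; it assumes the OCR result was obtained with `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which returns a dictionary with parallel `text` and `conf` lists. The function name and the threshold of 60 are illustrative assumptions.

```python
def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below `threshold`.

    `data` is assumed to be the dictionary produced by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # depending on the version, conf may be a string
        # Entries with conf == -1 are layout boxes, not recognized words.
        if text.strip() and 0 <= conf < threshold:
            flagged.append((text, conf))
    return flagged


# Hand-written sample in the same shape as the OCR dictionary (not real output).
sample = {"text": ["", "Total", "1O0", " "], "conf": ["-1", 96.5, "41", "95"]}
suspicious = low_confidence_words(sample)
```

Words returned by this function are candidates for a manual look; a high confidence value does not guarantee that a word was read correctly.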

Running this project

To run this project, follow these steps:

1. Make sure Conda (or Mamba) and Tesseract are installed:

$ mamba --version
   # or
$ conda --version
$ which tesseract

If Tesseract is installed, the last command prints its path, for example /usr/local/bin/tesseract (macOS) or C:\Program Files\Tesseract-OCR\tesseract.exe (Windows). On Windows, use `where tesseract` instead of `which tesseract`.

If it is not installed, follow these steps to install it:

On macOS:
1. Install Tesseract using Homebrew (recommended):
```
$ brew install tesseract
```
2. Verify the installation:
```
$ tesseract --version
```
It should display something like tesseract 5.x.x.

On Windows:
1. Download the installer from the official Tesseract page (Official Page Tesseract). Choose the latest version available.
2. Run the installer and follow the setup instructions. Make sure to check the "Add Tesseract to PATH" option during installation.
3. Verify the installation:
```
$ tesseract --version
```
It should display something like tesseract 5.x.x.
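The same check can be done from Python, which is convenient because PyTesseract only works if it can find the Tesseract executable. This is a minimal sketch using the standard library only; the function names are illustrative and not part of this project.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH.")
    else:
        print(path, "->", tesseract_version(path))
```

If Tesseract is installed but not on PATH, PyTesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.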

2. Clone this repository

3. Create and activate environment

$ mamba env create -f environment.yml
$ conda activate final_project_btb

4. Download data

You have two options to download the data:

Via Google Drive: Open this link (https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing) and click "Download".
Via Dropbox: Open this link (https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0) and click "Download".

5. Place data in the data folder of src/final_project_btb

Path: final-project-s33btorr/src/final_project_btb/data
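Before running the pipeline, it can help to confirm that the downloaded file ended up in the right place. The sketch below is an illustration, not project code; it assumes the downloaded file keeps the name model_final-2.pth used elsewhere in this README, and the helper name `find_model` is hypothetical.

```python
from pathlib import Path

MODEL_FILENAME = "model_final-2.pth"  # file name as given in this README


def find_model(data_dir):
    """Return the path of the model file inside `data_dir`, or raise an error."""
    path = Path(data_dir) / MODEL_FILENAME
    if not path.is_file():
        raise FileNotFoundError(
            f"{MODEL_FILENAME} not found in {data_dir}; download it and place it there."
        )
    return path


if __name__ == "__main__":
    # Relative to the repository root; adjust if you run it from elsewhere.
    try:
        print(find_model(Path("src") / "final_project_btb" / "data"))
    except FileNotFoundError as error:
        print(error)
```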

6. Run Pytask command

$ pytask

Short explanation of the project

Motivation

My motivation for this project stems from the fact that I could not find any pre-trained model, software, or package that could accurately read the table I needed given its complexity. Therefore, I trained a model using images similar to those I need to extract, allowing me to automate the extraction of a large number of pages in the future. With other programs, this process would take hours and result in a significant number of errors.

Overview

In this project, I have trained a deep learning model to detect the columns of a table from scanned PDFs using the Roboflow dataset I generated. After training, the model can identify the positions of different table columns. The extracted data is then processed and cleaned for analysis.

Dataset

You can access the dataset used for training via the following link: Roboflow Dataset

Training the Model

To train the model, I used the following approach:

The model was trained using a GPU provided by Google Colab.
The model was saved after training as model_final-2.pth.
The model is capable of detecting the columns in the table from a specific scanned PDF.
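For orientation, fine-tuning this Detectron2 configuration typically looks like the sketch below. It is not the notebook's actual code: the dataset name passed in, the single "column" class, the batch size, and the iteration count are all assumptions, and the dataset must already be registered with Detectron2 (for example with `register_coco_instances`).

```python
import os

CONFIG_NAME = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"  # named in this README


def build_training_config(train_dataset, num_classes=1, max_iter=1000, output_dir="output"):
    """Build a Detectron2 config for fine-tuning (all default values are assumptions)."""
    # Imported lazily so this file can be read without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(CONFIG_NAME))
    # Start from the COCO-pretrained weights of the same configuration.
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG_NAME)
    cfg.DATASETS.TRAIN = (train_dataset,)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes  # assumed: one "column" class
    cfg.SOLVER.IMS_PER_BATCH = 2  # assumed batch size
    cfg.SOLVER.MAX_ITER = max_iter  # assumed training length
    cfg.OUTPUT_DIR = output_dir
    return cfg


def train(cfg):
    """Run training; Detectron2 writes model_final.pth into cfg.OUTPUT_DIR."""
    from detectron2.engine import DefaultTrainer

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```

With only 25 training images, starting from pretrained weights rather than training from scratch is the usual choice; the notebook linked in this README remains the reference for what was actually run.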

The code used to train the model is available in this notebook, which you can view, download, and modify: Training Model Notebook

Making Predictions

Once the model is trained, it is saved as model_final-2.pth. Making predictions then involves two steps:

Extract the text using PyTesseract. I noticed that PyTesseract leaves a blank cell whenever the text moves to a new line, which can be used to determine the boundaries of each row in the table.
Predict the column positions in new PDF tables with a similar structure, using the saved model. These predictions indicate which column each piece of text belongs to.
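The idea of combining the two outputs can be sketched as follows. This is an illustration under assumptions, not the project's implementation: column boxes are taken to be `(x1, y1, x2, y2)` tuples such as those in a Detectron2 prediction, each OCR word is a dictionary with `text`, `left`, and `width` (the field names PyTesseract uses), and a word is assigned to the column that contains its horizontal center.

```python
def sort_columns(boxes):
    """Turn (x1, y1, x2, y2) boxes into (x_min, x_max) ranges ordered left to right."""
    return sorted((x1, x2) for x1, _, x2, _ in boxes)


def assign_column(word_left, word_width, column_ranges):
    """Return the index of the column containing the word's center, or None."""
    center = word_left + word_width / 2
    for index, (x_min, x_max) in enumerate(column_ranges):
        if x_min <= center <= x_max:
            return index
    return None  # the word falls outside every predicted column


def words_to_row(words, column_ranges):
    """Group the words of one text line into one string per column."""
    cells = [[] for _ in column_ranges]
    for word in words:
        column = assign_column(word["left"], word["width"], column_ranges)
        if column is not None:
            cells[column].append(word["text"])
    return [" ".join(cell) for cell in cells]


# Made-up boxes and words, only to show the shapes involved.
columns = sort_columns([(210, 0, 300, 900), (0, 0, 100, 900), (110, 0, 200, 900)])
line = [
    {"text": "Acme", "left": 10, "width": 40},
    {"text": "Ltd", "left": 55, "width": 30},
    {"text": "1,200", "left": 230, "width": 50},
]
row = words_to_row(line, columns)
```

Using the word's center instead of its left edge makes the assignment more tolerant of boxes that are slightly too narrow; words outside every column are dropped here, which may or may not be what you want.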

Cleaning the Data

After extracting the data, the following cleaning steps were performed:

Handling Missing Values: If the first column of a row is empty, that row is a continuation of the previous one, so the two rows are merged.
Numeric Fields: Cleaned numeric fields so that they are in a format suitable for analysis (e.g., removing non-numeric characters).
CSV and Graph Generation: Generated the CSV file and some basic graphs to visualize the cleaned data.
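The first two cleaning steps can be sketched in plain Python as follows. This is a simplified illustration, not the project's code: rows are assumed to be lists of strings of equal length, and numbers are assumed to use a comma as the thousands separator and a dot as the decimal separator.

```python
import re


def merge_continuation_rows(rows):
    """Merge rows whose first cell is empty into the row before them."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():
            previous = merged[-1]
            for index, cell in enumerate(row):
                if cell.strip():
                    previous[index] = (previous[index] + " " + cell.strip()).strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


def clean_number(text):
    """Strip non-numeric characters and return a float, or None if nothing is left."""
    digits = re.sub(r"[^0-9.\-]", "", text)  # assumes "," is a thousands separator
    try:
        return float(digits)
    except ValueError:
        return None


# Made-up rows, only to show the behavior.
raw_rows = [
    ["Acme", "Steel", "1,200"],
    ["", "pipes", ""],
    ["Bolt Co", "Screws", "$ 35.50"],
]
clean_rows = merge_continuation_rows(raw_rows)
amounts = [clean_number(row[2]) for row in clean_rows]
```

The cleaned rows can then be written out with the standard `csv` module or loaded into a data frame for plotting.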

Possible Issues:

You may run into issues when running this project on a different computer. The project was developed on my computer, which runs macOS 12.7.6.

If the problem is caused by pytorch, torchvision, or torchaudio, you can try editing the environment.yml file, removing pytorch from the list of dependencies, and adding it to channels. If you encounter problems with Detectron2, consult its website, which explains the requirements for installing the package.

References

Code for training model:

Shen, Zejiang, Zhang, Kaixuan, & Dell, Melissa. (2020). "A Large Dataset of Historical Japanese Documents with Complex Layouts" arXiv:2004.08686

Model used for training:

Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. Retrieved from https://github.com/facebookresearch/detectron2.

Template used:

Template
