iame-uni-bonn/final-project-s33btorr

Detecting and Cleaning Table Data from PDFs Using Deep Learning and PyTesseract

This project focuses on:

  1. Training a deep learning model to detect tabular data on PDFs.
  2. Detecting and extracting data from a specific PDF file with complex tables.
  3. Cleaning the extracted data.

It uses the "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml" model from Detectron2 as the starting point for training. I built the training dataset myself; it contains 25 images in which the table columns have been marked. PyTesseract is used to read the text.

Note

PyTesseract sometimes misreads words. Verify the OCR output manually before relying on it for serious use.

Running this project

To run this project, follow these steps:

1. Make sure Conda (or Mamba) and Tesseract are installed:

$ mamba --version
   # or
$ conda --version
$ which tesseract

If Tesseract is installed, the last command prints its path, for example /usr/local/bin/tesseract (macOS) or C:\Program Files\Tesseract-OCR\tesseract.exe (Windows). In the Windows Command Prompt, use where tesseract instead of which tesseract.

If it is not installed, follow the steps below to install it:

  • On macOS:

    1. Install Tesseract using Homebrew (recommended):
    $ brew install tesseract
    2. Verify the installation:
    $ tesseract --version

    It should display something like tesseract 5.x.x.

  • On Windows:

    1. Download the installer from the official Tesseract page (Official Page Tesseract). Choose the latest version available.
    2. Run the installer and follow the setup instructions. Make sure to check the "Add Tesseract to PATH" option during installation.
    3. Verify the installation:
    $ tesseract --version

    It should display something like tesseract 5.x.x.
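The same check can also be done from Python, which is convenient because the rest of the project runs there. The following is a minimal sketch using only the standard library; the function names `find_tesseract` and `tesseract_version` are hypothetical helpers and are not part of this repository.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path of the tesseract executable found on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(path, "|", tesseract_version(path))
```

If Tesseract is installed but not on PATH, pytesseract can be pointed at the executable by setting `pytesseract.pytesseract.tesseract_cmd` to its full path.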

2. Clone this repository

3. Create and activate environment

$ mamba env create -f environment.yml
$ conda activate final_project_btb

4. Download data

You have two options to download the data:

  1. Via Google Drive: Click on this link (https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing) and then click "Download".
  2. Via Dropbox: Click on this link (https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0) and then click "Download".

5. Place data in the data folder of src/final_project_btb

Path: final-project-s33btorr/src/final_project_btb/data
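Before running the pipeline, it can help to confirm that the file landed in the right folder. The sketch below assumes the downloaded file keeps the name model_final-2.pth (the name that appears in the Dropbox link) and that the script is run from the repository root; `check_model_file` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# Assumption: the download keeps the file name shown in the Dropbox link.
MODEL_FILENAME = "model_final-2.pth"


def model_path(repo_root="."):
    """Build the expected location of the model file inside the repository."""
    return Path(repo_root) / "src" / "final_project_btb" / "data" / MODEL_FILENAME


def check_model_file(repo_root="."):
    """Return (exists, path) for the expected model file."""
    path = model_path(repo_root)
    return path.is_file(), path


if __name__ == "__main__":
    exists, path = check_model_file(".")
    print(f"{path}: {'found' if exists else 'missing'}")
```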

6. Run Pytask command

$ pytask

Short explanation of the project

Motivation

My motivation for this project stems from the fact that I could not find any pre-trained model, software, or package that could accurately read the table I needed given its complexity. Therefore, I trained a model using images similar to those I need to extract, allowing me to automate the extraction of a large number of pages in the future. With other programs, this process would take hours and result in a significant number of errors.

Overview

In this project, I have trained a deep learning model to detect the columns of a table from scanned PDFs using the Roboflow dataset I generated. After training, the model can identify the positions of different table columns. The extracted data is then processed and cleaned for analysis.

Dataset

You can access the dataset used for training via the following link: Roboflow Dataset

Training the Model

To train the model, I used the following approach:

  1. The model was trained using a GPU provided by Google Colab.
  2. The model was saved after training as model_final-2.pth.
  3. The model is capable of detecting the columns in the table from a specific scanned PDF.
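For readers unfamiliar with Detectron2, the sketch below shows how a training configuration is typically built from the model named in this README. Only the config name comes from this project. The dataset name, number of classes, iteration count, and output directory are illustrative assumptions, and the dataset is assumed to be registered already (for example with `detectron2.data.datasets.register_coco_instances`). The notebook linked below remains the reference for what was actually run.

```python
# Config name taken from this README; everything else is an illustrative assumption.
CONFIG_NAME = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def build_training_cfg(train_dataset_name, num_classes=1, max_iter=1000,
                       output_dir="output"):
    """Return a Detectron2 config that fine-tunes the COCO-pretrained model."""
    # Imported lazily so this module can be loaded without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(CONFIG_NAME))
    # Start from the COCO-pretrained weights instead of training from scratch.
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG_NAME)
    cfg.DATASETS.TRAIN = (train_dataset_name,)
    cfg.DATASETS.TEST = ()
    # Hypothetical: a single "column" class; adjust to the labels in the dataset.
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    cfg.SOLVER.MAX_ITER = max_iter
    cfg.OUTPUT_DIR = output_dir
    return cfg
```

With such a config, training is usually started through `detectron2.engine.DefaultTrainer`: create `DefaultTrainer(cfg)`, call `resume_or_load(resume=False)`, then `train()`. A GPU runtime such as the Google Colab one mentioned above makes this considerably faster.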

You can view, download, and modify the code used to train the model in this notebook: Training Model Notebook

Making Predictions

Once the model is trained, it is saved as model_final-2.pth. The prediction step then works as follows:

  1. Extract the text using PyTesseract. I noticed that PyTesseract leaves a blank cell whenever the text goes to a new line, which can be used to determine the boundaries of each row in the table.
  2. Use the saved model to predict column positions in new PDF tables with a similar structure. These predictions show which column each piece of text belongs to.
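The two ideas above can be sketched in plain Python. This is not the project's actual code: it assumes that boxes are given as (x1, y1, x2, y2) tuples in pixel coordinates and that OCR output arrives as a flat list of strings in which blank entries mark row breaks. The names `assign_column` and `split_rows` are hypothetical.

```python
def assign_column(word_box, column_boxes):
    """Return the index of the column whose horizontal span contains the
    horizontal center of the word box, or None if no column matches."""
    x_center = (word_box[0] + word_box[2]) / 2
    for index, (x1, _y1, x2, _y2) in enumerate(column_boxes):
        if x1 <= x_center <= x2:
            return index
    return None


def split_rows(tokens):
    """Group OCR tokens into rows, treating blank tokens as row separators."""
    rows, current = [], []
    for token in tokens:
        if token.strip():
            current.append(token)
        elif current:
            # A blank token closes the row collected so far.
            rows.append(current)
            current = []
    if current:
        rows.append(current)
    return rows


if __name__ == "__main__":
    columns = [(0, 0, 100, 500), (110, 0, 200, 500)]
    print(assign_column((120, 10, 160, 30), columns))
    print(split_rows(["Item", "A", "", "Item", "B", "", ""]))
```

Using the word's center rather than its left edge makes the assignment tolerant of words that slightly overlap a neighboring column box.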

Cleaning the Data

After extracting the data, the following cleaning steps were performed:

  1. Handling Missing Values: If the first column is empty, it means that row is part of the previous one. These rows are merged accordingly.
  2. Numeric Fields: Cleaned numeric fields to ensure they are in a format suitable for analysis (e.g., removing non-numeric characters).
  3. CSV and Graph Generation: Generated the CSV file and some basic graphs to visualize the cleaned data.
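The cleaning steps can be illustrated with a small standard-library sketch. It is not the project's implementation and makes several assumptions: every row is a list of strings with the same number of cells, commas in numbers are thousands separators, and the helper names (`merge_continuation_rows`, `clean_numeric`, `write_csv`) are hypothetical.

```python
import csv
import re


def merge_continuation_rows(rows):
    """Merge each row whose first cell is empty into the row before it."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():
            previous = merged[-1]
            # Join cell by cell, skipping empty parts to avoid stray spaces.
            merged[-1] = [
                " ".join(part for part in (old.strip(), new.strip()) if part)
                for old, new in zip(previous, row)
            ]
        else:
            merged.append([cell.strip() for cell in row])
    return merged


def clean_numeric(text):
    """Strip non-numeric characters and return a float, or None if nothing
    usable is left. Assumes commas are thousands separators."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def write_csv(rows, header, path):
    """Write the header and the cleaned rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, `merge_continuation_rows([["Rice", "long", "1,200"], ["", "grain", ""]])` yields a single row whose second cell is "long grain", and `clean_numeric("1,200")` returns 1200.0.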

Possible Issues:

You may run into issues when running this project on a different computer. The project was developed on my computer, which runs macOS 12.7.6.

If the problem is caused by pytorch, torchvision, or torchaudio, you can try editing the environment.yml file: remove pytorch from the list of dependencies and add it to channels. If you encounter problems with Detectron2, consult its website, which explains the requirements for installing the package.

References

Code for training model:

Shen, Zejiang, Zhang, Kaixuan, & Dell, Melissa. (2020). "A Large Dataset of Historical Japanese Documents with Complex Layouts" arXiv:2004.08686

Model used for training:

Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. Retrieved from https://github.com/facebookresearch/detectron2.

Template used:

Template

About

final-project-s33btorr created by GitHub Classroom