This project focuses on:
- Training a deep learning model to detect tabular data in PDFs.
- Detecting and extracting the data from a specific PDF file with complex tables.
- Cleaning the extracted data.
It uses the Detectron2 model "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml" as the
starting point for training. I created the training dataset myself; it contains
25 images in which the columns have been marked. PyTesseract was used to detect
the text.
Note
PyTesseract can sometimes misread words, so the extracted text may need manual verification before serious use.
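One way to narrow down what needs manual verification is to use the per-word confidence that Tesseract reports. The sketch below assumes the OCR output is a dictionary shaped like the result of pytesseract.image_to_data with output_type=Output.DICT, that is, parallel "text" and "conf" lists. The threshold of 60 and the sample values are illustrative assumptions, not values used in this project.

```python
def flag_low_confidence(ocr, threshold=60.0):
    """Return (index, text, conf) for recognised words below the threshold."""
    flagged = []
    for i, (text, conf) in enumerate(zip(ocr["text"], ocr["conf"])):
        conf = float(conf)  # conf may arrive as str, int, or float
        # conf == -1 marks layout entries (blocks, lines) that carry no word
        if conf < 0 or not text.strip():
            continue
        if conf < threshold:
            flagged.append((i, text, conf))
    return flagged


# Made-up sample shaped like image_to_data output (hypothetical values)
sample = {"text": ["", "Total", "1O5", "2023"], "conf": ["-1", "96", "41", "88"]}
for index, text, conf in flag_low_confidence(sample):
    print(f"check word #{index}: {text!r} (confidence {conf:.0f})")
```

Words flagged this way can be checked against the scanned page first, instead of re-reading the whole table.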
To run this project, follow these steps:
$ mamba --version
# or
$ conda --version
$ which tesseract
If Tesseract is installed, this should print something like /usr/local/bin/tesseract
(macOS) or C:\Program Files\Tesseract-OCR\tesseract.exe (Windows).
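The same check can be done from Python, which works the same way on macOS and Windows. This is a minimal sketch using only the standard library; the function name find_tesseract is hypothetical and not part of this project.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; see the installation steps.")
else:
    # The version banner goes to stdout or stderr depending on the Tesseract build
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    banner = (result.stdout or result.stderr).strip().splitlines()
    print(f"Found {path}: {banner[0] if banner else 'unknown version'}")
```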
If it is not installed, follow these steps to install it:
- If you have macOS:
  - Install Tesseract using Homebrew (recommended):
    $ brew install tesseract
  - Verify the installation:
    $ tesseract --version
    It should display something like tesseract 5.x.x.
- If you have Windows:
  - Download the installer from the official Tesseract page: Official Page Tesseract. Download the latest version available.
  - Install Tesseract by running the installer and following the setup instructions. Make sure to check the "Add Tesseract to PATH" option during installation.
  - Verify the installation:
    $ tesseract --version
    It should display something like tesseract 5.x.x.
$ mamba env create -f environment.yml
$ conda activate final_project_btb
You have two options to download the data:
- Via Google Drive: Click on this link (https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing) and then click "Download".
- Via Dropbox: Click on this link (https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0) and then click "Download".
Place the downloaded file in this path: final-project-s33btorr/src/final_project_btb/data
$ pytask
My motivation for this project stems from the fact that I could not find any pre-trained model, software, or package that could accurately read the table I needed given its complexity. Therefore, I trained a model using images similar to those I need to extract, allowing me to automate the extraction of a large number of pages in the future. With other programs, this process would take hours and result in a significant number of errors.
In this project, I have trained a deep learning model to detect the columns of a table from scanned PDFs using the Roboflow dataset I generated. After training, the model can identify the positions of different table columns. The extracted data is then processed and cleaned for analysis.
You can access the dataset used for training via the following link: Roboflow Dataset
To train the model, I used the following approach:
- The model was trained using a GPU provided by Google Colab.
- The model was saved after training as model_final-2.pth.
- The model is capable of detecting the columns in the table from a specific scanned PDF.
You can view the code used to train the model, or download it to modify it, in this notebook: Training Model Notebook
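For orientation, loading the saved weights for inference in Detectron2 could look roughly like the sketch below. It is not taken from the notebook: the function names, the single "column" class, the 0.5 score threshold, and the CPU device are all assumptions. Detectron2 is imported inside the function so the snippet can be read and imported on a machine where it is not installed.

```python
CONFIG_NAME = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def build_predictor(weights_path, num_classes=1, score_threshold=0.5, device="cpu"):
    """Build a Detectron2 predictor from the base config and fine-tuned weights."""
    # Lazy imports: Detectron2 is only needed when this function is called
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(CONFIG_NAME))
    cfg.MODEL.WEIGHTS = weights_path                    # e.g. the model_final-2.pth file
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes       # assumption: one "column" class
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_threshold
    cfg.MODEL.DEVICE = device
    return DefaultPredictor(cfg)


def predict_column_boxes(predictor, image_bgr):
    """Return predicted boxes as [x1, y1, x2, y2] lists, sorted left to right."""
    instances = predictor(image_bgr)["instances"].to("cpu")
    boxes = instances.pred_boxes.tensor.tolist()
    return sorted(boxes, key=lambda box: box[0])
```

Sorting the boxes by their left edge gives a stable column order, which is convenient when the text is later matched to columns.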
Once the model is trained, it is saved as model_final-2.pth. This file is used to:
- Extract the text using PyTesseract. I noticed that PyTesseract leaves a blank cell whenever the text goes to a new line. This can be used to determine the boundaries of each row in the table.
- Predict column positions in new PDF tables with a similar structure. These predictions are used to determine which column each piece of text belongs to.
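The two steps above can be combined roughly as follows. This is a simplified sketch, not the project's actual code: it assumes the OCR output is a dictionary with parallel "text", "left", and "width" lists (the shape pytesseract.image_to_data returns with output_type=Output.DICT), that a blank text entry marks the boundary between two lines, and that the column boxes are given as (x1, y1, x2, y2) tuples sorted left to right. The function names and sample values are hypothetical.

```python
def column_index(x_center, column_boxes):
    """Index of the column box whose horizontal span contains x_center, else None."""
    for i, (x1, _y1, x2, _y2) in enumerate(column_boxes):
        if x1 <= x_center <= x2:
            return i
    return None


def words_to_rows(ocr, column_boxes):
    """Turn OCR words into rows of cell strings, one cell per column."""
    rows, current = [], None
    for text, left, width in zip(ocr["text"], ocr["left"], ocr["width"]):
        if not text.strip():
            # Blank entry: treat it as the boundary between two text lines
            if current is not None:
                rows.append([" ".join(cell) for cell in current])
                current = None
            continue
        col = column_index(left + width / 2, column_boxes)
        if col is None:
            continue  # word falls outside every predicted column
        if current is None:
            current = [[] for _ in column_boxes]
        current[col].append(text)
    if current is not None:
        rows.append([" ".join(cell) for cell in current])
    return rows


# Illustrative values only: two columns and two text lines
columns = [(0, 0, 100, 500), (110, 0, 200, 500)]
ocr = {
    "text":  ["", "Anna", "Lopez", "1,200", "", "Ben", "85"],
    "left":  [0, 5, 45, 120, 0, 5, 125],
    "width": [200, 35, 40, 40, 200, 25, 20],
}
print(words_to_rows(ocr, columns))
```

Using the horizontal centre of each word, rather than its left edge, makes the assignment less sensitive to words that start slightly outside a predicted box.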
After extracting the data, the following cleaning steps were performed:
- Handling Missing Values: If the first column is empty, it means that row is part of the previous one. These rows are merged accordingly.
- Numeric Fields: Cleaned numeric fields to ensure they are in a format suitable for analysis (e.g., removing non-numeric characters).
- CSV and Graph Generation: Generated the CSV file and some basic graphs to visualize the cleaned data.
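A minimal sketch of the first two cleaning steps and the CSV export, using only the standard library. The column names, sample rows, and function names are hypothetical, and the numeric cleaning assumes "." as the decimal separator and "," as a thousands separator; the project's actual rules may differ.

```python
import csv
import io
import re


def merge_continuation_rows(rows):
    """Append rows whose first cell is empty to the previous row, cell by cell."""
    merged = []
    for row in rows:
        if merged and not row[0].strip():
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    previous[i] = f"{previous[i]} {cell.strip()}".strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


def clean_number(text):
    """Strip everything except digits, '.', and '-'; return a float, or None if not a number."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


# Made-up rows: the second one continues the first
raw_rows = [
    ["Anna Lopez", "Sales", "$1,200"],
    ["", "department", ""],
    ["Ben", "IT", "85 USD"],
]
merged = merge_continuation_rows(raw_rows)
table = [[name, unit, clean_number(amount)] for name, unit, amount in merged]

# Written to an in-memory buffer here; a real run would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "unit", "amount"])
writer.writerows(table)
print(buffer.getvalue())
```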
You may run into issues when running this project on a different computer. The project was developed on my computer, which runs macOS 12.7.6.
If the problem is caused by pytorch, torchvision, or torchaudio, you can try
editing the environment.yml file, removing pytorch from the list of dependencies
and adding it to channels. If you encounter problems with Detectron2, you can
visit the website where the requirements for installing this package are explained.
Shen, Z., Zhang, K., & Dell, M. (2020). A Large Dataset of Historical Japanese Documents with Complex Layouts. arXiv:2004.08686.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. Retrieved from https://github.com/facebookresearch/detectron2.