This repository contains the implementation of my thesis at Università Bocconi.
The repository contains four sub-repositories, which serve different purposes and use different models:
- training_Mask_RCNN is a fork of https://github.com/Layout-Parser/layout-model-training, which is a wrapper around Facebook's Detectron2 (https://github.com/facebookresearch/detectron2), with a configuration file for training Mask R-CNN with a Feature Pyramid Network. Additional functionality has been added to convert the training dataset from XML to a JSON file with a custom taxonomy, split the dataset, and create a test set. A tutorial on visualizing the training process is also included, so users can check whether the model is overfitting. To run the tutorial, follow the detailed guide in one of the notebooks on how to access and download the Prima Dataset, a benchmark for magazine and newspaper layout analysis. If you have difficulties setting up the environment, refer to https://github.com/Layout-Parser/layout-model-training/blob/master/requirements.txt
- inference_Mask_RCNN is a fork of https://github.com/Layout-Parser/layout-parser, which is a library for model initialization, inference, and visualization. The inference code is modified to use both bounding boxes and masks, and a post-processing class is added that converts the raw model outputs into refined bounding boxes on which the Tesseract API (the OCR engine) can be called. The repository includes a tutorial on how to initialize a model, run inference, and visualize the results. A custom model is also provided and can be downloaded from https://drive.google.com/drive/folders/15KNAPItTzDQwu-t5clvVq3b_JmhEI7th?usp=sharing. There is a tutorial on how to set it up. If you have difficulties setting up the environment, refer to https://github.com/Layout-Parser/layout-parser/blob/main/installation.md
- training_SegmentationTransformer is a fork of https://github.com/fudan-zvg/SETR, which is a repository for training and inference of Segmentation Transformers trained on different benchmark datasets and with different decoders. In this sub-repository the inference part is removed, and a tutorial on how to set up and train a model of your choice is added. Further tutorials cover how to monitor the training process, how to obtain evaluation metrics, and how to transform the Prima Dataset from XML to mask images with color palettes. If you have difficulties setting up the environment, refer to https://github.com/fudan-zvg/SETR/blob/main/docs/install.md
- inference_SegmentationTransformer is a fork of https://github.com/fudan-zvg/SETR, which is a repository for training and inference of Segmentation Transformers trained on different benchmark datasets and with different decoders. In this sub-repository the training part is removed. The source code for initializing a model is modified to make the process smoother. A post-processing pipeline and a custom data type are added to store the model outputs and to create bounding boxes from the masks, so that the OCR engine can be called on them, which is not possible on the raw masks. Two custom models are provided and can be downloaded from https://drive.google.com/drive/folders/1zOXvcoHRwZZWZIC5mTYRVoxAI2rQUepe?usp=sharing and https://drive.google.com/drive/folders/1ShZB0FlnGa2xXpFdWQCAptPARonXMv2g?usp=sharing. There is a tutorial on how to set them up. If you have difficulties setting up the environment, refer to https://github.com/fudan-zvg/SETR/blob/main/docs/install.md
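
Both inference sub-repositories turn predicted masks into bounding boxes before calling the OCR engine. The snippet below is only a rough sketch of that idea, not the code used in this repository: the function name `mask_to_boxes`, the label values, and the toy mask are hypothetical, and it assumes the mask is a plain 2D grid of integer class labels with `0` as background. It also produces a single box per class, whereas a real pipeline would likely separate individual regions first.

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def mask_to_boxes(mask: List[List[int]], background: int = 0) -> Dict[int, Box]:
    """Return one tight bounding box per class label found in a 2D label mask.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    boxes: Dict[int, Box] = {}
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label == background:
                continue  # background pixels never contribute to a box
            if label not in boxes:
                # First pixel seen for this label: the box is a single pixel
                boxes[label] = (x, y, x, y)
            else:
                # Grow the existing box so that it also covers this pixel
                x0, y0, x1, y1 = boxes[label]
                boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


# Toy 4x6 mask: labels 1 and 2 stand for two hypothetical layout classes.
toy_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
boxes = mask_to_boxes(toy_mask)
```

Each resulting box could then be used to crop the page image, and the crop passed to the OCR engine. The tutorials in the sub-repositories show the actual post-processing classes.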