Financial Document Classification using Deep Learning

Introduction

Managing and classifying financial documents manually is both time-consuming and error-prone. This project automates the task with deep learning: the FinBERT model is fine-tuned with TensorFlow on a custom dataset to categorize financial documents. The fine-tuned model reaches a classification accuracy of 95.84% and is served through a Streamlit application deployed on the Hugging Face platform.

Table of Contents

Key Technologies and Skills
Installation
Usage
Features
Contributing
License
Contact

Key Technologies and Skills

Python
scikit-learn
TensorFlow
Transformers
Numpy
Pandas
BeautifulSoup
Matplotlib
Seaborn
Streamlit
Hugging Face
Application Programming Interface (API)

Installation

To run this project, you need to install the following packages:

pip install python-dotenv
pip install datasets
pip install tensorflow
pip install transformers
pip install sentencepiece
pip install numpy
pip install pandas
pip install beautifulsoup4
pip install matplotlib
pip install seaborn
pip install streamlit
pip install streamlit_extras
pip install huggingface-hub

Note: If you encounter an "ImportError: DLL load failed" error after installing TensorFlow, uninstall it and install a pinned version:

pip uninstall tensorflow
pip install tensorflow==2.12.0 --upgrade

Usage

To use this project, follow these steps:

Clone the repository: git clone https://github.com/gopiashokan/Finance-Document-Classification-Using-Deep-Learning.git
Install the required packages: pip install -r requirements.txt
Run the Streamlit app: streamlit run app.py
Access the app in your browser at http://localhost:8501

Features

Data Collection:

The dataset comprises HTML files organized into five distinct folders, namely Balance Sheets, Cash Flow, Income Statement, Notes, and Others. These folders represent various financial document categories. You can access the dataset via the following download link.

📙 Dataset Link: https://www.kaggle.com/datasets/gopiashokan/financial-document-classification-dataset
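
As a rough illustration of how the folder layout maps to labels, the sketch below walks a dataset root and pairs each HTML file with the name of its parent folder. The function name `list_labeled_files`, the exact folder spellings, and the `.html` file extension are assumptions made for this example; they are not taken from the project notebook.

```python
from pathlib import Path

# Assumed folder names, copied from the category list above; adjust them
# if the extracted dataset uses different spellings.
CATEGORIES = ["Balance Sheets", "Cash Flow", "Income Statement", "Notes", "Others"]


def list_labeled_files(root):
    """Return (path, category) pairs for every .html file in the category folders."""
    pairs = []
    for category in CATEGORIES:
        folder = Path(root) / category
        if not folder.is_dir():
            continue  # skip categories missing from this copy of the dataset
        for path in sorted(folder.glob("*.html")):
            pairs.append((path, category))
    return pairs
```

Each pair can then be fed to the text-extraction step, with the category name serving as the target label.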

Data Preprocessing:

Text Extraction: BeautifulSoup parses each HTML file and extracts its text content. The extracted text is collected into a Pandas DataFrame, and the target labels are encoded numerically for model training.
Data Splitting: The dataset is divided into training and testing sets using scikit-learn, so that the model is evaluated on documents it did not see during training.
Tokenization: The FinBERT tokenizer (yiyanghkust/finbert-pretrain) from the Hugging Face Transformers library converts the text into numerical token IDs, enabling the model to process financial terminology effectively.
Padding and Truncation: Tokenized sequences are padded or truncated to a maximum length of 512 tokens, ensuring consistent input sizes across the training and testing datasets.
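
The text-extraction and label-encoding steps could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual code: the helper names `html_to_text` and `encode_labels`, the whitespace cleanup, and the choice to assign integer ids in sorted order are all assumptions.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip tags from an HTML string and collapse runs of whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text joins all text nodes; the separator keeps table cells apart.
    text = soup.get_text(separator=" ")
    return " ".join(text.split())


def encode_labels(categories):
    """Map each category name to an integer id (sorted order is an assumption)."""
    label_to_id = {name: i for i, name in enumerate(sorted(set(categories)))}
    return [label_to_id[name] for name in categories], label_to_id


html = "<html><body><table><tr><td>Total assets</td><td>1,200</td></tr></table></body></html>"
print(html_to_text(html))  # Total assets 1,200
```

With the Transformers library, padding and truncation are typically requested when calling the tokenizer, for example `tokenizer(texts, padding="max_length", truncation=True, max_length=512)`; whether the project uses exactly these arguments is not stated here.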

Model Training:

Pretrained Model: FinBERT, a domain-specific BERT model for financial text, is loaded and fine-tuned on the custom dataset through transfer learning to improve classification accuracy.
Optimization Strategy: The model is compiled with the Adam optimizer, the SparseCategoricalCrossentropy loss function, and accuracy as the evaluation metric.
Training and Evaluation: The model is trained and validated using TensorFlow, achieving a classification accuracy of 95.84%.
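
To make the loss and metric concrete, here is a small pure-Python sketch of what sparse categorical cross-entropy and accuracy compute for integer labels and raw model outputs (logits). It illustrates the math only and is not the project's training code; the function names are hypothetical, and it assumes the model outputs logits rather than probabilities.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(labels, batch_logits):
    """Mean of -log(probability assigned to the true class) over the batch."""
    losses = [-math.log(softmax(logits)[label])
              for label, logits in zip(labels, batch_logits)]
    return sum(losses) / len(losses)


def accuracy(labels, batch_logits):
    """Fraction of examples whose highest-scoring class equals the label."""
    hits = sum(1 for label, logits in zip(labels, batch_logits)
               if logits.index(max(logits)) == label)
    return hits / len(labels)
```

With five classes, a model that assigns equal probability to every class has a loss of ln 5, roughly 1.609, which is a useful reference point when reading training logs.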

Model Deployment and Inference:

Hugging Face Hub Integration: The fine-tuned model and tokenizer are pushed to the Hugging Face Hub using an access token, making them available for inference through the API.
Hugging Face Hub: https://huggingface.co/gopiashokan/Financial-Document-Classification-using-Deep-Learning
Application Development: A Streamlit application lets users upload new HTML documents for classification. It displays the predicted class with its confidence score and also shows the uploaded document, which makes the result easier to interpret.
API-based Inference: The Streamlit application is deployed on the Hugging Face platform. When a user uploads an HTML document, the application sends the extracted text to the Hugging Face API, retrieves the model predictions, and displays the highest-confidence class along with its score.

🚀 Application: https://huggingface.co/spaces/gopiashokan/Financial-Document-Classification-using-Deep-Learning
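
Selecting the highest-confidence class from the returned predictions could be sketched as follows. The response shape (a list of `{"label", "score"}` entries, possibly nested one level deep) and the function name `top_prediction` are assumptions for illustration, and the labels in the example are made up; the deployed model may return different label strings.

```python
def top_prediction(predictions):
    """Return (label, score) for the highest-scoring entry.

    Assumes each entry looks like {"label": str, "score": float}; a
    response nested one level deep ([[...]]) is unwrapped first.
    """
    if predictions and isinstance(predictions[0], list):
        predictions = predictions[0]
    best = max(predictions, key=lambda item: item["score"])
    return best["label"], best["score"]


# Hypothetical response used only to exercise the function.
response = [[
    {"label": "Balance Sheets", "score": 0.91},
    {"label": "Notes", "score": 0.06},
    {"label": "Others", "score": 0.03},
]]
label, score = top_prediction(response)
print(f"{label}: {score:.2%}")  # Balance Sheets: 91.00%
```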

Conclusion:

This project successfully classifies financial documents using deep learning and transfer learning techniques. By leveraging FinBERT and fine-tuning it on a domain-specific dataset, we achieve high accuracy in document categorization. The integration of a user-friendly Streamlit application enhances accessibility, making financial document classification more efficient and scalable.

References:

scikit-learn Documentation: https://scikit-learn.org/
TensorFlow Documentation: https://www.tensorflow.org/
Transformers Documentation: https://huggingface.co/docs/transformers/en/index
Streamlit Documentation: https://docs.streamlit.io/

Contributing:

Contributions to this project are welcome! If you encounter any issues or have suggestions for improvements, please feel free to submit a pull request.

License:

This project is licensed under the MIT License. Please review the LICENSE file for more details.

Contact:

📧 Email: [email protected]

🌐 LinkedIn: linkedin.com/in/gopiashokan

For any further questions or inquiries, feel free to reach out. We are happy to assist you with any queries.
