Introduction
Managing and classifying financial documents manually is both time-consuming and error-prone. This project streamlines the process by leveraging deep learning techniques for automated classification. Utilizing TensorFlow and fine-tuning the FinBERT model on a custom dataset, we achieve precise categorization of financial documents. The model is seamlessly integrated into a user-friendly Streamlit application and deployed on the Hugging Face platform, ensuring high accuracy and efficiency in financial document management.
Table of Contents
- Key Technologies and Skills
- Installation
- Usage
- Features
- Contributing
- License
- Contact
Key Technologies and Skills
- Python
- scikit-learn
- TensorFlow
- Transformers
- Numpy
- Pandas
- BeautifulSoup
- Matplotlib
- Seaborn
- Streamlit
- Hugging Face
- Application Programming Interface (API)
Installation
To run this project, you need to install the following packages:
pip install python-dotenv
pip install datasets
pip install tensorflow
pip install transformers
pip install sentencepiece
pip install numpy
pip install pandas
pip install beautifulsoup4
pip install matplotlib
pip install seaborn
pip install streamlit
pip install streamlit_extras
pip install huggingface-hub
Note: If you face "ImportError: DLL load failed" error while installing TensorFlow,
pip uninstall tensorflow
pip install tensorflow==2.12.0 --upgrade
Usage
To use this project, follow these steps:
- Clone the repository:
git clone https://github.com/gopiashokan/Finance-Document-Classification-Using-Deep-Learning.git
- Install the required packages:
pip install -r requirements.txt
- Run the Streamlit app:
streamlit run app.py
- Access the app in your browser at
http://localhost:8501
Features
The dataset comprises HTML files organized into five distinct folders, namely Balance Sheets, Cash Flow, Income Statement, Notes, and Others. These folders represent various financial document categories. You can access the dataset via the following download link.
📙 Dataset Link: https://www.kaggle.com/datasets/gopiashokan/financial-document-classification-dataset
-
Text Extraction: BeautifulSoup is utilized to parse and extract text content from HTML files. The extracted text is structured into a DataFrame using Pandas, and the target labels are encoded to facilitate numerical processing for model training.
-
Data Splitting: The dataset was divided into training and testing sets using a Scikit-learn. This partitioning strategy ensured an appropriate distribution of data for model training and evaluation, thereby enhancing the robustness of the trained model.
-
Tokenization: The FinBERT tokenizer from Hugging Face Transformers library
yiyanghkust/finbert-pretrain
is applied to convert text data into numerical vectors, enabling the model to process financial terminology effectively. -
Padding and Truncation: Tokenized sequences are padded and truncated to a maximum length of 512, ensuring consistent input sizes both training and testing datasets.
-
Pretrained Model: The FinBERT is a domain-specific BERT model for financial texts, is loaded and Fine-tuned using Transfer Learning on the custom dataset for improving classification accuracy.
-
Optimization Strategy: The model is compiled using the
Adam
optimizer,SparseCategoricalCrossentropy
loss function, andAccuracy
as the evaluation metric, optimizing performance across multiple financial document classes. -
Training and Evaluation: The model is trained and validated using TensorFlow, achieving a classification accuracy of 95.84%, demonstrating its effectiveness in financial document classification.
-
Hugging Face Hub Integration: The Fine-tuned model and tokenizer are deployed on the Hugging Face Hub using Access Token, allowing easy accessibility and inference through APIs.
Hugging Face Hub: https://huggingface.co/gopiashokan/Financial-Document-Classification-using-Deep-Learning -
Application Development: A user-friendly Streamlit application was developed to allow users to upload new HTML documents for classification. The application provided a simple interface for users to interact with, displaying the predicted class and associated confidence scores. Additionally, the application showcased the uploaded document, enhancing the interpretability of the classification results.
-
API-based Inference: The Streamlit application was deployed on the Hugging Face platform, enabling easy access for users to utilize the model for document classification. By deploying on Hugging Face, users can seamlessly upload new HTML documents and sends extracted text to the Hugging Face API, retrieves model predictions and displays the highest confidence class along with its score.
🚀 Application: https://huggingface.co/spaces/gopiashokan/Financial-Document-Classification-using-Deep-Learning
Conclusion:
This project successfully classifies financial documents using deep learning and transfer learning techniques. By leveraging FinBERT and fine-tuning it on a domain-specific dataset, we achieve high accuracy in document categorization. The integration of a user-friendly Streamlit application enhances accessibility, making financial document classification more efficient and scalable.
References:
- scikit-learn Documentation: https://scikit-learn.org/
- TensorFlow Documentation: https://www.tensorflow.org/
- Transformers Documentation: https://huggingface.co/docs/transformers/en/index
- Streamlit Documentation: https://docs.streamlit.io/
Contributing:
Contributions to this project are welcome! If you encounter any issues or have suggestions for improvements, please feel free to submit a pull request.
License:
This project is licensed under the MIT License. Please review the LICENSE file for more details.
Contact:
📧 Email: [email protected]
🌐 LinkedIn: linkedin.com/in/gopiashokan
For any further questions or inquiries, feel free to reach out. We are happy to assist you with any queries.