This project builds a Retrieval-Augmented Generation (RAG) system that stores, processes, and analyzes legal documents. Its core components are:
- Vectorized Database: Legal documents are stored in a vectorized database using PostgreSQL with PGVector for efficient vector storage and retrieval.
- Text Embedding: We use the JinaAI model, which is freely available, to generate embeddings for the text in legal documents.
- Data Extraction: The Python library `unstructured` is used to extract structured data from the legal documents.
- Data Processing: The LLaMA3 model is used to process the extracted data from the vectorized database. This involves applying queries to process the data and converting it into Python-CDM format with the help of LLaMA3.
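
To illustrate what vector retrieval means in the components above, here is a minimal in-memory sketch of similarity ranking. It is not the project's code: PGVector performs this ranking inside PostgreSQL, and the function names `cosine_similarity` and `top_k` are hypothetical helpers introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In the real system the stored vectors would be embeddings of document text, and the query vector would be the embedding of the user's question.
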
To set up the project, you will need to install several libraries. Follow the steps below to install the necessary dependencies:
- Install PostgreSQL and PGVector
  - Ensure you have PostgreSQL installed on your system.
  - Install the PGVector extension for PostgreSQL to enable vectorized storage.
- Install Python Libraries
  - Unstructured: Obtain the library from its Git repository.
  - JinaAI: Obtain the model from its Git repository on Hugging Face.
  - LLaMA3: You can get the LLaMA3 model from Ollama.
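
Once PostgreSQL and PGVector are installed, the extension still has to be enabled in the target database. The sketch below shows one possible way to do that from Python. The table name `legal_chunks` and the dimension 768 are assumptions (768 matches `jina-embeddings-v2-base-en`, but this README does not name the exact Jina model), and `init_schema` expects any DB-API connection, for example one returned by `psycopg2.connect`.

```python
from typing import Any, List

# Hypothetical names: neither the table nor the dimension comes from this project.
TABLE_NAME = "legal_chunks"
EMBEDDING_DIM = 768  # assumption: output size of jina-embeddings-v2-base-en


def schema_statements(table: str = TABLE_NAME, dim: int = EMBEDDING_DIM) -> List[str]:
    """Build the SQL that enables PGVector and creates a table for embeddings."""
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return [
        # pgvector registers itself under the extension name "vector".
        "CREATE EXTENSION IF NOT EXISTS vector",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id BIGSERIAL PRIMARY KEY, "
        "content TEXT NOT NULL, "
        f"embedding vector({int(dim)}) NOT NULL)",
    ]


def init_schema(conn: Any) -> None:
    """Run the schema statements on a DB-API connection and commit."""
    cur = conn.cursor()
    try:
        for statement in schema_statements():
            cur.execute(statement)
    finally:
        cur.close()
    conn.commit()
```

Note that `CREATE EXTENSION` needs sufficient database privileges, so this step may have to be run by an administrator.
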
- Vectorizing Legal Documents:
  - Load your legal documents into the system.
  - Use the JinaAI model to generate embeddings for these documents.
  - Store the vectorized documents in the PostgreSQL database with PGVector.
- Extracting Data:
  - Use the `unstructured` library to extract data from the documents.
  - Ensure the extracted data is correctly structured and ready for further processing.
- Processing Data:
  - Use the LLaMA3 model to process the data extracted from the vectorized database.
  - Apply necessary queries to transform the data into the desired format.
  - Convert the processed data into Python-CDM format using LLaMA3.
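
The extraction and vectorization steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this repository: the table `legal_chunks` with `content` and `embedding` columns is hypothetical, the model name `jinaai/jina-embeddings-v2-base-en` is a guess at which Jina model is meant, and `cursor` is any DB-API cursor such as one from psycopg2.

```python
from typing import Any, Callable, Iterable, List, Sequence

# Hypothetical table; the embedding is sent as text and cast to pgvector's type.
INSERT_SQL = "INSERT INTO legal_chunks (content, embedding) VALUES (%s, %s::vector)"


def extract_texts(path: str) -> List[str]:
    """Partition a document with `unstructured` and return its non-empty texts."""
    # Imported lazily because the dependency is heavy.
    from unstructured.partition.auto import partition
    return [el.text.strip() for el in partition(filename=path)
            if el.text and el.text.strip()]


def load_jina_embedder() -> Callable[[str], List[float]]:
    """Return a text -> vector function backed by a Jina model (assumed name)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en",
                                trust_remote_code=True)
    return lambda text: model.encode(text).tolist()


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, for example '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def store_chunks(cursor: Any,
                 chunks: Iterable[str],
                 embed: Callable[[str], Sequence[float]]) -> int:
    """Embed each chunk, insert it, and return the number of rows written."""
    count = 0
    for chunk in chunks:
        cursor.execute(INSERT_SQL, (chunk, to_pgvector_literal(embed(chunk))))
        count += 1
    return count
```

A typical call would be `store_chunks(cur, extract_texts("contract.pdf"), load_jina_embedder())`, where `contract.pdf` is a placeholder file name. Passing the embedder in as a function keeps the storage step independent of the chosen model.
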
- main.py: This file contains a chatbot interface to make queries to the LLaMA3 model. The chatbot will access the data stored in the RAG and provide relevant responses based on the queries.
- test_unstructured.ipynb: Execute this file if you want to store any data in the repository. It handles the extraction and storage of data from legal documents.
- test_builder.ipynb: Execute this file if you want to build the `contractDetails` CDM object. It processes the data and creates the desired contract details structure.
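
For orientation, the retrieve-then-generate flow that a chatbot like the one in main.py relies on might look like the sketch below. It does not reproduce main.py. It assumes a hypothetical `legal_chunks` table with `content` and `embedding` columns, a local Ollama server on its default port 11434 with the `llama3` model pulled, and an `embed` function that maps text to a vector.

```python
import json
import urllib.request
from typing import Any, Callable, List, Sequence

# "<=>" is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = (
    "SELECT content FROM legal_chunks "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)


def retrieve(cursor: Any, query_vector: Sequence[float], k: int = 4) -> List[str]:
    """Return the k stored passages closest to the query vector."""
    literal = "[" + ",".join(repr(float(x)) for x in query_vector) + "]"
    cursor.execute(SEARCH_SQL, (literal, k))
    return [row[0] for row in cursor.fetchall()]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def ask_ollama(prompt: str,
               model: str = "llama3",
               url: str = "http://localhost:11434/api/generate") -> str:
    """Send the prompt to Ollama's generate endpoint and return the reply text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def answer(question: str,
           cursor: Any,
           embed: Callable[[str], Sequence[float]],
           generate: Callable[[str], str] = ask_ollama,
           k: int = 4) -> str:
    """Embed the question, fetch similar passages, and ask the model."""
    passages = retrieve(cursor, embed(question), k)
    return generate(build_prompt(question, passages))
```

The `generate` parameter makes it possible to swap in a stub during testing, so the retrieval and prompt-building logic can be checked without a running model.
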
Copyright 2024 TradeHeader SL
Distributed under the Apache License, Version 2.0.
SPDX-License-Identifier: Apache-2.0