Legal Documents Retrieval and Analysis System

Project Overview

This project builds a Retrieval-Augmented Generation (RAG) system that stores, processes, and analyzes legal documents. Its core components are:

Vectorized Database: Legal documents are stored in PostgreSQL with the PGVector extension, which provides vector storage and similarity-based retrieval.
Text Embedding: A freely available JinaAI model generates embeddings for the text of the legal documents.
Data Extraction: The Python library unstructured extracts structured data from the legal documents.
Data Processing: The LLaMA3 model processes the data retrieved by querying the vectorized database and converts it into Python-CDM format.

Installation

To set up the project, install the following dependencies:

Install PostgreSQL and PGVector
- Ensure you have PostgreSQL installed on your system.
- Install the PGVector extension for PostgreSQL to enable vectorized storage.
Install Python Libraries and Models
- Unstructured: Obtain the library from its Git repository.
- JinaAI: Obtain the model from its repository on Hugging Face.
- LLaMA3: Obtain the LLaMA3 model from Ollama.

Usage

Vectorizing Legal Documents:
- Load your legal documents into the system.
- Use the JinaAI model to generate embeddings for these documents.
- Store the vectorized documents in the PostgreSQL database with PGVector.
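
The storage step above can be sketched as follows. This is a minimal illustration, not the project's actual schema: the table name legal_chunks, its columns, and the 768-dimension setting are assumptions and must be adapted to the embedding model in use. The sketch only builds the SQL and the vector literal; running the statements requires a PostgreSQL driver, which is not shown here.

```python
from typing import Sequence, Tuple

# Hypothetical names: the table, its columns, and the dimension are assumptions
# made for illustration; they are not taken from this repository.
TABLE = "legal_chunks"
EMBEDDING_DIM = 768  # must match the output size of the embedding model

SETUP_SQL = [
    # Enables the pgvector extension in the current database.
    "CREATE EXTENSION IF NOT EXISTS vector;",
    f"CREATE TABLE IF NOT EXISTS {TABLE} ("
    "id bigserial PRIMARY KEY, "
    "content text NOT NULL, "
    f"embedding vector({EMBEDDING_DIM}));",
]

INSERT_SQL = f"INSERT INTO {TABLE} (content, embedding) VALUES (%s, %s::vector);"

# <=> is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = (
    f"SELECT content FROM {TABLE} "
    "ORDER BY embedding <=> %s::vector LIMIT %s;"
)


def to_pgvector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_params(content: str, embedding: Sequence[float]) -> Tuple[str, str]:
    """Build the parameter tuple for INSERT_SQL, checking the dimension first."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return content, to_pgvector_literal(embedding)
```

With a driver connection, the statements in SETUP_SQL would be executed once, INSERT_SQL once per document chunk with insert_params(...), and SEARCH_SQL with the literal of the query embedding and the number of rows wanted.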
Extracting Data:
- Use the unstructured library to extract data from the documents.
- Ensure the extracted data is correctly structured and ready for further processing.
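
The extraction step above might look like the sketch below. It assumes the unstructured library's partition function, which returns a list of elements that each carry a text attribute. The chunk_texts helper and its max_chars default are hypothetical additions for illustration; the repository may prepare its data differently.

```python
from typing import Iterable, List


def extract_texts(path: str) -> List[str]:
    """Return the non-empty text of each element unstructured finds in a file."""
    # Imported lazily so chunk_texts below works without the library installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path)
    return [el.text for el in elements if el.text and el.text.strip()]


def chunk_texts(texts: Iterable[str], max_chars: int = 1000) -> List[str]:
    """Greedily merge consecutive text pieces into chunks of at most max_chars.

    A single piece longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty pieces
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            # Adding this piece would overflow: close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Merging small elements into larger chunks keeps related clauses together before they are embedded; the chunk size is a tuning choice, not something the source specifies.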
Processing Data:
- Use the LLaMA3 model to process the data retrieved from the vectorized database.
- Apply the queries needed to transform the data into the desired format.
- Convert the processed data into Python-CDM format using LLaMA3.
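
As a rough sketch of the processing step, the code below builds a prompt from retrieved passages and sends it to a local Ollama server over its HTTP API. The prompt wording and the function names are assumptions for illustration. The prompt used for the Python-CDM conversion is not described in this README, so it is not reproduced here.

```python
import json
import urllib.request
from typing import List

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine retrieved passages and a question into a single prompt.

    The wording is a hypothetical template, not the project's own prompt.
    """
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def ask_llama3(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a running Ollama server and return the generated text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling ask_llama3(build_prompt(question, chunks)) requires the Ollama server to be running with the llama3 model already pulled.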

Main Components

main.py: A chatbot interface for querying the LLaMA3 model. The chatbot retrieves the data stored in the RAG system and answers queries based on it.
test_unstructured.ipynb: Run this notebook to extract data from legal documents and store it in the repository.
test_builder.ipynb: Run this notebook to build the contractDetails CDM object. It processes the data and creates the contract details structure.
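
The chatbot flow described for main.py can be pictured as a retrieve-then-generate loop. The sketch below is not the contents of main.py: the function names are hypothetical, and the retriever and generator are passed in as plain callables so the loop stays independent of the database and the model.

```python
from typing import Callable, List

# Both callables are placeholders: a retriever returning the top_k most
# relevant passages, and a generator producing an answer from them.
Retriever = Callable[[str, int], List[str]]
Generator = Callable[[str, List[str]], str]


def answer(question: str, retrieve: Retriever, generate: Generator,
           top_k: int = 3) -> str:
    """Retrieve passages for the question, then ask the model to answer."""
    chunks = retrieve(question, top_k)
    if not chunks:
        return "No relevant passages found."
    return generate(question, chunks)


def chat_loop(retrieve: Retriever, generate: Generator) -> None:
    """Read questions from standard input until 'exit', 'quit', or end of input."""
    while True:
        try:
            question = input("> ").strip()
        except EOFError:
            break
        if question.lower() in {"exit", "quit"}:
            break
        if question:
            print(answer(question, retrieve, generate))
```

Keeping retrieval and generation behind callables makes the loop easy to exercise with stand-in functions before the database and the model are available.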

Helpful DCO Resources

License

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0
