⚖️ Legal Document Classifier + Summarizer + RAG Search

Note: The full project directory could not be uploaded to GitHub because of size limits and conflicts that arose during the push.

A Zero-Shot Semantic Pipeline for Classifying, Summarizing, and Querying Long Legal Documents

🚀 Overview

This project is an end-to-end AI workflow for legal intelligence:

Zero-shot semantic classification
Hybrid extractive + abstractive summarization
RAG-powered question answering with vector search

Designed especially for long judgments, case files, and statutory documents.

🏛️ System Architecture (The 3-Stage Story)

Stage 1 — Semantic Vectorization & Taxonomy (The Classifier)

Traditional classifiers need thousands of labeled samples. This system doesn’t.

It uses a Zero-Shot Semantic Classifier powered by embeddings.

🔹 Preprocessing

Convert raw text into structured TOON JSON format
Chunk long documents (because BERT-style models have a 512-token limit)
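
The chunking step can be sketched as a sliding window over tokens. This is a minimal illustration, not the project's `chunker.py`: the function name `chunk_tokens`, the 64-token overlap, and the use of whitespace tokens are all assumptions. A real pipeline would count tokens with the embedding model's own tokenizer, since subword counts are usually higher than word counts.

```python
def chunk_tokens(text, max_tokens=512, overlap=64):
    """Split text into overlapping windows of at most max_tokens whitespace tokens.

    Whitespace tokens only approximate the subword tokens a BERT-style
    model sees, so max_tokens should be set conservatively in practice.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical usage on a synthetic 1000-word document.
document = " ".join(f"word{i}" for i in range(1000))
for chunk in chunk_tokens(document):
    print(len(chunk.split()), "tokens")
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding some text twice.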

🔹 Embeddings

Using: BAAI/bge-small-en

Strong results on the MTEB benchmark for a model of its size
Light enough to run locally
Often outperforms older OpenAI embedding models

🔹 Automated Taxonomy

65+ legal categories are embedded into vectors
Compute cosine similarity of each chunk to each category
Apply Max-Pooling Category Assignment:

If even one chunk strongly signals “Murder”, the whole document is classified as Murder.
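
The max-pooling rule above can be sketched in a few lines. This is an illustrative sketch under assumptions: the helper names are hypothetical, "Contract Dispute" is an invented category label, and the 3-dimensional vectors are toy stand-ins for real BGE embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_document(chunk_vectors, category_vectors):
    """Return (best_category, best_score) using max-pooling over chunks."""
    pooled = {}
    for name, cat_vec in category_vectors.items():
        # Max-pooling: a category's document score is its best chunk score.
        pooled[name] = max(cosine(chunk, cat_vec) for chunk in chunk_vectors)
    best = max(pooled, key=pooled.get)
    return best, pooled[best]


# Toy vectors (assumed); two chunks lean towards "Contract Dispute",
# one chunk points strongly at "Murder".
categories = {"Murder": [1.0, 0.0, 0.0], "Contract Dispute": [0.0, 1.0, 0.0]}
chunks = [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0], [0.95, 0.05, 0.0]]
label, score = classify_document(chunks, categories)
print(label, round(score, 3))
```

In this toy input most chunks resemble the other category, yet the single strong chunk decides the label. Averaging chunk scores instead would dilute that signal, which is presumably why max-pooling was chosen.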

Stage 2 — Hybrid Summarization (The Reader)

Legal documents require both accuracy and readability. To reduce hallucinations, the summarizer uses a hybrid pipeline:

🔹 Extractive (LexRank)

Captures the core facts using mathematical sentence centrality.
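
To make "sentence centrality" concrete, here is a minimal pure-Python sketch of continuous LexRank: sentences become TF-IDF vectors, pairwise cosine similarity defines a graph, and a damped power iteration scores each sentence. It is not the project's `lexrank_extractor.py`; the function names, the smoothed IDF formula, the damping factor of 0.85, and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build sparse TF-IDF vectors (dicts) from lower-cased whitespace tokens."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption) so that no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def lexrank(sentences, damping=0.85, iterations=50):
    """Score sentences by damped random-walk centrality on the similarity graph."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    if n == 0:
        return []
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a transition distribution over sentences.
    rows = [[x / (sum(row) or 1.0) for x in row] for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


def top_sentences(sentences, k):
    """Return the k most central sentences, restored to document order."""
    scores = lexrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


# Hypothetical usage on invented sentences.
example = [
    "The appellant challenged the murder conviction.",
    "The murder weapon was recovered from the appellant.",
    "The conviction was upheld on appeal.",
]
print(top_sentences(example, 2))
```

A sentence scores highly when it is similar to many other sentences that are themselves central, which is how the extractor tends to surface the statements a judgment keeps returning to.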

🔹 Abstractive (BART)

Transforms facts into a polished, human-like summary.

The combination keeps the output fluent while grounding it in sentences taken directly from the source document.

Stage 3 — RAG + Semantic Search (The Brain)

All processed chunks are stored in a ChromaDB vector database.

Workflow:

User asks a question
System performs semantic retrieval
Retrieved chunks + prompt → LLM
LLM produces a grounded, context-aware answer

This becomes the Q&A brain of the system.
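
The retrieval and prompt-assembly steps of this workflow can be sketched without any external service. The `InMemoryStore` class below is a hypothetical stand-in for ChromaDB, not its API, and the 2-dimensional vectors, sample chunks, and prompt wording are all invented for illustration. In the real pipeline the vectors would come from the embedding model and the final prompt would be sent to an LLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Tiny stand-in for a vector database (assumed, not the ChromaDB API)."""

    def __init__(self):
        self.items = []  # list of (chunk_text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, vector))

    def query(self, vector, n_results=3):
        """Return the n_results chunk texts most similar to the query vector."""
        ranked = sorted(
            self.items, key=lambda item: cosine(vector, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:n_results]]


def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one LLM prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy data: invented chunks with hand-written 2-d vectors.
store = InMemoryStore()
store.add("The accused was convicted of murder.", [1.0, 0.0])
store.add("The lease was terminated for non-payment.", [0.0, 1.0])

question = "What was the accused convicted of?"
question_vector = [0.9, 0.1]  # would come from the embedding model
retrieved = store.query(question_vector, n_results=1)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Restricting the prompt to retrieved chunks is what keeps the answer tied to the stored documents rather than to whatever the LLM remembers from training.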

🔍 Deep Dive: Why LexRank Instead of TextRank?

“TextRank relies on word overlap, which fails in legal documents where long sentences share common boilerplate words (‘plaintiff’, ‘court’, ‘order’). LexRank uses TF-IDF + cosine similarity, making rare legal terms more influential and finding the true centroid sentence that represents the document’s core meaning.”

Benefits:

✔ Highlights rare but meaningful legal terms
✔ Identifies the central holding / verdict
✔ Avoids selecting long meaningless sentences
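
The difference between the two similarity measures can be seen on a toy corpus. The sketch below implements the word-overlap similarity from the original TextRank formulation and an IDF-weighted cosine in the LexRank style; the sentences are invented, and the helper names and whitespace tokenizer are assumptions.

```python
import math
from collections import Counter


def overlap_similarity(a, b):
    """TextRank-style similarity: shared words over summed log sentence lengths."""
    wa, wb = a.split(), b.split()
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(set(wa) & set(wb)) / denom if denom else 0.0


def idf_table(sentences):
    """Inverse document frequency of each word, treating sentences as documents."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s.split()))
    return {w: math.log(n / df[w]) for w in df}


def tfidf_cosine(a, b, idf):
    """LexRank-style similarity: cosine between TF-IDF weighted sentences."""
    va = {w: tf * idf[w] for w, tf in Counter(a.split()).items()}
    vb = {w: tf * idf[w] for w, tf in Counter(b.split()).items()}
    dot = sum(x * vb.get(w, 0.0) for w, x in va.items())
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(
        sum(x * x for x in vb.values())
    )
    return dot / norm if norm else 0.0


# Invented corpus: "the court order" is boilerplate, "estoppel" is the rare term.
A = "the court order estoppel"
B = "the court order costs"
C = "estoppel pleaded successfully"
D = "the court order adjourned"
idf = idf_table([A, B, C, D])

print("overlap A-B:", overlap_similarity(A, B), " A-C:", overlap_similarity(A, C))
print("tf-idf  A-B:", tfidf_cosine(A, B, idf), " A-C:", tfidf_cosine(A, C, idf))
```

On this corpus the overlap measure rates A closer to B, which shares only boilerplate, while the IDF-weighted cosine rates A closer to C, which shares the rare legal term.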

🧠 Deep Dive: Hallucination Control via Hybrid Design

“Legal summarization must avoid hallucination. Pure abstractive models (like BART or GPT) may invent dates or section numbers when fed noisy input. LexRank first extracts the top 20 sentences, and BART receives only those as input, so the polished summary stays anchored to text that actually appears in the document.”

Why Hybrid?

✔ Filters noise out of 50+ page judgments
✔ Improves factual consistency
✔ Produces human-readable summaries with a lower risk of invented details
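
The extract-then-rewrite design can be sketched as a small function that takes the scorer and the rewriter as parameters. Both callables here are placeholders: `score_fn` stands in for LexRank scoring and `rewrite_fn` for a BART call, and the name `hybrid_summarize` is hypothetical. Only the `top_k=20` default comes from the description above.

```python
def hybrid_summarize(sentences, score_fn, rewrite_fn, top_k=20):
    """Extract the top_k highest-scoring sentences, then rewrite only those.

    score_fn:   maps a list of sentences to a list of scores (e.g. LexRank).
    rewrite_fn: maps one string to a rewritten string (e.g. a BART summarizer).
    Returns (summary, extracted_sentences) so the summary can be audited
    against the exact sentences it was generated from.
    """
    scores = score_fn(sentences)
    keep = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    extracted = [sentences[i] for i in sorted(keep[:top_k])]  # document order
    return rewrite_fn(" ".join(extracted)), extracted


# Placeholder components (assumed): word count instead of LexRank,
# and an identity function instead of BART.
def word_count_scores(sentences):
    return [len(s.split()) for s in sentences]


def identity_rewrite(text):
    return text


judgment = [
    "Heard.",
    "The appellant was convicted by the trial court.",
    "List after two weeks.",
    "The conviction is set aside for want of evidence.",
]
summary, evidence = hybrid_summarize(
    judgment, word_count_scores, identity_rewrite, top_k=2
)
print(summary)
```

Returning the extracted sentences alongside the summary is a design choice of this sketch: it lets a reviewer check every statement in the summary against its source sentences. The abstractive step can still paraphrase incorrectly, so the extractive filter narrows the risk rather than removing it.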

🏗️ Tech Stack

Component	Technology
Embeddings	BAAI/bge-small-en
Extractive Summarization	LexRank
Abstractive Summarization	BART
Vector DB	ChromaDB
Similarity Metric	Cosine Similarity
RAG Pipeline	Custom implementation

📁 Folder Structure

├── data/
├── preprocessing/
│   ├── toon_converter.py
│   └── chunker.py
├── embeddings/
│   └── embedder.py
├── taxonomy/
│   └── classifier.py
├── summarizer/
│   ├── lexrank_extractor.py
│   └── bart_summarizer.py
├── rag/
│   ├── chroma_store.py
│   └── retrieval.py
└── app/
    └── main.py

▶️ How It Works (In 30 Seconds)

Convert raw legal PDF → TOON JSON
Chunk + embed using BGE-Small
Vector similarity → assign category
LexRank → extract facts
BART → abstract summary
ChromaDB → store chunks
Ask question → retrieve relevant chunks
LLM → grounded answer

🎯 Ideal For

Legal-tech startups
Court document analysis
Compliance automation
Case-law retrieval systems
Enterprise search solutions
