FR34KY-CODER/Legal-Case-Intelligence-System-Phase-2

Note: the full project directory could not be uploaded to GitHub because of size limits and conflicts that arose during the push.

⚖️ Legal Document Classifier + Summarizer + RAG Search

A Zero-Shot Semantic Pipeline for Classifying, Summarizing, and Querying Long Legal Documents


🚀 Overview

This project is an end-to-end AI workflow for legal intelligence:

  • Zero-shot semantic classification
  • Hybrid extractive + abstractive summarization
  • RAG-powered question answering with vector search

Designed especially for long judgments, case files, and statutory documents.


🏛️ System Architecture (The 4-Stage Story)

Stage 1 — Semantic Vectorization & Taxonomy (The Classifier)

Traditional classifiers need thousands of labeled samples. This system doesn’t.

It uses a Zero-Shot Semantic Classifier powered by embeddings.

🔹 Preprocessing

  • Convert raw text into structured TOON JSON format
  • Chunk long documents (because BERT-style models have a 512-token limit)
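
The chunking step can be sketched as follows. This is a minimal illustration, not the repository's `chunker.py`: the function name `chunk_words`, the window size, and the overlap are all assumptions, and word counts are only a rough proxy for model tokens.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into overlapping word windows.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer to stay under 512.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of embedding some words twice.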

🔹 Embeddings

Using: BAAI/bge-small-en

  • Strong results on the MTEB benchmark for its size
  • Light enough to run locally
  • Competitive with older OpenAI embedding models on retrieval tasks

🔹 Automated Taxonomy

  • 65+ legal categories are embedded into vectors

  • Compute cosine similarity of each chunk to each category

  • Apply Max-Pooling Category Assignment:

    If even one chunk strongly signals “Murder”, the whole document is classified as Murder.
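
A minimal sketch of max-pooling category assignment, assuming the chunk and category embeddings are already available as plain lists of floats. The names `cosine` and `classify_document` and the `threshold` parameter are hypothetical and do not come from the repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_document(chunk_vectors, category_vectors, threshold=0.5):
    """Return (category, score) using max-pooling over chunks.

    category_vectors maps a category name to its embedding.
    threshold is an assumed cut-off below which no label is assigned.
    """
    if not chunk_vectors or not category_vectors:
        return None, 0.0
    pooled = {
        name: max(cosine(chunk, cat_vec) for chunk in chunk_vectors)
        for name, cat_vec in category_vectors.items()
    }
    best = max(pooled, key=pooled.get)
    if pooled[best] < threshold:
        return None, pooled[best]
    return best, pooled[best]
```

Taking the maximum rather than the mean means a single strongly matching chunk decides the label, even when most of the document is procedural text that resembles other categories.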


Stage 2 — Hybrid Summarization (The Reader)

Legal documents require both accuracy and readability. To reduce the risk of hallucinations, the summarizer uses a hybrid pipeline:

🔹 Extractive (LexRank)

Captures the core facts using mathematical sentence centrality.

🔹 Abstractive (BART)

Transforms facts into a polished, human-like summary.

The combination aims for output that reads smoothly while staying grounded in the source text.


Stage 3 — RAG + Semantic Search (The Brain)

All processed chunks are stored in a ChromaDB vector database.

Workflow:

  1. User asks a question
  2. System performs semantic retrieval
  3. Retrieved chunks + prompt → LLM
  4. LLM produces a grounded, context-aware answer

This becomes the Q&A brain of the system.
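
The retrieval and prompt-assembly steps of this workflow can be sketched as below. To keep the example self-contained, a plain in-memory list of `(text, vector)` pairs stands in for the ChromaDB collection; this is an assumption for illustration only, and `retrieve` and `build_prompt` are hypothetical names, not functions from the repository or from ChromaDB.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vector, store, k=3):
    """Return the k chunk texts most similar to the query.

    store is a list of (chunk_text, vector) pairs, standing in for
    the vector database.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(f"[{i}] {chunk}"
                          for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The prompt returned by `build_prompt` would then be sent to whichever LLM the system uses. Numbering the chunks lets the model's answer refer back to specific passages.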


🔍 Deep Dive: Why LexRank Instead of TextRank?

“TextRank relies on word overlap, which fails in legal documents where long sentences share common boilerplate words (‘plaintiff’, ‘court’, ‘order’). LexRank uses TF-IDF + cosine similarity, making rare legal terms more influential and finding the true centroid sentence that represents the document’s core meaning.”

Benefits:

  • ✔ Highlights rare but meaningful legal terms
  • ✔ Identifies the central holding / verdict
  • ✔ Avoids selecting long meaningless sentences
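
To make the TF-IDF weighting concrete, here is a simplified sketch of continuous LexRank in pure Python. It is not the repository's `lexrank_extractor.py`: the tokenizer, the unsmoothed IDF, the omission of self-similarity, and the names `tfidf_vectors`, `lexrank`, and `top_sentences` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict) per sentence."""
    docs = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # A term found in every sentence gets IDF log(1) = 0, so boilerplate
    # words contribute nothing to similarity.
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def lexrank(sentences, damping=0.85, iterations=50):
    """Centrality score per sentence via damped power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = tfidf_vectors(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    rows = []
    for row in sim:
        total = sum(row)
        # A sentence similar to nothing spreads its weight uniformly.
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * rows[j][i] for j in range(n))
                  for i in range(n)]
    return scores

def top_sentences(sentences, k):
    """The k most central sentences, returned in document order."""
    scores = lexrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i],
                  reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

In this sketch, two sentences that share only common words such as "the" have zero similarity, while sentences sharing rarer terms such as "appellant" are strongly linked and so score as more central.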

🧠 Deep Dive: Hallucination Control via Hybrid Design

“Legal summarization must be hallucination-free. Pure abstractive models (like BART/GPT) may invent dates or sections if fed noisy inputs. LexRank extracts the top 20 factual sentences first, and BART is constrained to rewrite only those. This makes the summary polished but mathematically grounded.”

Why Hybrid?

  • ✔ Eliminates noise from 50+ page judgments
  • ✔ Improves factual consistency by limiting BART to the extracted sentences
  • ✔ Produces human-readable summaries with lower hallucination risk
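
The extract-then-rewrite flow can be sketched as below, assuming sentence scores come from an extractive ranker and that `rewrite` is any callable wrapping the abstractive model. Both function names are hypothetical. The `ungrounded_numbers` check is an illustrative extra that the original project does not describe; it is included only to show one simple way to test whether a rewrite stayed within the extract.

```python
import re

def hybrid_summarize(sentences, scores, rewrite, top_k=20):
    """Keep the top_k highest-scoring sentences, then rewrite only those.

    Returns (extract, summary). rewrite is a callable str -> str that
    stands in for the abstractive model.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    keep = sorted(ranked[:top_k])  # restore document order
    extract = " ".join(sentences[i] for i in keep)
    return extract, rewrite(extract)

def ungrounded_numbers(summary, extract):
    """Numbers that appear in the summary but not in the extract."""
    seen = set(re.findall(r"\d+", extract))
    return sorted(set(re.findall(r"\d+", summary)) - seen)
```

A non-empty result from `ungrounded_numbers` flags a date or section number that the abstractive step introduced on its own. Such a summary could be rejected or replaced with the extract itself.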

🏗️ Tech Stack

| Component | Technology |
|---|---|
| Embeddings | BAAI/bge-small-en |
| Extractive Summarization | LexRank |
| Abstractive Summarization | BART |
| Vector DB | ChromaDB |
| Similarity Metric | Cosine Similarity |
| RAG Pipeline | Custom implementation |

📁 Folder Structure

├── data/
├── preprocessing/
│   ├── toon_converter.py
│   └── chunker.py
├── embeddings/
│   └── embedder.py
├── taxonomy/
│   └── classifier.py
├── summarizer/
│   ├── lexrank_extractor.py
│   └── bart_summarizer.py
├── rag/
│   ├── chroma_store.py
│   └── retrieval.py
└── app/
    └── main.py

▶️ How It Works (In 30 Seconds)

  1. Convert raw legal PDF → TOON JSON
  2. Chunk + embed using BGE-Small
  3. Vector similarity → assign category
  4. LexRank → extract facts
  5. BART → abstract summary
  6. ChromaDB → store chunks
  7. Ask question → retrieve relevant chunks
  8. LLM → grounded answer

🎯 Ideal For

  • Legal-tech startups
  • Court document analysis
  • Compliance automation
  • Case-law retrieval systems
  • Enterprise search solutions
