Uploading the whole directory to GitHub wasn't possible because of size limitations and conflicts that arose during the push.
A Zero-Shot Semantic Pipeline for Classifying, Summarizing, and Querying Long Legal Documents
This project is an end-to-end AI workflow for legal intelligence:
- Zero-shot semantic classification
- Hybrid extractive + abstractive summarization
- RAG-powered question answering with vector search
Designed especially for long judgments, case files, and statutory documents.
Traditional classifiers need thousands of labeled samples. This system doesn’t.
It uses a Zero-Shot Semantic Classifier powered by embeddings.
- Convert raw text into structured TOON JSON format
- Chunk long documents (because BERT-style models have a 512-token limit)
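
A minimal sketch of the chunking step, assuming whitespace-separated words as a rough stand-in for model tokens. The function name `chunk_words` and the `max_words` / `overlap` defaults are illustrative and are not taken from `chunker.py`; a real implementation would count tokens with the embedding model's own tokenizer.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows.

    Word counts only approximate token counts, so max_words is kept well
    below 512 to leave headroom for subword splitting (an assumption).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.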
Using: BAAI/bge-small-en
- Strong results on the MTEB benchmark for a model of its size
- Light enough to run locally
- Often outperforms older OpenAI embedding models
- 65+ legal categories are embedded into vectors
- Compute cosine similarity of each chunk to each category
- Apply Max-Pooling Category Assignment:
If even one chunk strongly signals “Murder”, the whole document is classified as Murder.
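
The max-pooling rule can be sketched as follows. This is an illustration with plain Python lists standing in for BGE embeddings; `classify_document` and its arguments are hypothetical names, not the interface of `classifier.py`.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_document(
    chunk_vecs: Sequence[Sequence[float]],
    category_vecs: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the category whose best-matching chunk scores highest."""
    if not chunk_vecs or not category_vecs:
        raise ValueError("need at least one chunk and one category")
    best_label, best_score = "", -2.0  # cosine is never below -1
    for label, cat_vec in category_vecs.items():
        # Max-pooling: a category is scored by its single strongest chunk.
        score = max(cosine(chunk, cat_vec) for chunk in chunk_vecs)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

Taking the maximum rather than the mean is what lets one strongly matching chunk decide the label, even when most of the document is procedural boilerplate.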
Legal docs require accuracy + readability. To avoid hallucinations, the summarizer uses a hybrid pipeline:
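LexRank (extractive) captures the core facts by ranking sentences on their centrality within the document.
BART (abstractive) rewrites the extracted facts into a polished, readable summary.
The combination keeps the output fluent while tying it to sentences that actually appear in the source.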
All processed chunks are stored in a ChromaDB vector database.
Workflow:
- User asks a question
- System performs semantic retrieval
- Retrieved chunks + prompt → LLM
- LLM produces a grounded, context-aware answer
This becomes the Q&A brain of the system.
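
The retrieval and prompt-assembly steps of the workflow might look like the sketch below. To stay self-contained it uses an in-memory list of `(text, vector)` pairs as a stand-in for the ChromaDB collection, and toy vectors in place of real embeddings; `retrieve` and `build_prompt` are hypothetical names, and the prompt wording is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[str]:
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the chunks in the prompt makes it possible to ask the LLM to cite which passage supports each statement.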
“TextRank relies on word overlap, which fails in legal documents where long sentences share common boilerplate words (‘plaintiff’, ‘court’, ‘order’). LexRank uses TF-IDF + cosine similarity, making rare legal terms more influential and finding the true centroid sentence that represents the document’s core meaning.”
Benefits:
- ✔ Highlights rare but meaningful legal terms
- ✔ Identifies the central holding / verdict
- ✔ Avoids selecting long meaningless sentences
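
To make the TF-IDF plus cosine idea concrete, here is a simplified, pure-Python variant of continuous LexRank. It is a sketch under stated assumptions, not the code in `lexrank_extractor.py`: the tokenizer, the smoothed IDF formula, the damping value, and the choice to ignore self-similarity are all illustrative.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per sentence."""
    docs = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # sentence frequency
    # Smoothed IDF: terms that occur in few sentences get larger weights.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def sparse_cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences: List[str], damping: float = 0.85, iterations: int = 50) -> List[float]:
    """Score sentences by centrality in the sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = tfidf_vectors(sentences)
    sim = [[sparse_cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalise into transition probabilities; isolated sentences jump uniformly.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):  # power iteration, PageRank style
        scores = [
            (1 - damping) / n + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores
```

Sentences that share weighted vocabulary with many others accumulate score, while a sentence with no overlap only receives the uniform baseline.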
“Legal summarization must be hallucination-free. Pure abstractive models (like BART/GPT) may invent dates or sections if fed noisy inputs. LexRank extracts the top 20 factual sentences first, and BART is constrained to rewrite only those. This makes the summary polished but mathematically grounded.”
Why Hybrid?
- ✔ Eliminates noise from 50+ page judgments
- ✔ Improves factual consistency by limiting BART's input to extracted sentences
- ✔ Produces human-readable summaries with a lower risk of hallucination
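
The hand-off between the two stages can be sketched like this. The centrality scores are assumed to come from the extractive step, and `rewrite` is a hypothetical callable where a BART wrapper would be plugged in; neither name reflects the actual module interfaces. The default of 20 sentences follows the description above.

```python
from typing import Callable, List, Sequence


def select_top_sentences(
    sentences: Sequence[str], scores: Sequence[float], top_n: int = 20
) -> List[str]:
    """Keep the top_n highest-scoring sentences, restored to document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


def hybrid_summary(
    sentences: Sequence[str],
    scores: Sequence[float],
    rewrite: Callable[[str], str],
    top_n: int = 20,
) -> str:
    """Extract first, then let the abstractive model see only the extract."""
    extract = " ".join(select_top_sentences(sentences, scores, top_n))
    return rewrite(extract)  # the model never sees the discarded sentences
```

Restoring document order before rewriting keeps the chronology of the judgment intact, which matters when the facts are narrated in sequence.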
| Component | Technology |
|---|---|
| Embeddings | BAAI/bge-small-en |
| Extractive Summarization | LexRank |
| Abstractive Summarization | BART |
| Vector DB | ChromaDB |
| Similarity Metric | Cosine Similarity |
| RAG Pipeline | Custom implementation |
├── data/
├── preprocessing/
│   ├── toon_converter.py
│   └── chunker.py
├── embeddings/
│   └── embedder.py
├── taxonomy/
│   └── classifier.py
├── summarizer/
│   ├── lexrank_extractor.py
│   └── bart_summarizer.py
├── rag/
│   ├── chroma_store.py
│   └── retrieval.py
└── app/
    └── main.py
- Convert raw legal PDF → TOON JSON
- Chunk + embed using BGE-Small
- Vector similarity → assign category
- LexRank → extract facts
- BART → abstract summary
- ChromaDB → store chunks
- Ask question → retrieve relevant chunks
- LLM → grounded answer
- Legal-tech startups
- Court document analysis
- Compliance automation
- Case-law retrieval systems
- Enterprise search solutions




