A specialized NLP model for simplifying Indian legal and bureaucratic texts
AETE converts complex bureaucratic, legal, and administrative English into simple, easy-to-understand language. The model is specifically fine-tuned on Indian legal texts including the Constitution, IPC, and CrPC.
Example:
- Input: "The regulatory framework's multilevel governance model exhibits implementation challenges in resource-constrained environments."
- Output: "The regulatory framework's multilevel governance model shows challenges in managing resources."
| Metric | Score | Status |
|---|---|---|
| BLEU Score | 95.9/100 | 🏆 Excellent |
| ROUGE-1 | 0.982 | 🏆 Near-perfect |
| ROUGE-2 | 0.970 | 🏆 Near-perfect |
| ROUGE-L | 0.982 | 🏆 Near-perfect |
Training Dataset: 736 text pairs
- Indian Constitution (500 pairs)
- IPC - Indian Penal Code
- CrPC - Criminal Procedure Code
- Policy documents (200 pairs)
- Bureaucratic texts (38 pairs)
- Model: T5-small (60M parameters)
- Framework: PyTorch + HuggingFace Transformers
- Training: GPU with FP16 mixed precision
- Training Time: ~5 minutes on RTX 4070
- Inference Speed: <1 second per text
# Clone repository
git clone <your-repo-url>
cd NLP_project
# Install dependencies
pip install -r requirements.txt
# Step 1: Generate training data
python create_training_data.py
# Step 2: Train model (fast - 5 minutes)
python train_final.py
# Step 3: Evaluate performance
python evaluate.py
# Launch Gradio web interface
python demo.py

Then open http://localhost:7860 in your browser.
NLP_project/
├── demo.py # Gradio web interface
├── train_final.py # Main training script
├── evaluate.py # Evaluation metrics (BLEU, ROUGE)
├── create_training_data.py # Data preparation
├── requirements.txt # Python dependencies
├── README.md # This file
└── aete_legal_model/ # Trained model (download separately)
- Domain-Specific - Trained exclusively on Indian legal/bureaucratic texts
- Local & Private - Runs on your own hardware, no internet needed
- Consistent - Always produces the same style of simplification
- Fast - <1 second inference time
- Specialized Vocabulary - Optimized for Indian legal terminology
- Measurable - BLEU 95.9 specifically for legal simplification
The model was trained in three phases:
- Phase 1: Initial training on 38 bureaucratic text pairs (BLEU 65)
- Phase 2: Expanded with 200 policy documents (BLEU 92)
- Phase 3: Added 500 legal texts from an Indian legal text archive (BLEU 95.9)
Total training time: ~10 minutes across all phases
torch>=2.0.0
transformers>=4.30.0
gradio>=4.0.0
pandas>=2.0.0
scikit-learn>=1.3.0
sacrebleu>=2.3.0
rouge-score>=0.1.2
textstat>=0.7.3
tqdm>=4.65.0
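
Before training or running the demo, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch using only the standard library; the helper names (`parse_requirement`, `version_tuple`, `check_requirements`) are hypothetical and not part of this repository, and the version comparison is deliberately naive (it only looks at the leading digits of the first three components).

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list above.
REQUIREMENTS = """
torch>=2.0.0
transformers>=4.30.0
gradio>=4.0.0
pandas>=2.0.0
scikit-learn>=1.3.0
sacrebleu>=2.3.0
rouge-score>=0.1.2
textstat>=0.7.3
tqdm>=4.65.0
"""


def parse_requirement(line):
    """Split 'name>=minimum' into (name, minimum)."""
    name, minimum = line.strip().split(">=")
    return name, minimum


def version_tuple(version):
    """Turn '2.1.0+cu121' into (2, 1, 0); naive, ignores pre-release tags."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    # Pad so that '4.30' compares equal to '4.30.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Return {package: 'ok' | 'missing' | 'too old (...)'}."""
    report = {}
    for line in requirements.split():
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package:15s} {status}")
```

For anything beyond a quick sanity check, `pip check` or the `packaging` library's version parsing is likely the more robust choice.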
- BLEU Score: Measures n-gram overlap between the model output and the reference simplification (0-100, higher is better)
- ROUGE-1: Unigram (word) overlap with reference (0-1)
- ROUGE-2: Bigram (word pair) overlap (0-1)
- ROUGE-L: Longest common subsequence (0-1)
- Readability: Improvement in Flesch Reading Ease from the input text to the simplified output
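
To make the ROUGE definitions above concrete, here is a small pure-Python sketch that computes ROUGE-1, ROUGE-2, and ROUGE-L as F1 scores over whitespace-separated, lowercased tokens. It is an illustration only: it is not the code in `evaluate.py`, and the `rouge-score` package applies its own tokenization (and optional stemming), so its numbers may differ slightly. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    output = "the court may grant bail"
    reference = "the court can grant bail"
    print("ROUGE-1:", round(rouge_n(output, reference, 1), 3))
    print("ROUGE-2:", round(rouge_n(output, reference, 2), 3))
    print("ROUGE-L:", round(rouge_l(output, reference), 3))
```

In this example four of five words match (ROUGE-1 and ROUGE-L of 0.8), but only two of four word pairs match (ROUGE-2 of 0.5), which shows why ROUGE-2 is the stricter of the three.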
- Simplifying government notifications
- Making legal documents accessible
- Converting policy documents to plain language
- Helping citizens understand bureaucratic text
- Legal education and training
T5-small Encoder-Decoder Architecture
├── Encoder: 6 layers, 512 hidden size
├── Decoder: 6 layers, 512 hidden size
├── Parameters: 60,506,624
└── Vocabulary: 32,128 tokens
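
The parameter count above can be reproduced by hand from the architecture. The sketch below does the arithmetic in plain Python. The layer count, hidden size, and vocabulary size come from this README; the feed-forward size (2048), head count (8), per-head dimension (64), and number of relative-attention buckets (32) are assumptions taken from the publicly released `t5-small` configuration, as is the fact that the output head shares weights with the input embedding.

```python
# Values stated in this README.
VOCAB = 32_128
D_MODEL = 512
NUM_LAYERS = 6

# ASSUMED from the public t5-small config (not stated in this README).
D_FF = 2_048
NUM_HEADS = 8
D_KV = 64
REL_BUCKETS = 32

# T5 layer norm has a scale vector only (no bias).
LAYER_NORM = D_MODEL


def attention_params():
    """Query, key, value, and output projections; T5 uses no bias terms."""
    inner = NUM_HEADS * D_KV
    return 3 * D_MODEL * inner + inner * D_MODEL


def feed_forward_params():
    """Two bias-free linear layers: d_model -> d_ff -> d_model."""
    return D_MODEL * D_FF + D_FF * D_MODEL


def stack_params(attention_blocks_per_layer):
    """Parameters in an encoder (1 attention block) or decoder (2) stack."""
    per_layer = (
        attention_blocks_per_layer * (attention_params() + LAYER_NORM)
        + feed_forward_params()
        + LAYER_NORM
    )
    # Relative position bias lives only in the first layer's self-attention.
    relative_bias = REL_BUCKETS * NUM_HEADS
    final_norm = LAYER_NORM
    return NUM_LAYERS * per_layer + relative_bias + final_norm


# One embedding matrix shared by encoder, decoder, and the tied output head.
embedding = VOCAB * D_MODEL
encoder = stack_params(attention_blocks_per_layer=1)
decoder = stack_params(attention_blocks_per_layer=2)  # self + cross attention
total = embedding + encoder + decoder

print(f"embedding: {embedding:,}")
print(f"encoder:   {encoder:,}")
print(f"decoder:   {decoder:,}")
print(f"total:     {total:,}")  # 60,506,624
```

Roughly a quarter of the parameters sit in the shared embedding table, and the decoder is larger than the encoder because each decoder layer carries an extra cross-attention block.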
- Optimizer: AdamW
- Learning Rate: 1.5e-4 with warmup
- Batch Size: 4 (effective 16 with gradient accumulation)
- Mixed Precision: FP16
- Max Sequence Length: 256 tokens
- Training Steps: 600
- Train/Validation Split: 80/20
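
The settings above imply a few derived quantities, such as the number of gradient accumulation steps and roughly how many passes over the data the run makes. The following sketch works them out. It assumes that the 600 training steps are optimizer steps, that the 80/20 split is applied to the 736 pairs stated earlier, and that the training portion is rounded down; none of this is confirmed by `train_final.py`, and the `TrainingSetup` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    # Values taken from this README.
    total_pairs: int = 736
    train_fraction: float = 0.8
    per_step_batch_size: int = 4
    effective_batch_size: int = 16
    optimizer_steps: int = 600  # ASSUMPTION: counted as optimizer updates

    @property
    def accumulation_steps(self):
        """Micro-batches accumulated before each optimizer update."""
        return self.effective_batch_size // self.per_step_batch_size

    @property
    def train_pairs(self):
        """Training portion of the split (ASSUMPTION: rounded down)."""
        return int(self.total_pairs * self.train_fraction)

    @property
    def validation_pairs(self):
        return self.total_pairs - self.train_pairs

    @property
    def examples_seen(self):
        """Total training examples processed over the whole run."""
        return self.optimizer_steps * self.effective_batch_size

    @property
    def approx_epochs(self):
        """Approximate number of passes over the training portion."""
        return self.examples_seen / self.train_pairs


if __name__ == "__main__":
    setup = TrainingSetup()
    print("gradient accumulation steps:", setup.accumulation_steps)
    print("train / validation pairs:   ", setup.train_pairs, "/", setup.validation_pairs)
    print("examples seen:              ", setup.examples_seen)
    print("approximate epochs:         ", round(setup.approx_epochs, 1))
```

Under these assumptions the model sees each training pair about 16 times, which is worth keeping in mind when interpreting scores on a dataset of this size.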
Due to file size limitations, the trained model is not included in this repository.
To use the pre-trained model:
- Download from [model link]
- Extract to ./aete_legal_model/
- Run python demo.py
Or train your own model using train_final.py (~5 minutes on GPU).
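
Once the model directory is in place, it can also be called from Python without the Gradio demo. The sketch below uses the standard HuggingFace `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes and assumes the model was saved in the usual `save_pretrained` layout. The task prefix `"simplify: "`, the beam size, and the helper names are assumptions for illustration; check `train_final.py` or `demo.py` for the prefix and generation settings actually used.

```python
from pathlib import Path

MODEL_DIR = Path("./aete_legal_model")  # location given in this README
TASK_PREFIX = "simplify: "              # ASSUMPTION: prefix used during fine-tuning
MAX_LENGTH = 256                        # matches the max sequence length above


def build_input(text, prefix=TASK_PREFIX):
    """Collapse whitespace and prepend the (assumed) task prefix."""
    return prefix + " ".join(text.split())


def simplify(text, model_dir=MODEL_DIR):
    """Load the fine-tuned model and return a simplified version of `text`."""
    # Imported lazily so build_input() works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
    model = AutoModelForSeq2SeqLM.from_pretrained(str(model_dir))
    model.eval()

    inputs = tokenizer(
        build_input(text),
        return_tensors="pt",
        truncation=True,
        max_length=MAX_LENGTH,
    )
    with torch.no_grad():
        # num_beams=4 is an illustrative choice, not the project's setting.
        output_ids = model.generate(**inputs, max_length=MAX_LENGTH, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    sample = (
        "The regulatory framework's multilevel governance model exhibits "
        "implementation challenges in resource-constrained environments."
    )
    if MODEL_DIR.is_dir():
        print(simplify(sample))
    else:
        print(f"{MODEL_DIR} not found; download or train the model first.")
```

This version reloads the model on every call to keep the example short; for repeated use, load the tokenizer and model once and reuse them.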
Contributions are welcome! Areas for improvement:
- Adding more legal domain data
- Supporting regional languages
- Improving handling of very long documents
- Creating additional evaluation metrics
MIT License - see LICENSE file for details
Created as part of an NLP course project at Mahindra University.
- HuggingFace Transformers library
- Indian legal text archive from Kaggle
- T5 model by Google Research
For questions or issues, please open a GitHub issue.
⭐ Star this repo if you find it useful!