satzgits/Aete_legal_text_simplification-

🧠 AETE - Administrative English Translation Engine

A specialized NLP model for simplifying Indian legal and bureaucratic texts

Python 3.11+ | PyTorch | License: MIT

🎯 Project Overview

AETE converts complex bureaucratic, legal, and administrative English into simple, easy-to-understand language. The model is specifically fine-tuned on Indian legal texts including the Constitution, IPC, and CrPC.

Example:

  • Input: "The regulatory framework's multilevel governance model exhibits implementation challenges in resource-constrained environments."
  • Output: "The regulatory framework's multilevel governance model shows challenges in managing resources."

📊 Model Performance

| Metric     | Score    | Status          |
|------------|----------|-----------------|
| BLEU Score | 95.9/100 | 🏆 Excellent    |
| ROUGE-1    | 0.982    | 🏆 Near-perfect |
| ROUGE-2    | 0.970    | 🏆 Near-perfect |
| ROUGE-L    | 0.982    | 🏆 Near-perfect |

Training Dataset: 736 text pairs

  • Indian Constitution (500 pairs)
  • IPC - Indian Penal Code
  • CrPC - Criminal Procedure Code
  • Policy documents (200 pairs)
  • Bureaucratic texts (38 pairs)

💻 Technical Specifications

  • Model: T5-small (60M parameters)
  • Framework: PyTorch + HuggingFace Transformers
  • Training: GPU with FP16 mixed precision
  • Training Time: ~5 minutes on RTX 4070
  • Inference Speed: <1 second per text

🚀 Quick Start

Installation

# Clone repository into a folder named NLP_project
git clone <your-repo-url> NLP_project
cd NLP_project

# Install dependencies
pip install -r requirements.txt

Training the Model

# Step 1: Generate training data
python create_training_data.py

# Step 2: Train model (about 5 minutes on a GPU)
python train_final.py

# Step 3: Evaluate performance
python evaluate.py

Running the Demo

# Launch Gradio web interface
python demo.py

Then open http://localhost:7860 in your browser.

📁 Project Structure

NLP_project/
├── demo.py                          # Gradio web interface
├── train_final.py                   # Main training script
├── evaluate.py                      # Evaluation metrics (BLEU, ROUGE)
├── create_training_data.py          # Data preparation
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
└── aete_legal_model/                # Trained model (download separately)

🎯 Key Features

What Makes AETE Different from ChatGPT?

  1. Domain-Specific - Fine-tuned exclusively on Indian legal and bureaucratic texts
  2. Local & Private - Runs on your own hardware; no internet connection needed at inference time
  3. Consistent - Produces the same style of simplification across inputs
  4. Fast - Under 1 second of inference time per text
  5. Specialized Vocabulary - Optimized for Indian legal terminology
  6. Measurable - Scores BLEU 95.9 on the project's own legal simplification evaluation

📈 Training Process

The model was trained in three phases:

  1. Phase 1: Initial training on 38 bureaucratic text pairs (BLEU 65)
  2. Phase 2: Expanded with 200 policy document pairs (BLEU 92)
  3. Phase 3: Added 500 legal text pairs from the Indian legal text archive (BLEU 95.9)

Total training time: ~10 minutes across all phases

🔧 Requirements

torch>=2.0.0
transformers>=4.30.0
gradio>=4.0.0
pandas>=2.0.0
scikit-learn>=1.3.0
sacrebleu>=2.3.0
rouge-score>=0.1.2
textstat>=0.7.3
tqdm>=4.65.0

📊 Evaluation Metrics Explained

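  • BLEU Score: N-gram overlap between the model output and the reference simplification (0-100, higher is better)
  • ROUGE-1: Unigram (single word) overlap with the reference (0-1, higher is better)
  • ROUGE-2: Bigram (word pair) overlap with the reference (0-1, higher is better)
  • ROUGE-L: Longest common subsequence shared with the reference (0-1, higher is better)
  • Readability: Change in Flesch Reading Ease from input to output (a higher score means easier to read)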
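
To make the overlap metrics above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 on whitespace tokens. It is a simplified illustration, not the code in evaluate.py, which presumably uses the rouge-score package from the requirements. The function names are hypothetical, and library implementations tokenize differently, so their scores may differ slightly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the court may grant bail"
reference = "the court can grant bail"
print(round(rouge_n(prediction, reference, 1), 3))  # 0.8
print(round(rouge_n(prediction, reference, 2), 3))  # 0.5
print(round(rouge_l(prediction, reference), 3))     # 0.8
```

In this toy pair, one substituted word costs one of five unigrams but two of four bigrams, which is why ROUGE-2 usually sits below ROUGE-1.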

🎓 Use Cases

  • Simplifying government notifications
  • Making legal documents accessible
  • Converting policy documents to plain language
  • Helping citizens understand bureaucratic text
  • Legal education and training

📝 Model Architecture

T5-small Encoder-Decoder Architecture
├── Encoder: 6 layers, 512 hidden size
├── Decoder: 6 layers, 512 hidden size  
├── Parameters: 60,506,624
└── Vocabulary: 32,128 tokens
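
The parameter count above can be reproduced by hand. The sketch below is a worked calculation rather than project code. The layer count, hidden size, and vocabulary size come from this README; the feed-forward size, head count, per-head size, and relative-position bucket count are assumptions taken from the standard public T5-small configuration.

```python
# Values listed in this README.
D_MODEL = 512        # hidden size
N_LAYERS = 6         # layers in each of encoder and decoder
VOCAB = 32128        # vocabulary size

# Assumed values from the standard T5-small configuration.
D_FF = 2048          # feed-forward inner size
N_HEADS = 8          # attention heads
D_KV = 64            # per-head key/value size
REL_BUCKETS = 32     # relative-position buckets

attention = 4 * D_MODEL * N_HEADS * D_KV   # q, k, v, o projections, no biases
feed_forward = 2 * D_MODEL * D_FF          # wi and wo projections, no biases
layer_norm = D_MODEL                       # T5 layer norm has a scale only
rel_bias = REL_BUCKETS * N_HEADS           # stored once per stack

# Encoder layer: self-attention + feed-forward, each with a layer norm.
encoder_layer = attention + feed_forward + 2 * layer_norm
# Decoder layer: self-attention + cross-attention + feed-forward.
decoder_layer = 2 * attention + feed_forward + 3 * layer_norm

embedding = VOCAB * D_MODEL                # shared with the output head (tied)
encoder = N_LAYERS * encoder_layer + rel_bias + layer_norm   # + final norm
decoder = N_LAYERS * decoder_layer + rel_bias + layer_norm   # + final norm

total = embedding + encoder + decoder
print(f"{total:,}")  # 60,506,624
```

The shared embedding table accounts for roughly 16.4M of the 60.5M parameters, so more than a quarter of the model is vocabulary.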

🔬 Training Details

  • Optimizer: AdamW
  • Learning Rate: 1.5e-4 with warmup
  • Batch Size: 4 (effective 16 with gradient accumulation)
  • Mixed Precision: FP16
  • Max Sequence Length: 256 tokens
  • Training Steps: 600
  • Train/Validation Split: 80/20
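
The settings above imply a gradient accumulation factor and a rough epoch count. The sketch below works through that arithmetic. It assumes the 80/20 split is applied to all 736 pairs and that one "training step" means one optimizer update, neither of which this README states.

```python
import math

DATASET_PAIRS = 736      # from the training dataset summary
TRAIN_FRACTION = 0.8     # 80/20 split
PER_DEVICE_BATCH = 4
EFFECTIVE_BATCH = 16
TRAINING_STEPS = 600

# Gradient accumulation steps needed to reach the effective batch size.
accumulation_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH

# Assumption: the split covers all 736 pairs, with the train side rounded down.
train_pairs = math.floor(DATASET_PAIRS * TRAIN_FRACTION)
val_pairs = DATASET_PAIRS - train_pairs

# Assumption: one training step is one optimizer update (16 examples).
steps_per_epoch = math.ceil(train_pairs / EFFECTIVE_BATCH)
approx_epochs = TRAINING_STEPS / steps_per_epoch

print(accumulation_steps)        # 4
print(train_pairs, val_pairs)    # 588 148
print(steps_per_epoch)           # 37
print(round(approx_epochs, 1))   # 16.2
```

Under these assumptions the model sees each training pair about 16 times.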

📦 Model Download

Due to file size limitations, the trained model is not included in this repository.

To use the pre-trained model:

  1. Download from [model link]
  2. Extract to ./aete_legal_model/
  3. Run python demo.py

Or train your own model using train_final.py (~5 minutes on GPU).
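
Once the model folder is in place, it can likely be loaded directly with the Transformers API, without going through the demo. The sketch below assumes the checkpoint was saved with save_pretrained. The "simplify: " task prefix and the function names are hypothetical, so match the prefix to whatever train_final.py actually uses.

```python
MODEL_DIR = "./aete_legal_model"   # extraction path from the steps above
TASK_PREFIX = "simplify: "         # hypothetical; must match the training prefix
MAX_LENGTH = 256                   # max sequence length from the training details


def build_prompt(text, prefix=TASK_PREFIX):
    """Collapse whitespace and prepend the (assumed) task prefix."""
    return prefix + " ".join(text.split())


def simplify(text, model_dir=MODEL_DIR):
    """Load the fine-tuned checkpoint and generate a simplified version."""
    # Imported lazily so build_prompt works without the heavy dependencies.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(
        build_prompt(text),
        return_tensors="pt",
        truncation=True,
        max_length=MAX_LENGTH,
    )
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=MAX_LENGTH, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling simplify("The applicant shall furnish the requisite documents forthwith.") reloads the model on every call, which is fine for a quick check. For repeated use, load the tokenizer and model once and reuse them.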

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • Adding more legal domain data
  • Supporting regional languages
  • Improving handling of very long documents
  • Creating additional evaluation metrics

📄 License

MIT License - see LICENSE file for details

👥 Authors

Created as part of an NLP course project at Mahindra University

🙏 Acknowledgments

  • HuggingFace Transformers library
  • Indian legal text archive from Kaggle
  • T5 model by Google Research

📧 Contact

For questions or issues, please open a GitHub issue.


⭐ Star this repo if you find it useful!
