🧠 AETE - Administrative English Translation Engine

A specialized NLP model for simplifying Indian legal and bureaucratic texts

🎯 Project Overview

AETE converts complex bureaucratic, legal, and administrative English into simple, easy-to-understand language. The model is specifically fine-tuned on Indian legal texts including the Constitution, IPC, and CrPC.

Example:

Input: "The regulatory framework's multilevel governance model exhibits implementation challenges in resource-constrained environments."
Output: "The regulatory framework's multilevel governance model shows challenges in managing resources."
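
One way to put a number on the example above is Flesch Reading Ease, the readability measure this README lists among its metrics. The sketch below is a minimal, dependency-free approximation: the syllable counter is a rough vowel-run heuristic chosen for illustration (an assumption, not the project's evaluation code), so absolute scores will differ from a dedicated library and only the direction of change is meaningful.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per run of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores indicate easier text.

    Expects non-empty text containing at least one word.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# The input/output pair quoted in the example above.
source = (
    "The regulatory framework's multilevel governance model exhibits "
    "implementation challenges in resource-constrained environments."
)
simplified = (
    "The regulatory framework's multilevel governance model shows "
    "challenges in managing resources."
)

print(f"source:     {flesch_reading_ease(source):.1f}")
print(f"simplified: {flesch_reading_ease(simplified):.1f}")
```

Both sentences are dense, so both scores come out low under this heuristic, but the simplified version scores higher, which is the improvement the readability metric is meant to capture.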

📊 Model Performance

Metric	Score	Status
BLEU Score	95.9/100	🏆 Excellent
ROUGE-1	0.982	🏆 Near-perfect
ROUGE-2	0.970	🏆 Near-perfect
ROUGE-L	0.982	🏆 Near-perfect

Training Dataset: 736 text pairs

Indian Constitution (500 pairs)
IPC - Indian Penal Code
CrPC - Criminal Procedure Code
Policy documents (200 pairs)
Bureaucratic texts (38 pairs)

💻 Technical Specifications

Model: T5-small (60M parameters)
Framework: PyTorch + HuggingFace Transformers
Training: GPU with FP16 mixed precision
Training Time: ~5 minutes on RTX 4070
Inference Speed: <1 second per text

🚀 Quick Start

Installation

# Clone repository
git clone <your-repo-url>
cd NLP_project

# Install dependencies
pip install -r requirements.txt

Training the Model

# Step 1: Generate training data
python create_training_data.py

# Step 2: Train model (about 5 minutes on a GPU)
python train_final.py

# Step 3: Evaluate performance
python evaluate.py

Running the Demo

# Launch Gradio web interface
python demo.py

Then open http://localhost:7860 in your browser.

📁 Project Structure

NLP_project/
├── demo.py                          # Gradio web interface
├── train_final.py                   # Main training script
├── evaluate.py                      # Evaluation metrics (BLEU, ROUGE)
├── create_training_data.py          # Data preparation
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
└── aete_legal_model/                # Trained model (download separately)

🎯 Key Features

What Makes AETE Different from ChatGPT?

Domain-Specific - Trained exclusively on Indian legal/bureaucratic texts
Local & Private - Runs on your own hardware; no internet connection needed once the model is on disk
Consistent - Always produces the same style of simplification
Fast - <1 second inference time
Specialized Vocabulary - Optimized for Indian legal terminology
Measurable - Scores BLEU 95.9 on this project's legal simplification data

📈 Training Process

The model was trained in three phases:

Phase 1: Initial training on 38 bureaucratic text pairs (BLEU 65)
Phase 2: Expanded with 200 policy documents (BLEU 92)
Phase 3: Added 500 legal texts from an Indian legal text archive (BLEU 95.9)

Total training time: ~10 minutes across all phases

🔧 Requirements

torch>=2.0.0
transformers>=4.30.0
gradio>=4.0.0
pandas>=2.0.0
scikit-learn>=1.3.0
sacrebleu>=2.3.0
rouge-score>=0.1.2
textstat>=0.7.3
tqdm>=4.65.0

📊 Evaluation Metrics Explained

BLEU Score: Measures n-gram overlap between the model output and the reference simplification (0-100, higher is better)
ROUGE-1: Unigram (word) overlap with reference (0-1)
ROUGE-2: Bigram (word pair) overlap (0-1)
ROUGE-L: Longest common subsequence (0-1)
Readability: Flesch Reading Ease improvement
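
To make the ROUGE definitions above concrete, here is a minimal pure-Python sketch that computes ROUGE-1, ROUGE-2, and ROUGE-L as F1 scores over whitespace tokens. It is an illustration only: the README does not say whether its reported scores are precision, recall, or F1, and the two sentences below are hypothetical examples, not project data. The rouge-score package listed in the requirements applies its own tokenization, so its numbers may differ slightly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, ref_total, cand_total):
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n):
    """ROUGE-N F1: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return f1(lcs_length(ref, cand), len(ref), len(cand))


# Hypothetical sentences, for illustration only.
reference = "the court may grant bail"
candidate = "the court can grant bail"

print(f"ROUGE-1: {rouge_n(reference, candidate, 1):.3f}")
print(f"ROUGE-2: {rouge_n(reference, candidate, 2):.3f}")
print(f"ROUGE-L: {rouge_l(reference, candidate):.3f}")
```

A single changed word costs one of five unigrams but two of four bigrams, which is why ROUGE-2 is usually the strictest of the three.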

🎓 Use Cases

Simplifying government notifications
Making legal documents accessible
Converting policy documents to plain language
Helping citizens understand bureaucratic text
Legal education and training

📝 Model Architecture

T5-small Encoder-Decoder Architecture
├── Encoder: 6 layers, 512 hidden size
├── Decoder: 6 layers, 512 hidden size  
├── Parameters: 60,506,624
└── Vocabulary: 32,128 tokens
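
The parameter count above can be reproduced from the layer dimensions. The sketch below is a back-of-the-envelope tally: the hidden size, layer count, and vocabulary size come from this README, while the feed-forward width (2048), head count (8), per-head dimension (64), and relative-position bucket count (32) are taken from the public t5-small configuration and are not stated here. It also assumes bias-free linear layers and an output projection tied to the input embedding, as in the standard T5-small checkpoint.

```python
# Values stated in this README
d_model = 512       # hidden size
num_layers = 6      # per stack (encoder and decoder)
vocab_size = 32128

# Values from the public t5-small configuration (not stated in this README)
d_ff = 2048         # feed-forward inner width
num_heads = 8
d_kv = 64           # per-head key/value dimension
rel_buckets = 32    # relative position buckets

inner = num_heads * d_kv                 # attention projection width
attention = 4 * d_model * inner          # q, k, v, o projections, no biases
feed_forward = 2 * d_model * d_ff        # wi and wo, no biases
layer_norm = d_model                     # scale vector only
rel_bias = rel_buckets * num_heads       # held by the first layer of each stack

embedding = vocab_size * d_model         # shared, and tied to the output head

# Encoder layer: self-attention + feed-forward, each with its own layer norm.
encoder = (
    num_layers * (attention + layer_norm + feed_forward + layer_norm)
    + rel_bias
    + layer_norm                         # final layer norm
)

# Decoder layer: self-attention + cross-attention + feed-forward.
decoder = (
    num_layers * (2 * (attention + layer_norm) + feed_forward + layer_norm)
    + rel_bias
    + layer_norm                         # final layer norm
)

total = embedding + encoder + decoder
print(f"{total:,}")  # 60,506,624
```

Roughly a quarter of the parameters sit in the embedding table, which is one reason the vocabulary size matters for a model this small.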

🔬 Training Details

Optimizer: AdamW
Learning Rate: 1.5e-4 with warmup
Batch Size: 4 per step (effective 16 with 4 gradient accumulation steps)
Mixed Precision: FP16
Max Sequence Length: 256 tokens
Training Steps: 600
Train/Validation Split: 80/20
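
The figures above imply a rough training schedule, which the following arithmetic sketch works out. Two assumptions are mine rather than the README's: that the 600 training steps count optimizer updates (not individual micro-batches), and that the split is applied to all 736 pairs with the training share rounded down. If either differs in train_final.py, the epoch estimate changes accordingly.

```python
import math

# Figures stated in this README
total_pairs = 736
train_fraction = 0.8      # 80/20 split
micro_batch = 4           # batch size per forward/backward pass
effective_batch = 16      # after gradient accumulation
optimizer_steps = 600     # ASSUMPTION: counted as optimizer updates

# Micro-batches accumulated before each optimizer update
grad_accum_steps = effective_batch // micro_batch

# ASSUMPTION: training share is rounded down
train_pairs = int(total_pairs * train_fraction)
val_pairs = total_pairs - train_pairs

steps_per_epoch = math.ceil(train_pairs / effective_batch)
approx_epochs = optimizer_steps / steps_per_epoch

print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"train / validation pairs:    {train_pairs} / {val_pairs}")
print(f"optimizer steps per epoch:   {steps_per_epoch}")
print(f"approximate epochs:          {approx_epochs:.1f}")
```

Under these assumptions the model sees each training pair about sixteen times, which is worth keeping in mind when reading scores measured on a small dataset.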

📦 Model Download

Due to file size limitations, the trained model is not included in this repository.

To use the pre-trained model:

Download from [model link]
Extract to ./aete_legal_model/
Run python demo.py

Or train your own model using train_final.py (~5 minutes on GPU).
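
Once the model directory is in place, it can also be called from Python without the web interface. The following is a minimal sketch, not the project's demo.py: it assumes ./aete_legal_model/ holds a standard HuggingFace checkpoint with its tokenizer files, and the "simplify: " task prefix and the beam size are hypothetical choices that should be replaced with whatever train_final.py actually used. The 256-token limit matches the maximum sequence length listed under Training Details.

```python
MODEL_DIR = "./aete_legal_model"

# ASSUMPTION: hypothetical task prefix; use the one the model was trained with.
TASK_PREFIX = "simplify: "


def build_prompt(text: str) -> str:
    """Collapse runs of whitespace and prepend the task prefix."""
    return TASK_PREFIX + " ".join(text.split())


def simplify(text: str, model_dir: str = MODEL_DIR, max_length: int = 256) -> str:
    """Generate a plain-language version of `text` with the local model."""
    # Imported here so build_prompt stays usable without the ML stack installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    inputs = tokenizer(
        build_prompt(text),
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    )
    # num_beams=4 is an illustrative choice, not a documented project setting.
    output_ids = model.generate(**inputs, max_new_tokens=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling simplify("...") with a sentence of legal text returns the generated simplification as a string. The function reloads the model on every call to keep the sketch short; for repeated use, load the tokenizer and model once and reuse them.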

🤝 Contributing

Contributions are welcome! Areas for improvement:

Adding more legal domain data
Supporting regional languages
Improving handling of very long documents
Creating additional evaluation metrics

📄 License

MIT License - see LICENSE file for details

👥 Authors

Created as part of an NLP course project at Mahindra University

🙏 Acknowledgments

HuggingFace Transformers library
Indian legal text archive from Kaggle
T5 model by Google Research

📧 Contact

For questions or issues, please open a GitHub issue.

⭐ Star this repo if you find it useful!
