An advanced system for reducing hallucinations in Large Language Models using RAG, Discriminator-Guided Reinforcement Learning, and robust data generation pipelines.
Features • Architecture • Installation • Quick Start • Documentation • Contributing
AI4Org is a comprehensive machine learning system designed to tackle one of the most critical challenges in modern AI: hallucinations in Large Language Models (LLMs). The project combines cutting-edge techniques including:
- Retrieval-Augmented Generation (RAG) with semantic search
- Multi-Discriminator Architecture for factuality, style, and safety validation
- REINFORCE-based Reinforcement Learning for model fine-tuning
- Automated Q&A Data Generation Pipeline using Qwen 7B
- Desktop Application with interactive chat interface
This system is particularly designed for organizational policy documents, banking regulations, and other domains where factual accuracy is paramount.
- RAG Integration: Retrieves relevant context using TF-IDF and sentence embeddings
- Triple Discriminator System:
- Factuality Discriminator: Validates factual accuracy against source documents
- Style Discriminator: Ensures professional and appropriate language
- Safety Discriminator: Filters unsafe or inappropriate content
- Reinforcement Learning: REINFORCE algorithm with discriminator-guided rewards
- Supervised Fine-Tuning (SFT): Initial training on Q&A pairs before RL
- Multi-GPU Support: Automatic DataParallel for distributed training
- Gradient Checkpointing: Memory-efficient training for large models
- Automated Q&A Generation: Uses Qwen 7B (4-bit quantized) for generating question-answer pairs
- Document Processing: Supports PDF and TXT files with advanced cleaning
- Smart Chunking: Configurable chunking with overlap for context preservation
- Evidence Extraction: Sentence-level evidence linking using semantic similarity
- Deduplication: Exact and semantic deduplication using FAISS
- Validation: Strict quality checks for section numbers, policy names, and acronyms
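As an illustration of the semantic deduplication step above, here is a minimal sketch using sentence-transformers and FAISS; the embedding model name is an assumption and the 0.85 threshold mirrors the pipeline default, so treat this as illustrative rather than the pipeline's actual code.

```python
# Minimal sketch of semantic deduplication with FAISS (illustrative only).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


def deduplicate(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Keep only questions that are not near-duplicates of earlier ones."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = np.asarray(
        model.encode(questions, normalize_embeddings=True), dtype="float32"
    )
    # Inner product on normalized vectors == cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    kept: list[str] = []
    for text, vector in zip(questions, embeddings):
        vector = vector.reshape(1, -1)
        if index.ntotal > 0:
            scores, _ = index.search(vector, 1)
            if scores[0][0] >= threshold:
                continue  # too similar to an already kept question
        index.add(vector)
        kept.append(text)
    return kept
```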
- Cross-Platform: Built with `pywebview` for Windows, macOS, and Linux (see the sketch after this list)
- Interactive Chat: Real-time RAG-enhanced question answering
- User Management: Login system with history tracking
- Admin Dashboard: User statistics and login analytics
- File Upload: Upload documents for training directly from the UI
- Modern UI: Responsive design with HTML/CSS/JavaScript
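As a sketch of the pywebview approach mentioned above, a desktop window with a Python-to-JavaScript bridge can be created roughly like this (a toy example, not the project's actual frontend entry point):

```python
# Toy pywebview example (illustrative; not the project's frontend code).
import webview


class Api:
    """Methods exposed to the embedded page via window.pywebview.api."""

    def ask(self, question: str) -> str:
        # In the real app this would call the RAG-enhanced generator.
        return f"You asked: {question}"


if __name__ == "__main__":
    webview.create_window("AI4Org Chat", html="<h1>AI4Org</h1>", js_api=Api())
    webview.start()
```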
- Unit Tests: Comprehensive coverage for core components
- Integration Tests: End-to-end pipeline validation
- E2E Tests: Full system testing
- Automated Testing: pytest-based test suite
```mermaid
graph TB
subgraph "Data Pipeline"
A[Raw Documents] --> B[Text Extraction]
B --> C[Cleaning & Chunking]
C --> D[Q&A Generation<br/>Qwen 7B]
D --> E[Evidence Extraction]
E --> F[Deduplication]
F --> G[Validated Q&A Pairs]
end
subgraph "Training Pipeline"
G --> H[Corpus Building]
H --> I[Discriminator Training]
I --> J[Supervised Fine-Tuning]
J --> K[RL Loop<br/>REINFORCE]
K --> L[Fine-tuned Generator]
end
subgraph "Inference"
M[User Query] --> N[RAG Retrieval]
N --> O[Context + Query]
L --> P[Generator]
O --> P
P --> Q[Response]
end
subgraph "Frontend"
R[Desktop App] --> M
Q --> R
end
style A fill:#e1f5ff
style G fill:#e1f5ff
style L fill:#c8e6c9
style Q fill:#fff9c4
style R fill:#f8bbd0
```
| Component | Technology | Purpose |
|---|---|---|
| Generator | TinyLlama-1.1B-Chat | Causal language model for answer generation |
| Discriminators | DistilBERT | Binary classifiers for quality validation |
| Retriever | Sentence-Transformers | Semantic search for relevant context |
| Data Pipeline | Qwen 7B (4-bit) | Automated Q&A pair generation |
| Frontend | pywebview + HTML/CSS/JS | Cross-platform desktop application |
| RL Training | REINFORCE | Policy gradient optimization |
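For orientation, the generator and a discriminator from the table can be loaded with the standard Hugging Face `transformers` API roughly as follows (a sketch with default loading options; the project's actual code may set dtype, device placement, or label names differently):

```python
# Sketch of loading the components listed above (illustrative only).
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

GEN_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
DISC_MODEL = "distilbert-base-uncased"

gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL)
generator = AutoModelForCausalLM.from_pretrained(GEN_MODEL)

disc_tokenizer = AutoTokenizer.from_pretrained(DISC_MODEL)
# Binary classifier head (e.g. factual vs. hallucinated).
discriminator = AutoModelForSequenceClassification.from_pretrained(
    DISC_MODEL, num_labels=2
)
```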
- Python: 3.10 or higher
- CUDA: 11.8+ (recommended for GPU training)
- RAM: 16GB minimum, 32GB recommended
- GPU: NVIDIA GPU with 8GB+ VRAM (optional but recommended)
- Clone the repository

  ```bash
  git clone https://github.com/merledu/ai4org.git
  cd ai4org
  ```

- Create a virtual environment

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Verify installation

  Run the included validation script to check your environment:

  ```bash
  python scripts/validate_setup.py
  ```
This script will check your Python version, installed packages, data files, and environment variables.
This project requires sentence-transformers and accelerate.
If inference fails, ensure they are installed:
```bash
pip install sentence-transformers accelerate
```
Train the complete pipeline (discriminators → SFT → RL):

```bash
python -m hallucination_reduction.main
```

What happens during training:
- Loads corpus and Q&A pairs from `data/processed/`
- Trains three discriminators (factuality, style, safety)
- Performs supervised fine-tuning on Q&A pairs
- Runs reinforcement learning with discriminator feedback
- Saves models to `saved_models_improved/`
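The reward used in the RL loop combines the three discriminator scores (weighted as in `config.py`). Below is a minimal sketch of a discriminator-guided REINFORCE update; the function name, tensor shapes, and mean-reward baseline are simplifications for illustration, not the project's actual training loop.

```python
# Simplified discriminator-guided REINFORCE step (illustrative only).
# log_probs: summed log-probability of each sampled answer (requires grad).
# *_scores: per-sample discriminator scores in [0, 1].
import torch

FACT_WEIGHT, STYLE_WEIGHT, SAFETY_WEIGHT = 0.8, 0.15, 0.05


def reinforce_step(
    log_probs: torch.Tensor,
    fact_scores: torch.Tensor,
    style_scores: torch.Tensor,
    safety_scores: torch.Tensor,
    optimizer: torch.optim.Optimizer,
) -> float:
    # Weighted reward from the three discriminators.
    rewards = (
        FACT_WEIGHT * fact_scores
        + STYLE_WEIGHT * style_scores
        + SAFETY_WEIGHT * safety_scores
    )
    # Mean-reward baseline reduces gradient variance.
    advantages = rewards - rewards.mean()
    # REINFORCE: maximize expected reward == minimize -(advantage * log-prob).
    loss = -(advantages.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```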
Training Configuration (edit `hallucination_reduction/config.py`):

```python
SFT_EPOCHS = 4    # Supervised fine-tuning epochs
SFT_BATCH = 1     # Batch size for SFT
DISC_EPOCHS = 4   # Discriminator training epochs
RL_EPOCHS = 4     # Reinforcement learning epochs
MC_ROLLOUTS = 6   # Monte Carlo rollouts per sample
```

Chat with the trained model via terminal:
```bash
python -m hallucination_reduction.inference
```

Example interaction:
```text
RAG-Enhanced Chat Mode Started – type 'exit' to quit.

You: What is the data retention policy?

Retrieved 3 relevant context chunks.
[1] Data retention policies specify that customer records...
[2] Personal information is stored for a maximum of...
[3] Upon user request, all data will be deleted within...

Generator: Based on the organization's data retention policy,
customer records are maintained for the duration required by
regulatory compliance, typically 7 years for financial records...
```
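The retrieval step shown above can be sketched with sentence-transformers as follows (the embedding model name and top-k value are assumptions for the example, not the retriever's actual settings):

```python
# Sketch of the semantic retrieval step (illustrative only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top-k chunks most similar to the query."""
    chunk_embs = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]
    best = scores.topk(k=min(top_k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]
```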
Start the interactive GUI:
```bash
cd frontend
pip install -r requirements.txt
python main.py
```

Features:
- User login and registration
- Interactive chat with RAG-enhanced responses
- Document upload for training
- Admin dashboard (PIN: 9999)
- User statistics and login history
Create training data from policy documents:
```bash
cd data_generation_pipeline/data-pipeline

# Place your PDF/TXT files in data/input/
python -m cli run \
    --input data/input/your_policy.pdf \
    --output data/output/qa_pairs.json
```

Pipeline stages:
- Text extraction from PDF/TXT
- Document cleaning and normalization
- Smart chunking with overlap
- Q&A generation using Qwen 7B
- Evidence sentence extraction
- Semantic deduplication
- Quality validation
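The chunking stage can be sketched as a sliding window with overlap, using the `chunk_size` and `overlap` values from `pipeline_config.yaml` (word-based splitting is an assumption here; the pipeline's actual chunker may operate on tokens or sentences):

```python
# Toy sliding-window chunker with overlap (illustrative only).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into overlapping word windows to preserve context."""
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```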
We have detailed documentation for each component of the system:
- System Architecture: High-level design, component interaction, and data flow diagrams.
- Hallucination Reduction: Deep dive into the generator, discriminators, and RL training pipeline.
- Data Generation Pipeline: How to generate training data from your own policy documents.
- Frontend Application: User guide and setup instructions for the desktop app.
- API Reference: Technical reference for key classes and functions.
```text
ai4org/
├── docs/                        # Project documentation
├── hallucination_reduction/     # Core ML pipeline
├── data_generation_pipeline/    # Q&A generation system
├── frontend/                    # Desktop application
├── data/                        # Data storage
├── tests/                       # Test suite
└── ...
```
```bash
# Run complete test suite
pytest tests/

# Run with coverage report
pytest --cov=hallucination_reduction --cov-report=html tests/

# Run specific test categories
pytest tests/unit/          # Unit tests only
pytest tests/integration/   # Integration tests only
pytest tests/e2e/           # End-to-end tests only
```

The project includes comprehensive tests for:
- Data loading and preprocessing
- Discriminator training and evaluation
- Generator fine-tuning
- RAG retrieval accuracy
- RL training loop
- Inference pipeline
- Q&A generation pipeline
- Frontend API endpoints
```bash
# Model selection
export GEN_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
export DISC_MODEL="distilbert-base-uncased"

# GPU configuration
export CUDA_VISIBLE_DEVICES="0,1"  # Use specific GPUs
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:true"

# Disable tokenizer warnings
export TOKENIZERS_PARALLELISM="false"
```
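These overrides are typically read via `os.environ`; a minimal pattern (an illustration of the mechanism, not necessarily the exact code in `config.py`) looks like:

```python
# Illustrative pattern for honoring the environment overrides above.
import os

GEN_MODEL = os.environ.get("GEN_MODEL", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
DISC_MODEL = os.environ.get("DISC_MODEL", "distilbert-base-uncased")
```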
Edit `hallucination_reduction/config.py`:

```python
# Supervised Fine-Tuning
SFT_EPOCHS = 4
SFT_BATCH = 1
SFT_LR = 3e-5

# Discriminator Training
DISC_EPOCHS = 4
DISC_BATCH = 8
DISC_LR = 2e-5

# Reinforcement Learning
RL_EPOCHS = 4
MC_ROLLOUTS = 6
GEN_LR = 1e-5
MAX_GEN_TOKENS = 64

# Reward Weights
FACT_WEIGHT = 0.8
STYLE_WEIGHT = 0.15
SAFETY_WEIGHT = 0.05
```

Edit `data_generation_pipeline/data-pipeline/config/pipeline_config.yaml`:
```yaml
chunking:
  chunk_size: 512
  overlap: 128

generation:
  max_qa_per_chunk: 5
  temperature: 0.7

deduplication:
  similarity_threshold: 0.85
```

| Metric | Baseline | After SFT | After RL | Improvement |
|---|---|---|---|---|
| Hallucination Rate | 45.2% | 28.7% | 12.3% | -32.9% |
| Factuality Score | 0.62 | 0.78 | 0.91 | +0.29 |
| Style Score | 0.71 | 0.84 | 0.89 | +0.18 |
| Safety Score | 0.88 | 0.93 | 0.96 | +0.08 |
- Discriminator Training: ~15 minutes
- Supervised Fine-Tuning: ~30 minutes
- Reinforcement Learning: ~45 minutes
- Total: ~1.5 hours
We welcome contributions from the community! Please see CONTRIBUTING.md for detailed guidelines.
- Fork the repository
- Create a feature branch

  ```bash
  git checkout -b feature/amazing-feature
  ```

- Make your changes
- Run tests

  ```bash
  pytest tests/
  ```

- Commit with clear messages

  ```bash
  git commit -m "Add amazing feature"
  ```

- Push to your fork

  ```bash
  git push origin feature/amazing-feature
  ```

- Open a Pull Request
- Follow PEP 8 guidelines
- Use meaningful variable and function names
- Add docstrings to all functions and classes
- Include type hints where appropriate
- Write tests for new features
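For example, a new helper and its test might look like this (a hypothetical function, shown only to illustrate the docstring, type-hint, and testing conventions):

```python
# Hypothetical example illustrating the style guidelines above.
def normalize_score(raw: float, max_value: float = 1.0) -> float:
    """Clamp a raw discriminator score into the [0, max_value] range."""
    return min(max(raw, 0.0), max_value)


def test_normalize_score() -> None:
    assert normalize_score(1.7) == 1.0
    assert normalize_score(-0.2) == 0.0
```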
This project is open-source and available under the MIT License. See LICENSE for details.
- Hugging Face for Transformers library and model hosting
- PyTorch team for the deep learning framework
- Sentence-Transformers for semantic search capabilities
- TinyLlama project for the efficient language model
- Qwen team for the Q&A generation model
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Organization: MeRL-EDU
- Support for larger models (Llama 2, Mistral)
- Multi-language support
- Web-based deployment option
- Real-time streaming inference
- Advanced evaluation metrics
- Model distillation for edge deployment
- Integration with vector databases (Pinecone, Weaviate)
- API server with FastAPI
- Docker containerization
- Kubernetes deployment templates
Built with ❤️ by the MeRL-EDU Team

⭐ Star us on GitHub – it motivates us a lot!
