AI-Powered Data Quality & Misinformation Detection for Training Datasets
Watchdog AI is a data curation pipeline for cleaning and validating AI training datasets. It detects misinformation, scores quality, removes duplicates, and estimates the energy and carbon savings of the reduced dataset, all without relying on external LLMs.
| Feature | Description |
|---|---|
| 🔍 Misinformation Detection | Rule-based pattern matching for clickbait, conspiracy theories, toxic content |
| 📊 Quality Scoring | Multi-dimensional scoring: completeness, language quality, information density |
| 🔄 Duplicate Detection | Exact (MD5 hash) + Semantic (TF-IDF cosine similarity) duplicate removal |
| 🌍 Sustainability Tracking | Carbon footprint calculation via Climatiq API integration |
| 🚀 REST API | Production-ready Flask API with CORS support |
| 🖥️ Web UI | Interactive frontend for testing and demonstrations |
┌─────────────────────────────────────────────────────────────────┐
│ INPUT DATASET │
│ (CSV / JSON / JSONL) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Misinformation Detection │
│ ├── Suspicious pattern matching (regex) │
│ ├── Clickbait phrase detection │
│ ├── Toxicity identification │
│ └── Source credibility scoring │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: Quality Assessment │
│ ├── Text completeness (required + optional fields) │
│ ├── Language quality (capitalization, punctuation) │
│ ├── Information density (lexical diversity) │
│ └── Spam indicator detection │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: Duplicate Removal │
│ ├── Exact duplicates (MD5 hashing) │
│ └── Semantic duplicates (TF-IDF + Cosine similarity) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STEP 4: Sustainability Impact │
│ ├── Data reduction percentage │
│ ├── Energy savings (kWh) │
│ └── Carbon footprint (kg CO₂) via Climatiq API │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ CLEANED DATASET │
│ + Statistics + Sustainability Report │
└─────────────────────────────────────────────────────────────────┘
git clone https://github.com/Dev-31/WatchdogAI.git
cd WatchdogAI
pip install -r requirements.txt
cp .env.example .env
# Edit .env and add your CLIMATIQ_API_KEY (optional)
python api/app.py
API will be available at: http://localhost:5000
Open frontend.html in your browser to access the interactive interface.
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API information |
| `/health` | GET | Health check |
| `/analyze` | POST | Analyze single text for misinformation + quality |
| `/analyze/batch` | POST | Batch analysis of multiple texts |
| `/quality` | POST | Quality scoring only |
| `/duplicates` | POST | Find duplicates in text list |
| `/process` | POST | Full 4-step pipeline processing |
| `/sustainability` | POST | Calculate carbon savings |
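As a rough sketch of calling the API from Python rather than curl, the following builds a POST request for `/analyze` using only the standard library. The `text` and `source` fields mirror the curl example; the helper names `build_analyze_request` and `analyze` are hypothetical, and treating `source` as optional is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default address from the setup steps


def build_analyze_request(text, source=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /analyze endpoint."""
    payload = {"text": text}
    if source is not None:
        # Assumption: the API accepts requests without a source.
        payload["source"] = source
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, source=None, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_analyze_request(text, source, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `analyze("Your text here", "example.com")` should return a dictionary with the response fields the API sends back.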
curl -X POST http://localhost:5000/analyze \
-H "Content-Type: application/json" \
-d '{"text": "Your text here", "source": "example.com"}'
Response:
{
"status": "clean",
"misinformation_score": 0.12,
"confidence": 0.45,
"risk_level": "low",
"quality_score": 0.82,
"quality_level": "high",
"flags": [],
"explanations": ["No significant misinformation indicators detected"]
}
Detects problematic content using regex pattern matching:
- Suspicious patterns: Conspiracy theories, miracle cures, sensationalism
- Clickbait phrases: "You won't believe", "Number X will shock you"
- Toxicity: Hate speech, insults, aggressive language
- Source credibility: Boosts `.gov`, `.edu`; penalizes suspicious domains
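The checks above might look roughly like the sketch below. The phrases come from this README's examples, but the regexes, the category grouping, and the scoring constants (0.3 per flagged category, 0.2 credit for a trusted source) are assumptions for illustration, not the project's actual rules.

```python
import re

# Illustrative pattern lists; the real detector's rules may differ.
PATTERNS = {
    "clickbait": [r"you won'?t believe", r"number \d+ will shock you"],
    "conspiracy": [r"deep state", r"illuminati", r"wake up sheeple"],
    "toxicity": [r"\bidiot\b", r"\bstupid\b"],
}
TRUSTED_SUFFIXES = (".gov", ".edu")


def find_flags(text):
    """Return the sorted names of pattern categories that match the text."""
    lowered = text.lower()
    return sorted(
        name
        for name, regexes in PATTERNS.items()
        if any(re.search(rx, lowered) for rx in regexes)
    )


def misinformation_score(text, source=""):
    """Toy score in [0, 1]: 0.3 per flagged category, minus 0.2 for trusted sources."""
    score = 0.3 * len(find_flags(text))
    if source.lower().endswith(TRUSTED_SUFFIXES):
        score -= 0.2  # hypothetical credibility boost
    return min(1.0, max(0.0, score))
```

Keeping the patterns in a plain dictionary makes it easy to add or audit rules without touching the scoring logic.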
Evaluates structural quality with weighted scoring:
- Information density: 28%
- Language quality: 25%
- Word count: 13%
- Completeness: 12%
- Spam check: 12%
- Text length: 10%
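The weights above sum to 100%, so the overall score can be read as a weighted average of the six dimensions. A minimal sketch, assuming each dimension has already been scored in [0, 1]; the key names and the `combine_scores` helper are hypothetical:

```python
# Weights as listed above, expressed as fractions of 1.0.
WEIGHTS = {
    "information_density": 0.28,
    "language_quality": 0.25,
    "word_count": 0.13,
    "completeness": 0.12,
    "spam_check": 0.12,
    "text_length": 0.10,
}


def combine_scores(components):
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Under this scheme, a text that scores perfectly on everything except information density loses at most 0.28 of its overall score.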
Two-stage duplicate detection:
- Exact: MD5 hash-based O(1) lookup
- Semantic: TF-IDF vectorization + cosine similarity (threshold: 85%)
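The two stages can be sketched as follows. To stay dependency-free, this version uses a small hand-written TF-IDF as a stand-in for scikit-learn's vectorizer, so its weights will not match the project's exactly. Normalizing case and whitespace before hashing is also an assumption.

```python
import hashlib
import math
from collections import Counter


def md5_key(text):
    """MD5 hash of the whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def tfidf_vectors(texts):
    """Minimal TF-IDF: raw term counts times a smoothed inverse document frequency."""
    tokenized = [t.lower().split() for t in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(texts)
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(texts, threshold=0.85):
    """Return indices of texts to keep: exact stage first, then semantic stage."""
    seen, unique = set(), []
    for i, text in enumerate(texts):
        key = md5_key(text)
        if key not in seen:  # set membership gives the O(1) lookup
            seen.add(key)
            unique.append(i)
    vectors = tfidf_vectors([texts[i] for i in unique])
    kept = []  # positions into `unique`
    for pos, vec in enumerate(vectors):
        if all(cosine(vec, vectors[k]) < threshold for k in kept):
            kept.append(pos)
    return [unique[pos] for pos in kept]
```

Running the cheap exact stage first shrinks the input to the semantic stage, which compares each text against every kept text and is therefore quadratic in the worst case.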
Environmental impact calculation:
- Integrates with Climatiq API for real carbon data
- Fallback to regional averages (Global: 0.475 kg CO₂/kWh)
- Tracks energy, carbon, and water usage
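The fallback path reduces to simple arithmetic: energy saved multiplied by a carbon intensity factor. The sketch below uses the global fallback of 0.475 kg CO₂/kWh quoted above. The per-record energy figure is a placeholder invented for illustration, and the Climatiq request itself is deliberately left out.

```python
FALLBACK_KG_CO2_PER_KWH = 0.475  # global average quoted above

# Hypothetical figure for illustration only: energy attributed to
# processing one record during training.
ASSUMED_KWH_PER_RECORD = 0.0001


def sustainability_report(original_count, cleaned_count,
                          kwh_per_record=ASSUMED_KWH_PER_RECORD,
                          kg_co2_per_kwh=FALLBACK_KG_CO2_PER_KWH):
    """Estimate savings from removed records using a fixed carbon factor."""
    if original_count <= 0 or not 0 <= cleaned_count <= original_count:
        raise ValueError("expected original_count > 0 and 0 <= cleaned_count <= original_count")
    removed = original_count - cleaned_count
    energy_saved_kwh = removed * kwh_per_record
    return {
        "data_reduction_pct": 100.0 * removed / original_count,
        "energy_saved_kwh": energy_saved_kwh,
        "carbon_saved_kg": energy_saved_kwh * kg_co2_per_kwh,
    }
```

Passing a region-specific `kg_co2_per_kwh` in place of the default is how a regional average, or a factor obtained from Climatiq, would plug into the same calculation.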
| Type | Examples |
|---|---|
| Clickbait | "you won't believe", "shocking", "number X will..." |
| Conspiracy | "deep state", "illuminati", "wake up sheeple" |
| Toxicity | "idiot", "stupid", "hate", "disaster", "fraud" |
| Spam | "FREE", "CLICK HERE", "GUARANTEED", "ACT NOW" |
| Low Quality | Excessive filler words: "good", "stuff", "very" |
| Excessive Caps | >40% uppercase letters |
| Excessive Punctuation | Multiple !!! or ??? |
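The last two rows of the table are simple enough to sketch directly. The 40% uppercase threshold comes from the table. Treating "multiple" as a run of three or more `!` or `?` characters (mixed runs included) is an assumption, and the function names are hypothetical.

```python
import re


def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def has_excessive_caps(text, threshold=0.40):
    """True if more than `threshold` of the letters are uppercase."""
    return uppercase_ratio(text) > threshold


def has_excessive_punctuation(text):
    """True if the text contains a run of three or more '!' or '?' characters."""
    return re.search(r"[!?]{3,}", text) is not None
```

Measuring the ratio over letters only, rather than all characters, keeps digits and punctuation from diluting the result for short texts.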
| Document | Description |
|---|---|
| Technical Walkthrough (PDF) | Complete code explanation for every function |
| Demo Inputs (PDF) | 10 test cases for live demonstrations |
# Run quick check
python quick_check.py
# Run verification checkpoints
python verify_checkpoints.py
# Run tests
pytest tests/
WatchdogAI/
├── api/
│ └── app.py # Flask REST API
├── src/
│ ├── misinformation_detector.py
│ ├── quality_scorer.py
│ ├── redundancy_detector.py
│ ├── sustainability_tracker.py
│ └── dataset_processor.py
├── docs/
│ ├── generate_walkthrough_pdf.py
│ └── generate_demo_inputs_pdf.py
├── frontend.html # Web UI
├── requirements.txt
├── .env.example
└── README.md
When you clean your dataset with Watchdog AI, you receive:
- Immediate savings: Data reduction %, energy saved (kWh), carbon saved (kg CO₂)
- Annual projections: Extrapolated yearly impact
- Equivalencies: Trees planted, car miles avoided
| Environment Variable | Description | Required |
|---|---|---|
| `CLIMATIQ_API_KEY` | API key for carbon calculations | Optional |
Without the API key, sustainability tracking uses regional fallback values.
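One way to implement this optional-key behaviour is sketched below. Only the variable name `CLIMATIQ_API_KEY` comes from this README. The helper names, and the choice to treat a blank value as unset, are assumptions.

```python
import os


def load_climatiq_key(environ=os.environ):
    """Return the API key, or None when it is unset or blank."""
    key = environ.get("CLIMATIQ_API_KEY", "").strip()
    return key or None


def carbon_mode(environ=os.environ):
    """Report which data source would be used under this sketch."""
    return "climatiq" if load_climatiq_key(environ) else "fallback"
```

Accepting the environment as a parameter makes the behaviour easy to test without modifying the real process environment.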
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Climatiq for carbon emissions API
- scikit-learn for TF-IDF vectorization
- Flask for the REST API framework
Built with ❤️ for cleaner AI training data