AI-Powered Data Quality & Misinformation Detection for Training Datasets
Watchdog AI is a data curation pipeline that cleans and validates AI training datasets. It detects misinformation, assesses quality, removes duplicates, and tracks environmental sustainability, all without relying on external LLMs.
| Feature | Description |
|---|---|
| Misinformation Detection | Rule-based pattern matching for clickbait, conspiracy theories, toxic content |
| Quality Scoring | Multi-dimensional scoring: completeness, language quality, information density |
| Duplicate Detection | Exact (MD5 hash) + Semantic (TF-IDF cosine similarity) duplicate removal |
| Sustainability Tracking | Carbon footprint calculation via Climatiq API integration |
| REST API | Production-ready Flask API with CORS support |
| Web UI | Interactive frontend for testing and demonstrations |
┌──────────────────────────────────────────────────────────┐
│                      INPUT DATASET                       │
│                   (CSV / JSON / JSONL)                   │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│  STEP 1: Misinformation Detection                        │
│  ├── Suspicious pattern matching (regex)                 │
│  ├── Clickbait phrase detection                          │
│  ├── Toxicity identification                             │
│  └── Source credibility scoring                          │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│  STEP 2: Quality Assessment                              │
│  ├── Text completeness (required + optional fields)      │
│  ├── Language quality (capitalization, punctuation)      │
│  ├── Information density (lexical diversity)             │
│  └── Spam indicator detection                            │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│  STEP 3: Duplicate Removal                               │
│  ├── Exact duplicates (MD5 hashing)                      │
│  └── Semantic duplicates (TF-IDF + cosine similarity)    │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│  STEP 4: Sustainability Impact                           │
│  ├── Data reduction percentage                           │
│  ├── Energy savings (kWh)                                │
│  └── Carbon footprint (kg CO₂) via Climatiq API          │
└──────────────────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                     CLEANED DATASET                      │
│           + Statistics + Sustainability Report           │
└──────────────────────────────────────────────────────────┘
git clone https://github.com/Dev-31/WatchdogAI.git
cd WatchdogAI
pip install -r requirements.txt
cp .env.example .env
# Edit .env and add your CLIMATIQ_API_KEY (optional)
python api/app.py
The API will be available at http://localhost:5000
Open frontend.html in your browser to access the interactive interface.
| Endpoint | Method | Description |
|---|---|---|
| / | GET | API information |
| /health | GET | Health check |
| /analyze | POST | Analyze single text for misinformation + quality |
| /analyze/batch | POST | Batch analysis of multiple texts |
| /quality | POST | Quality scoring only |
| /duplicates | POST | Find duplicates in text list |
| /process | POST | Full 4-step pipeline processing |
| /sustainability | POST | Calculate carbon savings |
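The endpoints listed above can also be called from Python. The following is a minimal sketch using only the standard library. The helper names `build_analyze_request` and `analyze` are hypothetical and not part of the project; the `text` and `source` fields and the base URL are the ones this README uses for /analyze.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default address from the setup steps


def build_analyze_request(text, source=None):
    """Build (but do not send) a POST request for the /analyze endpoint."""
    payload = {"text": text}
    if source is not None:
        payload["source"] = source
    return urllib.request.Request(
        f"{BASE_URL}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, source=None, timeout=10):
    """Send the request to a running server and return the decoded JSON body."""
    request = build_analyze_request(text, source)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running locally, `analyze("Your text here", "example.com")` should return a dictionary with keys such as `status` and `misinformation_score`.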
curl -X POST http://localhost:5000/analyze \
-H "Content-Type: application/json" \
-d '{"text": "Your text here", "source": "example.com"}'
Response:
{
"status": "clean",
"misinformation_score": 0.12,
"confidence": 0.45,
"risk_level": "low",
"quality_score": 0.82,
"quality_level": "high",
"flags": [],
"explanations": ["No significant misinformation indicators detected"]
}
Misinformation detection flags problematic content using regex pattern matching:
- Suspicious patterns: Conspiracy theories, miracle cures, sensationalism
- Clickbait phrases: "You won't believe", "Number X will shock you"
- Toxicity: Hate speech, insults, aggressive language
- Source credibility: Boosts .gov and .edu domains; penalizes suspicious domains
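As an illustration of the rule-based approach described above, here is a minimal sketch of regex matching combined with a source credibility adjustment. The pattern lists, the 0.4 per-flag weight, and the 0.2 trusted-domain discount are assumptions chosen for the example; the project's actual patterns and weights may differ.

```python
import re

# Hypothetical pattern lists built from the examples in this README.
CLICKBAIT_PATTERNS = [
    re.compile(r"you won't believe", re.IGNORECASE),
    re.compile(r"number \d+ will shock you", re.IGNORECASE),
]
SUSPICIOUS_PATTERNS = [
    re.compile(r"miracle cure", re.IGNORECASE),
    re.compile(r"deep state", re.IGNORECASE),
]
TRUSTED_SUFFIXES = (".gov", ".edu")


def score_text(text, source=""):
    """Return a score in [0, 1] and the list of rule groups that matched."""
    flags = []
    if any(p.search(text) for p in CLICKBAIT_PATTERNS):
        flags.append("clickbait")
    if any(p.search(text) for p in SUSPICIOUS_PATTERNS):
        flags.append("suspicious_pattern")
    # Assumed weighting: each flag adds 0.4, a trusted domain subtracts 0.2.
    score = 0.4 * len(flags)
    if source.lower().endswith(TRUSTED_SUFFIXES):
        score -= 0.2
    return {"score": min(1.0, max(0.0, score)), "flags": flags}
```

Keeping each rule group in its own list makes it straightforward to report which category triggered, which is what the `flags` field in the API response conveys.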
Quality scoring evaluates structural quality as a weighted sum of six components:
- Information density: 28%
- Language quality: 25%
- Word count: 13%
- Completeness: 12%
- Spam check: 12%
- Text length: 10%
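The weights above sum to 100%, so the overall score can be read as a weighted average of the component scores. A minimal sketch, assuming each component has already been scored in the range 0 to 1; the dictionary keys and the function name `combine_scores` are hypothetical:

```python
# Weights as listed in this README; key names are illustrative.
WEIGHTS = {
    "information_density": 0.28,
    "language_quality": 0.25,
    "word_count": 0.13,
    "completeness": 0.12,
    "spam_check": 0.12,
    "text_length": 0.10,
}


def combine_scores(components):
    """Weighted sum of per-component scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

For example, a text that scores 1.0 on every component except information density (0.5) would receive 1.0 - 0.28 * 0.5 = 0.86.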
Two-stage duplicate detection:
- Exact: MD5 hash-based O(1) lookup
- Semantic: TF-IDF vectorization + cosine similarity (threshold: 85%)
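The two stages can be sketched as follows. To keep the example dependency-free, it implements a small TF-IDF and cosine similarity in pure Python rather than calling scikit-learn, which the project itself uses; the function names, the tokenizer, and the choice to strip whitespace before hashing are assumptions made for illustration.

```python
import hashlib
import math
import re
from collections import Counter


def md5_key(text):
    """Hash the stripped text; identical hashes mean exact duplicates."""
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (term -> weight) with smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(texts, threshold=0.85):
    # Stage 1: drop exact duplicates via set membership on the MD5 hash.
    seen, unique = set(), []
    for text in texts:
        key = md5_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    # Stage 2: drop texts too similar to one that was already kept.
    vectors = tfidf_vectors(unique)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [unique[i] for i in kept]
```

Running the cheap exact pass first shrinks the input to the pairwise similarity pass, which is the expensive step because it compares each text against every text kept so far.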
Environmental impact calculation:
- Integrates with Climatiq API for real carbon data
- Fallback to regional averages (Global: 0.475 kg CO₂/kWh)
- Tracks energy, carbon, and water usage
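When the fallback applies, the carbon figure reduces to a single multiplication of energy saved by grid intensity. A minimal sketch using the global average quoted above; the function names are hypothetical, and how the project estimates the energy figure itself is not covered here.

```python
GLOBAL_FALLBACK_KG_PER_KWH = 0.475  # global average quoted in this README


def carbon_saved_kg(energy_saved_kwh, intensity_kg_per_kwh=GLOBAL_FALLBACK_KG_PER_KWH):
    """Carbon saved (kg CO2) = energy saved (kWh) x grid intensity (kg CO2/kWh)."""
    if energy_saved_kwh < 0:
        raise ValueError("energy_saved_kwh must be non-negative")
    return energy_saved_kwh * intensity_kg_per_kwh


def data_reduction_percent(original_rows, cleaned_rows):
    """Share of rows removed by cleaning, as a percentage."""
    if original_rows <= 0:
        raise ValueError("original_rows must be positive")
    return 100.0 * (original_rows - cleaned_rows) / original_rows
```

For instance, saving 10 kWh at the global average corresponds to 4.75 kg CO₂, and trimming a 1,000-row dataset to 800 rows is a 20% reduction.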
| Type | Examples |
|---|---|
| Clickbait | "you won't believe", "shocking", "number X will..." |
| Conspiracy | "deep state", "illuminati", "wake up sheeple" |
| Toxicity | "idiot", "stupid", "hate", "disaster", "fraud" |
| Spam | "FREE", "CLICK HERE", "GUARANTEED", "ACT NOW" |
| Low Quality | Excessive filler words: "good", "stuff", "very" |
| Excessive Caps | >40% uppercase letters |
| Excessive Punctuation | Multiple !!! or ??? |
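The last two rows of the table are simple to express in code. The sketch below is one possible reading: it measures uppercase letters as a share of all letters and treats three or more consecutive `!` or `?` as excessive. Both interpretations, and the function names, are assumptions rather than the project's confirmed logic.

```python
import re


def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def has_excessive_caps(text, threshold=0.40):
    return uppercase_ratio(text) > threshold


def has_excessive_punctuation(text):
    # Assumed rule: a run of three or more "!" or three or more "?".
    return re.search(r"!{3,}|\?{3,}", text) is not None
```

Measuring against letters only, rather than all characters, keeps digits and whitespace from diluting the ratio in short texts.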
| Document | Description |
|---|---|
| Technical Walkthrough (PDF) | Complete code explanation for every function |
| Demo Inputs (PDF) | 10 test cases for live demonstrations |
# Run quick check
python quick_check.py
# Run verification checkpoints
python verify_checkpoints.py
# Run tests
pytest tests/
WatchdogAI/
├── api/
│   └── app.py                      # Flask REST API
├── src/
│   ├── misinformation_detector.py
│   ├── quality_scorer.py
│   ├── redundancy_detector.py
│   ├── sustainability_tracker.py
│   └── dataset_processor.py
├── docs/
│   ├── generate_walkthrough_pdf.py
│   └── generate_demo_inputs_pdf.py
├── frontend.html                   # Web UI
├── requirements.txt
├── .env.example
└── README.md
When you clean your dataset with Watchdog AI, you receive:
- Immediate savings: Data reduction %, energy saved (kWh), carbon saved (kg CO₂)
- Annual projections: Extrapolated yearly impact
- Equivalencies: Trees planted, car miles avoided
| Environment Variable | Description | Required |
|---|---|---|
| CLIMATIQ_API_KEY | API key for carbon calculations | Optional |
Without the API key, sustainability tracking uses regional fallback values.
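A minimal sketch of how the optional key might be checked at startup. The function name `get_carbon_source` and the returned labels are hypothetical; only the variable name `CLIMATIQ_API_KEY` comes from this README.

```python
import os


def get_carbon_source(environ=os.environ):
    """Pick the carbon data source based on whether an API key is set."""
    key = environ.get("CLIMATIQ_API_KEY", "").strip()
    if key:
        return ("climatiq", key)
    # No key configured: use the regional fallback values instead.
    return ("regional_fallback", None)
```

Treating an empty or whitespace-only value the same as a missing one avoids sending a blank key when .env is copied from .env.example without edits.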
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Climatiq for carbon emissions API
- scikit-learn for TF-IDF vectorization
- Flask for the REST API framework
Built with ❤️ for cleaner AI training data