
🛡️ Watchdog AI

AI-Powered Data Quality & Misinformation Detection for Training Datasets

Python 3.8+ | License: MIT | Flask

Watchdog AI is a data curation pipeline for cleaning and validating AI training datasets. It detects misinformation, assesses quality, removes duplicates, and tracks environmental sustainability, all without relying on external LLMs.


✨ Key Features

| Feature | Description |
|---|---|
| 🔍 Misinformation Detection | Rule-based pattern matching for clickbait, conspiracy theories, and toxic content |
| 📊 Quality Scoring | Multi-dimensional scoring: completeness, language quality, information density |
| 🔄 Duplicate Detection | Exact (MD5 hash) and semantic (TF-IDF cosine similarity) duplicate removal |
| 🌍 Sustainability Tracking | Carbon footprint calculation via Climatiq API integration |
| 🚀 REST API | Production-ready Flask API with CORS support |
| 🖥️ Web UI | Interactive frontend for testing and demonstrations |

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        INPUT DATASET                            │
│                    (CSV / JSON / JSONL)                         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STEP 1: Misinformation Detection                               │
│  ├── Suspicious pattern matching (regex)                        │
│  ├── Clickbait phrase detection                                 │
│  ├── Toxicity identification                                    │
│  └── Source credibility scoring                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STEP 2: Quality Assessment                                     │
│  ├── Text completeness (required + optional fields)             │
│  ├── Language quality (capitalization, punctuation)             │
│  ├── Information density (lexical diversity)                    │
│  └── Spam indicator detection                                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STEP 3: Duplicate Removal                                      │
│  ├── Exact duplicates (MD5 hashing)                             │
│  └── Semantic duplicates (TF-IDF + Cosine similarity)           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STEP 4: Sustainability Impact                                  │
│  ├── Data reduction percentage                                  │
│  ├── Energy savings (kWh)                                       │
│  └── Carbon footprint (kg CO₂) via Climatiq API                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      CLEANED DATASET                            │
│              + Statistics + Sustainability Report               │
└─────────────────────────────────────────────────────────────────┘

🚀 Quick Start

1. Clone & Install

git clone https://github.com/Dev-31/WatchdogAI.git
cd WatchdogAI
pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env and add your CLIMATIQ_API_KEY (optional)

3. Run the API Server

python api/app.py

The API will be available at http://localhost:5000

4. Open the UI

Open frontend.html in your browser to access the interactive interface.


📡 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| / | GET | API information |
| /health | GET | Health check |
| /analyze | POST | Analyze a single text for misinformation and quality |
| /analyze/batch | POST | Batch analysis of multiple texts |
| /quality | POST | Quality scoring only |
| /duplicates | POST | Find duplicates in a list of texts |
| /process | POST | Full 4-step pipeline processing |
| /sustainability | POST | Calculate carbon savings |

Example: Analyze Text

curl -X POST http://localhost:5000/analyze \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text here", "source": "example.com"}'

Response:

{
  "status": "clean",
  "misinformation_score": 0.12,
  "confidence": 0.45,
  "risk_level": "low",
  "quality_score": 0.82,
  "quality_level": "high",
  "flags": [],
  "explanations": ["No significant misinformation indicators detected"]
}
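
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from step 3 is running at http://localhost:5000. The helper names `build_analyze_request` and `analyze` are illustrative and not part of the project.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/analyze"  # default address from the Quick Start


def build_analyze_request(text, source=None, url=API_URL):
    """Build a POST request mirroring the curl example (hypothetical helper)."""
    payload = {"text": text}
    if source is not None:
        payload["source"] = source
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, source=None, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_analyze_request(text, source)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `analyze("Your text here", source="example.com")` should return a dictionary with the fields shown in the response above, such as `status` and `misinformation_score`.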

🔧 Core Modules

MisinformationDetector

Detects problematic content using regex pattern matching:

  • Suspicious patterns: Conspiracy theories, miracle cures, sensationalism
  • Clickbait phrases: "You won't believe", "Number X will shock you"
  • Toxicity: Hate speech, insults, aggressive language
  • Source credibility: Boosts .gov, .edu; penalizes suspicious domains
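
The checks listed above can be sketched roughly as follows. This is an illustration of rule-based matching, not the project's `MisinformationDetector`: the pattern lists are abbreviated, and the per-category weight (0.3) and the credibility adjustment (0.2) are assumed values chosen only for the example.

```python
import re

# Abbreviated, illustrative pattern lists; the real lists are not shown in this README.
CLICKBAIT = [r"you won'?t believe", r"number \d+ will shock you"]
CONSPIRACY = [r"deep state", r"illuminati", r"wake up sheeple"]
TOXIC = [r"\bidiot\b", r"\bstupid\b"]

TRUSTED_SUFFIXES = (".gov", ".edu")  # domains that receive a credibility boost


def find_flags(text):
    """Return the categories whose patterns match the lowercased text."""
    lowered = text.lower()
    categories = (("clickbait", CLICKBAIT), ("conspiracy", CONSPIRACY), ("toxicity", TOXIC))
    return [
        label
        for label, patterns in categories
        if any(re.search(pattern, lowered) for pattern in patterns)
    ]


def score_text(text, source=""):
    """Return (score, flags); score is in [0, 1], higher means more suspicious."""
    flags = find_flags(text)
    score = 0.3 * len(flags)  # assumed weight per matched category
    if source.lower().endswith(TRUSTED_SUFFIXES):
        score -= 0.2  # assumed boost for trusted domains
    return round(max(0.0, min(1.0, score)), 2), flags
```

For example, `score_text("You won't believe this deep state secret", "blog.example.com")` flags both `clickbait` and `conspiracy`, while the same text attributed to a `.gov` source receives a lower score.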

DataQualityScorer

Evaluates structural quality with weighted scoring:

  • Information density: 28%
  • Language quality: 25%
  • Word count: 13%
  • Completeness: 12%
  • Spam check: 12%
  • Text length: 10%
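
The weights above sum to 100%, so the overall score can be read as a weighted average of the six component scores. A minimal sketch, assuming each component has already been scored in the range [0, 1]; the dictionary keys and the function name `combine_scores` are illustrative, not the project's API.

```python
# Weights copied from the list above, expressed as fractions of 1.
WEIGHTS = {
    "information_density": 0.28,
    "language_quality": 0.25,
    "word_count": 0.13,
    "completeness": 0.12,
    "spam_check": 0.12,
    "text_length": 0.10,
}


def combine_scores(components):
    """Combine per-dimension scores (each in [0, 1]) into one weighted score."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(total, 4)
```

Because information density carries the largest weight, halving it while every other component stays at 1.0 lowers the overall score by 0.14.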

RedundancyDetector

Two-stage duplicate detection:

  1. Exact: MD5 hash-based O(1) lookup
  2. Semantic: TF-IDF vectorization + cosine similarity (threshold: 85%)
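
The two stages can be sketched with the standard library alone. This is an illustration, not the project's `RedundancyDetector`: the whitespace and case normalization before hashing is an assumption, and the smoothed TF-IDF weighting here is a simple variant that may not match the weighting the project actually uses.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization step)."""
    return " ".join(text.lower().split())


def exact_duplicates(texts):
    """Return indices of texts whose MD5 digest has already been seen."""
    seen = set()
    duplicates = []
    for index, text in enumerate(texts):
        digest = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:  # set membership is O(1) on average
            duplicates.append(index)
        else:
            seen.add(digest)
    return duplicates


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight) with smoothed IDF."""
    docs = [Counter(re.findall(r"\w+", text.lower())) for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_duplicates(texts, threshold=0.85):
    """Return index pairs whose cosine similarity meets the threshold."""
    vectors = tfidf_vectors(texts)
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]
```

The pairwise comparison is quadratic in the number of texts, which is acceptable for a sketch but would need batching or an approximate index on large datasets.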

SustainabilityTracker

Environmental impact calculation:

  • Integrates with Climatiq API for real carbon data
  • Fallback to regional averages (Global: 0.475 kg CO₂/kWh)
  • Tracks energy, carbon, and water usage
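
When only the fallback factor is available, the carbon estimate reduces to simple arithmetic: energy saved multiplied by the emission factor. A minimal sketch, assuming the caller supplies an energy cost per record (`kwh_per_record` is a hypothetical parameter; this README does not state the value the project uses).

```python
GLOBAL_FALLBACK_KG_CO2_PER_KWH = 0.475  # global fallback factor quoted above


def estimate_savings(original_count, cleaned_count, kwh_per_record,
                     kg_co2_per_kwh=GLOBAL_FALLBACK_KG_CO2_PER_KWH):
    """Estimate data reduction, energy saved, and carbon saved for a cleaning run."""
    if original_count <= 0 or not 0 <= cleaned_count <= original_count:
        raise ValueError("require original_count > 0 and 0 <= cleaned_count <= original_count")
    removed = original_count - cleaned_count
    energy_saved_kwh = removed * kwh_per_record
    return {
        "data_reduction_pct": 100.0 * removed / original_count,
        "energy_saved_kwh": energy_saved_kwh,
        "carbon_saved_kg": energy_saved_kwh * kg_co2_per_kwh,
    }
```

For instance, removing 200 of 1,000 records at an assumed 0.01 kWh per record gives a 20% reduction, 2 kWh saved, and about 0.95 kg CO₂ avoided at the global fallback factor.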

📋 Detection Triggers Reference

| Type | Examples |
|---|---|
| Clickbait | "you won't believe", "shocking", "number X will..." |
| Conspiracy | "deep state", "illuminati", "wake up sheeple" |
| Toxicity | "idiot", "stupid", "hate", "disaster", "fraud" |
| Spam | "FREE", "CLICK HERE", "GUARANTEED", "ACT NOW" |
| Low Quality | Excessive filler words: "good", "stuff", "very" |
| Excessive Caps | >40% uppercase letters |
| Excessive Punctuation | Multiple !!! or ??? |
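
The two structural triggers (excessive caps and excessive punctuation) are easy to express in code. A minimal sketch, assuming the 40% threshold is measured over alphabetic characters only and that "multiple" means three or more repeated marks; both are interpretations, and the function names are illustrative.

```python
import re


def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if none)."""
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return 0.0
    return sum(char.isupper() for char in letters) / len(letters)


def has_excessive_caps(text, threshold=0.40):
    """True when more than 40% of the letters are uppercase."""
    return uppercase_ratio(text) > threshold


def has_excessive_punctuation(text):
    """True when the text contains a run of three or more '!' or '?'."""
    return re.search(r"!{3,}|\?{3,}", text) is not None
```

Under these assumptions, "FREE MONEY now" trips the caps check (9 of 12 letters are uppercase), while "Hello world" does not.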

📄 Documentation

| Document | Description |
|---|---|
| Technical Walkthrough (PDF) | Complete code explanation for every function |
| Demo Inputs (PDF) | 10 test cases for live demonstrations |

🧪 Testing

# Run quick check
python quick_check.py

# Run verification checkpoints
python verify_checkpoints.py

# Run tests
pytest tests/

📁 Project Structure

WatchdogAI/
├── api/
│   └── app.py              # Flask REST API
├── src/
│   ├── misinformation_detector.py
│   ├── quality_scorer.py
│   ├── redundancy_detector.py
│   ├── sustainability_tracker.py
│   └── dataset_processor.py
├── docs/
│   ├── generate_walkthrough_pdf.py
│   └── generate_demo_inputs_pdf.py
├── frontend.html           # Web UI
├── requirements.txt
├── .env.example
└── README.md

🌍 Sustainability Impact

When you clean your dataset with Watchdog AI, you receive:

  • Immediate savings: Data reduction %, energy saved (kWh), carbon saved (kg CO₂)
  • Annual projections: Extrapolated yearly impact
  • Equivalencies: Trees planted, car miles avoided

⚙️ Configuration

| Environment Variable | Description | Required |
|---|---|---|
| CLIMATIQ_API_KEY | API key for carbon calculations | Optional |

Without the API key, sustainability tracking uses regional fallback values.
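
The fallback behaviour can be pictured as a small configuration check. A minimal sketch, assuming the key is exposed through the process environment (for example, loaded from `.env` before startup); the function name `carbon_config` and the returned dictionary shape are hypothetical.

```python
import os

GLOBAL_FALLBACK_KG_CO2_PER_KWH = 0.475  # global fallback factor from the README


def carbon_config(env=None):
    """Choose between Climatiq and the fallback factor based on the environment."""
    env = os.environ if env is None else env
    api_key = (env.get("CLIMATIQ_API_KEY") or "").strip()
    if api_key:
        return {"mode": "climatiq", "api_key": api_key}
    # No key configured: use the regional fallback value instead of the API.
    return {"mode": "fallback", "kg_co2_per_kwh": GLOBAL_FALLBACK_KG_CO2_PER_KWH}
```

Passing a plain dictionary as `env` makes the selection logic easy to exercise in tests without touching the real environment.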


🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


Built with ❤️ for cleaner AI training data
