Team Members: Kushal Chattopadhyay, Keyu Wang, Terry Zhou
Group Name: DataDetox
DataDetox is an interactive AI agent orchestration system that leverages MCP, graph-based data, and cloud infrastructure to trace ML data and model provenance. It enables practitioners to explore AI model lineages where there is otherwise a critical lack of transparency about training data and upstream dependencies. Using Hugging Face model information and arXiv papers, the system traces how datasets and models connect across the AI ecosystem, helping developers identify hidden risks such as copyrighted data or problematic datasets (e.g., LAION-5B, MS-Celeb-1M) that propagate downstream. Users can ask questions like "Tell me about any datasets or upstream models associated with Qwen3-4B" to assess model risk, receiving visual dependency graphs and clear answers instead of manually piecing together scattered documentation.
To learn more, read our Medium blog post and watch our video demo!
All source code is organized in the repository:

- **Frontend:** `/frontend/` - React + TypeScript application
  - Components in `frontend/src/components/`
  - Pages in `frontend/src/pages/`
- **Backend:** `/backend/` - FastAPI application
  - Agent implementation in `backend/routers/search/`
  - Tests in `backend/tests/`
- **Data Pipeline:** `/model-lineage/` - HuggingFace scraper
  - Graph builder and Neo4j client
  - Tests in `model-lineage/tests/`
- **CI/CD Configuration:**
  - `.github/workflows/backend-ci.yml` - Backend testing and linting
  - `model-lineage-ci.yml` - Data pipeline testing
  - Automated testing on push/PR
- **Docker Configuration:**
  - `docker-compose.yml` - Multi-service orchestration
  - Individual Dockerfiles in each service directory
- **Dummy Neo4j Container:** `/neo4j/` - Dummy Neo4j container pulled directly from the official Docker image for deployment

Passing CI Run Screenshots:

- Located in `docs/ms4/cicd/`
- Test coverage in `docs/ms5/test-coverage.md`
View Live CI/CD:
- Check the GitHub Actions tab for recent workflow runs
- Coverage reports available in workflow artifacts
- Docker Desktop installed (Get Docker)
- Docker Compose (included with Docker Desktop)
- Python 3.13+ (for local development)
- Node.js 18+ and npm (for frontend development)
- uv package manager (Install uv)
- Google Cloud Platform (GCP) Account with billing enabled
- GCP Project with the following APIs enabled:
- Compute Engine API
- Service Usage API
- Cloud Resource Manager API
- Artifact Registry API
- Kubernetes Engine API
- Pulumi CLI (installed in deployment container)
- gcloud CLI (installed in deployment container)
- kubectl (installed in deployment container)
- OpenAI API Key - Required for LLM-based dataset extraction
- Get from: https://platform.openai.com/api-keys
- HuggingFace Token - Required for accessing HuggingFace Hub
- Get from: https://huggingface.co/settings/tokens
- Neo4j Credentials - Required for graph database
- Option 1: Use Neo4j AuraDB (cloud) - Get from https://neo4j.com/cloud/aura/
- Option 2: Use local Neo4j instance via Docker Compose
Two service accounts are required for Kubernetes deployment:
- `deployment` service account with roles:
  - Compute Admin
  - Compute OS Login
  - Artifact Registry Administrator
  - Kubernetes Engine Admin
  - Service Account Admin
  - Service Account User
  - Storage Admin
- `gcp-service` service account with roles:
  - Storage Object Viewer
  - Vertex AI Administrator
  - Artifact Registry Reader
See deployment/README.md for detailed service account setup instructions.
1. Copy the environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` with your credentials:

   ```bash
   # Required: Your OpenAI API key
   OPENAI_API_KEY=sk-proj-...

   # Required: Your HuggingFace token
   HF_TOKEN=hf_...

   # Required: Your Neo4j credentials
   NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
   NEO4J_USER=neo4j
   NEO4J_PASSWORD=your-password
   ```

Note: For local Neo4j, you can use the Docker Compose setup, which provides a local instance. For cloud Neo4j, use your cloud instance URI.
This is the easiest way to run the full application stack locally.
1. Copy the environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` with your credentials (see Environment Configuration above)

3. Start all services:

   ```bash
   docker compose up --build
   ```

   Or run in detached mode:

   ```bash
   docker compose up -d --build
   ```

4. Access the application:
   - Frontend: http://localhost:3000
   - Chatbot: http://localhost:3000/chatbot
   - Backend API: http://localhost:8000
   - API Docs: http://localhost:8000/docs
   - Neo4j Browser (if using local): http://localhost:7474
For developers who want to run services separately for debugging or development:
1. Navigate to the `model-lineage` directory
2. Follow the instructions in `model-lineage/QUICKSTART.md` to set up DVC and Neo4j
3. Run `docker compose up` to create your Neo4j instance with DVC
1. Navigate to the `backend` directory

2. Install dependencies:

   ```bash
   uv sync
   ```

3. Run the FastAPI development server:

   ```bash
   uv run fastapi dev main.py
   ```

   For production, use:

   ```bash
   uv run uvicorn main:app --host 0.0.0.0 --port 8000
   ```

4. Test the API at http://localhost:8000/docs
1. Navigate to the `frontend` directory

2. Install dependencies:

   ```bash
   npm install
   ```

3. Start the development server:

   ```bash
   npm run dev
   ```

4. Access the frontend at http://localhost:3000
DataDetox is deployed to Google Kubernetes Engine (GKE) using Pulumi for infrastructure as code. The deployment process is automated through GitHub Actions CI/CD.
```mermaid
graph LR
    B[Browser] -->|ip.sslip.io| LB[LoadBalancer Service<br/>External IP]
    LB --> I[Nginx Ingress Controller]
    I -->|/ path| F[Frontend Service<br/>ClusterIP:3000]
    I -->|/backend/ path| A[Backend Service<br/>ClusterIP:8000]
    A -->|bolt://neo4j:7687| N[Neo4j Service<br/>ClusterIP:7474/7687]
    N -.->|one-time load| J[Model Lineage Job]
    style LB fill:yellow
    style I fill:lightblue
    style F fill:lightgreen
    style A fill:lightgreen
    style N fill:lightgreen
    style J fill:orange
```
1. Set up GCP Service Accounts (see Prerequisites above)

2. Place service account keys in the `secrets/` directory:
   - `secrets/deployment.json` - Deployment service account key
   - `secrets/gcp-service.json` - GCP service account key

3. Configure the deployment container:
   - Edit `deployment/docker-shell.sh` and set `GCP_PROJECT` to your project ID
1. Enter the deployment container:

   ```bash
   cd deployment
   sh docker-shell.sh
   ```

2. Build and push images to Google Container Registry:

   ```bash
   cd deploy_images
   pulumi stack init dev  # First time only
   pulumi config set gcp:project <your-project-id> --stack dev
   pulumi up --stack dev -y
   ```

   This builds and pushes:
   - Backend container
   - Frontend container
   - Neo4j container
   - Model-lineage container

3. Deploy the Kubernetes cluster and services:

   ```bash
   cd ../deploy_k8s
   pulumi stack init dev  # First time only
   pulumi config set gcp:project <your-project-id>
   pulumi config set security:gcp_service_account_email deployment@<your-project-id>.iam.gserviceaccount.com --stack dev
   pulumi config set security:gcp_ksa_service_account_email gcp-service@<your-project-id>.iam.gserviceaccount.com --stack dev
   pulumi up --stack dev --refresh -y
   ```

4. Get the application URL from the Pulumi output:

   ```
   Outputs:
       app_url: "http://34.9.143.147.sslip.io"
   ```

5. Access your deployed application at the `app_url`
The model-lineage deployment starts with 0 replicas. To populate the Neo4j database:

```bash
# Get cluster credentials
gcloud container clusters get-credentials <cluster-name> --zone us-central1-a --project <project-id>

# Scale up the model-lineage deployment
kubectl scale deployment/model-lineage --replicas=1 -n datadetox-namespace

# Wait for the pod to be ready
kubectl wait --for=condition=ready pod -l app=model-lineage -n datadetox-namespace

# Exec into the container and run the scraper
kubectl exec -it deployment/model-lineage -n datadetox-namespace -- bash
uv run python lineage_scraper.py --full --limit 25000  # WARNING: this can take a long time; adjust the limit as needed

# Scale back down when done
kubectl scale deployment/model-lineage --replicas=0 -n datadetox-namespace
```

The application is automatically deployed when changes are merged to the main branch:
1. GitHub Actions workflow runs:
   - Unit tests (backend, model-lineage)
   - Integration tests
   - Coverage checks (minimum 60%)
   - Docker image builds
   - Kubernetes deployment

2. View deployment status:
   - Check the GitHub Actions tab: https://github.com/kushal-chat/actions
   - View the workflow: `.github/workflows/datadetox-cicd.yml`
The Kubernetes deployment supports horizontal pod autoscaling, which automatically scales the number of pods in response to load. To manually scale:

```bash
# Scale backend
kubectl scale deployment/backend --replicas=3 -n datadetox-namespace

# Scale frontend
kubectl scale deployment/frontend --replicas=2 -n datadetox-namespace

# View current replicas
kubectl get deployments -n datadetox-namespace
```

For detailed deployment instructions, see deployment/README.md.
1. Start all services using Docker Compose:

   ```bash
   docker compose up
   ```

2. Access the chatbot:
   - Navigate to http://localhost:3000/chatbot
   - Enter queries about models (e.g., "Tell me about BERT models" or "What datasets were used to train qwen3-4b?")
   - The system will:
     - Search HuggingFace for model information
     - Query Neo4j for model lineage relationships
     - Extract training dataset information from arXiv papers
     - Assess dataset risks
     - Display results in an interactive graph

3. Explore the API:
   - Visit http://localhost:8000/docs for interactive API documentation
   - Test the `/backend/flow/search` endpoint with sample queries
Here are some example queries you can try:

- **Model Information:** "Tell me about BERT models"
- **Model Lineage:** "What are the upstream dependencies of qwen3-4b?"
- **Dataset Extraction:** "What models are trained on GSM8K?"
- **Graph Exploration:** "Show me the complete lineage graph for this model"
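The same queries can also be sent to the backend programmatically. Below is a minimal sketch that assumes the `/backend/flow/search` endpoint accepts a JSON body with a `query` field — that body shape is an assumption, so verify it against the interactive docs at http://localhost:8000/docs before relying on it:

```python
# Build a POST request for the search endpoint using only the stdlib.
# The {"query": ...} body shape is an assumption; check /docs for the
# actual request schema.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_search_request(query: str) -> urllib.request.Request:
    """Construct a POST request for the search endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/backend/flow/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("What are the upstream dependencies of qwen3-4b?")
# With the stack running:  response = urllib.request.urlopen(req, timeout=60)
print(req.full_url)  # → http://localhost:8000/backend/flow/search
```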
```bash
# Start services
docker compose up

# Start in background
docker compose up -d

# Rebuild containers
docker compose up --build

# Stop services
docker compose down

# View logs
docker compose logs -f

# View specific service logs
docker compose logs -f backend
docker compose logs -f frontend

# Restart a service
docker compose restart backend
```

```bash
# Get cluster credentials
gcloud container clusters get-credentials <cluster-name> --zone us-central1-a --project <project-id>

# View all resources
kubectl get all -n datadetox-namespace

# View pods
kubectl get pods -n datadetox-namespace

# View logs
kubectl logs -f deployment/backend -n datadetox-namespace

# Scale deployments
kubectl scale deployment/backend --replicas=3 -n datadetox-namespace

# Delete deployment
kubectl delete deployment backend -n datadetox-namespace
```

The application consists of the following services:
- **Backend** - FastAPI application
  - Handles search agent queries
  - Connects to Neo4j and HuggingFace
  - Provides REST API endpoints
- **Frontend** - React + Vite application
  - User interface for the chatbot
  - Pre-configured to connect to the backend at http://localhost:8000
- **Neo4j** - Graph database for model lineage
  - Browser UI at http://localhost:7474
  - Can use cloud Neo4j instead (configure in `.env`)
- **Model-Lineage Scraper**
  - Scrapes HuggingFace model relationships
  - Populates the Neo4j database
  - Run via: `docker compose run model-lineage-scraper uv run python lineage_scraper.py --full`
- Check that CI/CD tests pass in the GitHub Actions tab
- View coverage reports in `.github/workflows/` or the `coverage.xml` files
- Run tests locally:

  ```bash
  # Backend tests
  cd backend && uv run pytest

  # Model-lineage tests
  cd model-lineage && uv run pytest
  ```
- **Neo4j Database Population**
  - The Neo4j database must be manually populated using the model-lineage scraper
  - Large-scale scraping (the full HuggingFace catalog) can take several hours
  - The database starts empty in new deployments (intentional, to avoid overwriting an existing database and wasting time on redeployment)
- **Dataset Extraction from arXiv**
  - Dataset extraction relies on parsing PDF content, which may not always be accurate
  - Some papers may not contain explicit dataset information
  - LLM-based extraction requires OpenAI API access and incurs API costs
- **Model Coverage**
  - Only models available on HuggingFace Hub are searchable
  - Models not in the Neo4j database will not show lineage relationships
  - Limited to publicly available model information
- **Rate Limiting**
  - The HuggingFace API has rate limits that may affect large batch operations
  - OpenAI API rate limits apply to LLM-based extraction
  - Neo4j query performance depends on graph size
- **Scalability**
  - The current deployment uses a single Neo4j instance (not clustered)
  - Large graph queries may be slow on very large datasets
  - Frontend and backend scale horizontally, but Neo4j is a single point of failure
- **Error Handling**
  - Some external API failures may not be gracefully handled
  - Network timeouts may cause incomplete results
  - PDF parsing failures are logged but may not provide user feedback
- **Environment Variables**
  - Missing environment variables may cause silent failures
  - Ensure all required API keys are set in `.env` or Kubernetes secrets
- **Docker Compose Networking**
  - Services must be started in the correct order (Neo4j before the backend)
  - Port conflicts may occur if services are already running
- **Kubernetes Deployment**
  - First-time deployment requires manual service account setup
  - Pulumi state is stored in GCS; ensure bucket permissions are correct
  - SSL/TLS setup is optional and requires additional configuration
- **Test Coverage**
  - Some async operations have limited unit test coverage (see test-coverage.md)
  - Integration tests require proper mocking of external services
  - CI/CD tests may fail if external API keys are not configured
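One way to mitigate the environment-variable issue above is a fail-fast startup check. A minimal sketch, assuming the required variables match the `.env` example in this README (adapt the list to your deployment):

```python
# Surface missing configuration at startup instead of failing silently later.
import os

# Variable names taken from the .env example in this README.
REQUIRED_VARS = ["OPENAI_API_KEY", "HF_TOKEN", "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD"]

def missing_env_vars(env) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_env_vars(os.environ)
if missing:
    print("WARNING: missing environment variables: " + ", ".join(missing))
```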
- **For Neo4j Population:** Use the model-lineage scraper with the `--limit` flag to populate a subset of models first
- **For API Rate Limits:** Implement retry logic with exponential backoff (partially implemented)
- **For PDF Parsing Failures:** The system falls back to LLM-based extraction when PDF parsing fails
- **For Missing Lineage Data:** The system will still provide HuggingFace search results even without Neo4j data
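The retry-with-backoff workaround can be sketched generically as follows. This is an illustration of the pattern, not the project's actual (partial) implementation:

```python
# Generic retry with exponential backoff and jitter, useful around
# rate-limited calls to the HuggingFace or OpenAI APIs.
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn(), retrying on exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))  # add jitter
```

In practice you would narrow the `except` clause to the specific rate-limit or transient-network exceptions of the client library in use.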
- Add a caching layer for frequently accessed models to reduce latency
- Improve PDF parsing accuracy with better extraction algorithms
- Add support for more model repositories beyond HuggingFace
- Add monitoring to proactively surface risks associated with datasets and models
- Enhance Neo4j scalability with Neo4j clustering
- Add comprehensive error handling and user feedback
- Add monitoring and alerting for production deployments
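The caching-layer improvement could start as small as an in-process TTL cache in front of HuggingFace/Neo4j lookups. A sketch under that assumption — the class name, TTL, and keys are placeholders, not existing project code:

```python
# Minimal time-to-live cache: returns a cached value while fresh,
# re-runs the fetch function once the entry expires.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, timestamp)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() on miss or expiry."""
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]
        value = fetch()
        self._store[key] = (value, now)
        return value
```

For multi-replica deployments a shared cache (e.g. Redis) would be needed instead, since each backend pod would otherwise hold its own copy.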