Data Enrichment System

An AI-powered, multi-agent pipeline that automatically discovers, scrapes, geocodes, validates, and persists comprehensive school data into a Supabase database — orchestrated entirely by CrewAI agents connected to your database via a live MCP server.

Overview

The Data Enrichment System solves a critical data completeness problem: school databases often have missing fields such as addresses, enrollment numbers, phone numbers, and GPS coordinates. This system automates filling those gaps by:

Querying a Supabase database for schools with incomplete records
Scraping public school review websites for the missing details
Geocoding school addresses into precise latitude/longitude coordinates
Validating all data against strict schema rules
Writing the enriched, validated records back to Supabase

The entire pipeline is orchestrated by four specialised CrewAI agents that pass context between themselves in a sequential workflow. The system supports both live mode (real web scraping + real database writes) and mock mode (deterministic test data) for safe development and testing.

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        ENTRY POINTS                                 │
│  run_batch_schools.py  │  continuous_processing.py  │  main.py      │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │     CrewAI Orchestrator  │
                    │        (crew.py)         │
                    └────────────┬────────────┘
                                 │  Sequential Process
          ┌──────────────────────┼───────────────────────┐
          │                      │                       │
   ┌──────▼──────┐      ┌────────▼────────┐    ┌────────▼────────┐    ┌────────────────┐
   │  Researcher  │─────▶│    Scraper      │───▶│   Geocoder      │───▶│    Reporter    │
   │    Agent     │      │    Agent        │    │    Agent        │    │    Agent       │
   └──────┬───────┘      └────────┬────────┘    └────────┬────────┘    └───────┬────────┘
          │                       │                      │                     │
   ┌──────▼───────┐      ┌────────▼────────┐    ┌────────▼────────┐    ┌───────▼────────┐
   │ SupabaseTool  │      │  ScrapingTool   │    │ GeocodingTool   │    │ SupabaseTool   │
   │ get_schools   │      │ scrape_private  │    │    geocode      │    │ update_school  │
   └──────┬───────┘      │ scrape_public   │    └────────┬────────┘    └───────┬────────┘
          │              └────────┬────────┘             │                     │
          │                       │                      │                     │
   ┌──────▼───────────────────────▼──────────────────────▼─────────────────────▼────────┐
   │                         SUPABASE DATABASE                                          │
   │                    (via MCP Server + Direct SDK)                                   │
   └─────────────────────────────────────────────────────────────────────────────────────┘

CrewAI Agent Pipeline

The pipeline runs sequentially — each agent receives the full output of all previous agents as context before acting. This allows the Reporter to compile and validate data from all three prior stages without any extra retrieval calls.

1. Researcher Agent

Property	Detail
Role	School Data Senior Data Researcher
Goal	Identify schools with incomplete records in Supabase
Tool	`SupabaseTool` → `action: get_schools`

What it does:

Calls the Supabase database to retrieve a configurable batch of schools
Inspects each record for missing fields: address, city, zip, total_student_enrollment, latitude, longitude
Validates that every school_id is a proper UUID before including it
Produces a structured JSON list of schools with their missing fields for downstream agents
Automatically retries with a smaller limit if the database call fails

Output format:

[
  {
    "school_id": "ee8981ae-7f29-47bf-968c-4829381e0559",
    "school_name": "SACRED HEART CATHOLIC HIGH SCHOOL",
    "missing_fields": ["total_student_enrollment", "latitude", "longitude"]
  }
]

2. Scraper Agent

Property	Detail
Role	School Data Scraper
Goal	Enrich school data from PrivateSchoolReview and PublicSchoolReview
Tool	`ScrapingTool` → `action: scrape_private` or `scrape_public`

What it does:

Classifies each school as private or public by scanning the school name for religious terms (Catholic, Christian, Lutheran, Baptist, Episcopal, Sacred Heart, etc.)
Routes to the correct scraping action:
- Private schools → privateschoolreview.com
- Public schools → publicschoolreview.com
Fetches and parses school pages using BeautifulSoup with rotating User-Agent headers and randomised delays (2–5 seconds) to avoid rate limiting
Retries up to 3 times with exponential backoff (tenacity) — on each retry it progressively simplifies the school name (removes special characters, tries abbreviations)
Extracts the following fields: total_student_enrollment, address, city, zip, phone, school_type, religious_orientation, days_in_school_year

Output format:

[
  {
    "school_id": "ee8981ae-7f29-47bf-968c-4829381e0559",
    "enriched_fields": {
      "total_student_enrollment": 500,
      "address": "123 Main St",
      "city": "Austin",
      "zip": 78701,
      "phone": "(512) 555-1234",
      "school_type": "REGULAR ELEMENTARY OR SECONDARY",
      "religious_orientation": "Christian"
    },
    "status": "success"
  }
]

3. Geocoder Agent

Property	Detail
Role	School Geocoding Specialist
Goal	Add precise GPS coordinates to each school record
Tool	`GeocodingTool` → `action: geocode`

What it does:

Accepts address components (address, city, state, zip) or a full location string
Calls Nominatim (OpenStreetMap) via geopy with a built-in RateLimiter (min 1 second between requests)
Falls back gracefully: if geocoding a full address fails, retries with city + state only; for PO Box addresses, falls back to city + state immediately
Validates results: checks that returned coordinates fall within the continental US bounding box (lat: 24–50, lng: -125 to -65)
Supports mock mode for testing — generates deterministic coordinates from an MD5 hash of the location string

Output format:

[
  {
    "school_id": "ee8981ae-7f29-47bf-968c-4829381e0559",
    "enriched_fields": {
      "latitude": 30.267153,
      "longitude": -97.743057
    },
    "status": "success"
  }
]

4. Reporter Agent

Property	Detail
Role	School Data Quality Specialist
Goal	Validate, compile, and persist all enriched data to Supabase
Tool	`SupabaseTool` → `action: update_school`

What it does:

Reads the full context from all three prior agents without any additional tool calls
Runs a strict data validation pass before writing anything:
- total_student_enrollment: must be an integer between 10 and 5,000
- latitude: must be a float between 24.0 and 50.0
- longitude: must be a float between -125.0 and -66.0
- phone: must match format (XXX) XXX-XXXX
- zip: must be a valid 5-digit (or ZIP+4) US postal code
Updates each school one at a time (with a 1-second pause between writes) for safe error isolation
Retries a failed write once after a 2-second wait before logging the failure and continuing
Produces a structured Markdown summary report with counts of successes and failures

Custom Tools

SupabaseTool

File: src/dbenc/tools/supabase_tool.py

A LangChain BaseTool that wraps the Supabase Python SDK and exposes a unified action-based interface to the CrewAI agents.

Action	Description
`get_schools`	Fetches schools needing enrichment; flags fields that are `null` or missing
`update_school`	Updates a single school by UUID with validated field data
`get_all_schools`	Retrieves all schools (used for monitoring/reporting)
`query`	Flexible filtered query against any Supabase table
`test`	Pings the database to verify connectivity
`initialize`	Seeds the database with a sample school record if empty

Connects using SUPABASE_URL + SUPABASE_ANON_KEY from the environment. Compatible with both positional (CrewAI 0.28.0) and keyword argument calling conventions.

ScrapingTool

File: src/dbenc/tools/scraping_tool.py

A LangChain BaseTool that scrapes school data from two public review websites using requests + BeautifulSoup.

Action	Target Site	School Type
`scrape_private`	privateschoolreview.com	Religious / independent schools
`scrape_public`	publicschoolreview.com	Public / state schools

Key implementation details:

Rotating User-Agents — cycles through 3 different browser signatures per request
Random delays — 2–5 second sleep between each HTTP request
Exponential backoff — via tenacity (stop_after_attempt(3), wait_exponential(min=2, max=10))
Mock mode — returns realistic deterministic data without making any HTTP calls; activated by --use_mock flag

GeocodingTool

File: src/dbenc/tools/geocoding_tool.py

A LangChain BaseTool that converts school addresses into GPS coordinates using Nominatim (OpenStreetMap) via geopy.

Key implementation details:

Rate-limited — geopy.extra.rate_limiter.RateLimiter enforces ≥1 second between requests
Smart fallback — automatically strips address components on retry (full address → city + state)
3-attempt retry loop with progressive address simplification
Mock mode — generates deterministic lat/lng from an MD5 hash of the input string, always within continental US bounds

Supabase MCP Server Integration

This project integrates with Supabase via the Model Context Protocol (MCP) — a standard that lets AI agents interact with external services through a structured server interface.

The MCP server exposes Supabase operations (read, write, query) directly to AI tooling without requiring custom API wrappers. It runs as a local Node.js process.

How it works

CrewAI Agent
     │
     │  JSON action call
     ▼
SupabaseTool (_run method)
     │
     │  Supabase Python SDK
     ▼
Supabase REST API ◄──── MCP Server (npx @supabase/mcp-server-supabase)
     │
     ▼
Supabase PostgreSQL Database

The MCP server is authenticated with your Supabase Personal Access Token (not the anon key), giving it elevated privileges for admin-level operations. This token must never be committed — it belongs only in your .env file.

Starting the MCP server

npx -y @supabase/mcp-server-supabase@latest --access-token=YOUR_SUPABASE_PAT

You can verify the connection using the included test file:

# Open in browser to test MCP connectivity
start mcp_test.html

Data Fields Enriched

Field	Type	Source	Validation
`address`	string	ScrapingTool	—
`city`	string	ScrapingTool	—
`zip`	int/string	ScrapingTool	US ZIP format
`phone`	string	ScrapingTool	`(XXX) XXX-XXXX`
`total_student_enrollment`	int	ScrapingTool	10 – 5,000
`school_type`	string	ScrapingTool	—
`religious_orientation`	string	ScrapingTool	Private schools only
`days_in_school_year`	int	ScrapingTool	—
`latitude`	float	GeocodingTool	24.0 – 50.0 (continental US)
`longitude`	float	GeocodingTool	-125.0 – -66.0 (continental US)

Installation

Prerequisites

Python >=3.10, <3.13
Node.js >=18
A Supabase project with a schools table

Python dependencies

This project uses UV for dependency management:

pip install uv
crewai install

Node.js dependencies (MCP server)

npm install @supabase/mcp-server-supabase punycode2

Configuration

Copy your credentials into a .env file (never commit this file):

OPENAI_API_KEY=sk-...
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_ANON_KEY=eyJ...
SUPABASE_ACCESS_TOKEN=sbp_...   # Personal Access Token for MCP server

To customise agent behaviour, edit these YAML files:

File	Purpose
`src/dbenc/config/agents.yaml`	Agent roles, goals, backstories, and tool instructions
`src/dbenc/config/tasks.yaml`	Task descriptions, expected output formats, and agent assignments

Running the Project

Step 1 — Start the MCP server

The MCP server must be running before you start processing schools with live data:

npx -y @supabase/mcp-server-supabase@latest --access-token=YOUR_SUPABASE_PAT

Step 2 — Prepare a batch of schools

Fetch unprocessed schools from Supabase and write them to a local JSON batch file:

python process_supabase_schools.py --batch_size 5

Step 3 — Run the CrewAI pipeline

# Single run, real data
python -m dbenc.main run --batch_size 5 --timeout 300

# Or use the optimised batch runner (recommended — manages context window)
python run_batch_schools.py --batch_size=2 --max_schools=10 --timeout=600

Parameter	Default	Recommended	Description
`--batch_size`	1	2–3	Schools per CrewAI run
`--max_schools`	10	10–50	Total schools to process
`--timeout`	300	600	Seconds per batch before timeout
`--use_mock`	off	—	Skip real scraping/geocoding (testing)

Step 4 — Automated continuous processing

To process schools in a fully automated loop across many batches:

python continuous_processing.py

Or run a fixed number of batches at once:

python src/batch_process.py --batch_size 5 --timeout 300 --batches 3

Step 5 — View results and monitor progress

# View all enriched schools in the database
python src/view_enriched_schools.py

# Monitor overall processing progress
python monitor_progress.py

# View which schools have already been processed
python process_supabase_schools.py --view-processed

Project Structure

dbenc/
├── src/
│   └── dbenc/
│       ├── config/
│       │   ├── agents.yaml          # Agent roles, goals, backstories & tool instructions
│       │   └── tasks.yaml           # Task descriptions, expected outputs & agent assignments
│       ├── tools/
│       │   ├── supabase_tool.py     # Supabase CRUD operations (get_schools, update_school, …)
│       │   ├── scraping_tool.py     # Web scraper (privateschoolreview / publicschoolreview)
│       │   └── geocoding_tool.py    # Nominatim geocoder with rate limiting & fallback
│       ├── crew.py                  # CrewAI Crew, Agent, Task wiring and execution
│       └── main.py                  # CLI entry point and batch orchestration
├── src/
│   ├── batch_process.py             # Multi-batch automation helper
│   ├── update_db_schools.py         # Manually push a repaired JSON file to Supabase
│   ├── view_enriched_schools.py     # Display enriched school records
│   ├── extract_school_data.py       # Extract + repair agent output JSON
│   ├── error_handling.py            # Shared retry and error utilities
│   └── get_schools_for_processing.py
├── school_output/                   # All school JSON output files (auto-created)
│   ├── results_*.json               # Raw CrewAI agent output per run
│   ├── batch_schools_*.json         # Batch processing output files
│   ├── real_school_*.json           # Single real-school run output
│   ├── single_school_*.json         # Single school processing output
│   ├── schools_to_process_*.json    # Batches prepared from Supabase
│   ├── temp_batch_*.json            # Temporary batch files (auto-cleaned)
│   └── processed_schools.json       # Deduplication tracker
├── repair_output/                   # Repaired/cleaned school JSON (auto-created)
│   └── repaired_school_updates_*.json  # Validated payloads ready for DB upload
├── docs/
│   ├── comprehensive_guide.md       # Full guide with code examples
│   ├── quick_start_guide.md         # Quick reference
│   ├── school_data_enrichment_workflow.md
│   ├── supabase_mcp_integration.md  # MCP server setup deep-dive
│   ├── supabase_mcp_prompts.md      # Example MCP prompts
│   ├── architecture_explanation.md  # System architecture deep-dive
│   ├── CHANGELOG.md                 # Change history
│   └── …other architecture docs
├── process_supabase_schools.py      # Fetch & prepare school batches from Supabase
├── run_batch_schools.py             # Optimised batch runner with context window management
├── continuous_processing.py         # Infinite processing loop for large datasets
├── monitor_progress.py              # Progress monitoring dashboard
├── mcp_test.html                    # Browser-based MCP connectivity tester
├── pyproject.toml                   # Python project metadata (UV)
├── package.json                     # Node.js dependencies (MCP server)
├── .env                             # ⚠️ Secret credentials — never commit
└── README.md

Troubleshooting

Agent & Pipeline Issues

Symptom	Fix
Context window exceeded	Reduce `--batch_size` to 1 or 2
API rate limits hit	Increase wait time between batches in `continuous_processing.py`
Geocoding returns wrong location	Check that the state code is a valid 2-letter US abbreviation
Scraping returns no results	Try removing special characters from the school name; check if the school type (public/private) is correctly detected
Reporter skips a school	Check validation: enrollment outside 10–5000, coordinates outside continental US, or invalid phone format

MCP Server Issues

Symptom	Fix
Connection refused	Confirm the MCP server process is still running
401 Unauthorized	Regenerate your Supabase Personal Access Token and update `.env`
Node.js errors	Run `npm install` and ensure Node.js `>=18` is installed
Supabase updates not appearing	Verify `SUPABASE_URL` and `SUPABASE_ANON_KEY` are correct

Reporter Agent DB Update Failures

Confirm school_id is a valid 36-character UUID string
Check that at least one field passes validation (invalid-only payloads are silently skipped)
Verify the Supabase table schema matches the field names exactly
Review logs — each validation failure is logged with the specific reason

Documentation

Comprehensive Guide — Code examples for live data, mock data, and batch processing
Workflow Documentation — End-to-end flow explanation
Quick Start Guide — Get running in minutes
Supabase MCP Integration — MCP server setup and usage
MCP Example Prompts — Ready-to-use prompts for MCP interactions

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
Architecture		Architecture
docs		docs
knowledge		knowledge
run_scripts		run_scripts
scrapping		scrapping
src/dbenc		src/dbenc
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run_workflow.sh		run_workflow.sh

Folders and files

Latest commit

History

Repository files navigation

Data Enrichment System

Table of Contents

Overview

System Architecture

CrewAI Agent Pipeline

1. Researcher Agent

2. Scraper Agent

3. Geocoder Agent

4. Reporter Agent

Custom Tools

SupabaseTool

ScrapingTool

GeocodingTool

Supabase MCP Server Integration

How it works

Starting the MCP server

Data Fields Enriched

Installation

Prerequisites

Python dependencies

Node.js dependencies (MCP server)

Configuration

Running the Project

Step 1 — Start the MCP server

Step 2 — Prepare a batch of schools

Step 3 — Run the CrewAI pipeline

Step 4 — Automated continuous processing

Step 5 — View results and monitor progress

Project Structure

Troubleshooting

Agent & Pipeline Issues

MCP Server Issues

Reporter Agent DB Update Failures

Documentation

External Resources

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages