Knowledge Transfer Document — Data Engineering Team

Purpose: If the current team leads leave tomorrow, this document should allow the next person to pick up where they left off with zero handholding.

Audience: Team leads and senior contributors. For onboarding, see README.md. For dev workflow, see CONTRIBUTING.md. For task status, see TASK_TRACKER.md.

Architecture & Data Flow

Current Pipeline

External Data (Wikipedia, NGO sites)
        │ Web scraping (Python)
        ▼
┌──────────────┐    ┌──────────────────┐
│  Saayam DB   │    │  GenAI Lambda    │
│ (PostgreSQL) │    │ (More_Org_GenAI  │
│              │    │  _Py_v3126)      │
└──────┬───────┘    └────────┬─────────┘
       └────────┬────────────┘
                ▼
     saayam-org-aggregator Lambda
     (merges sources, deduplicates)
                │
                ▼
     API Gateway → webapp frontend

Future Pipeline (Rao's Directive)

PostgreSQL → S3 Data Lake → Vectorize → Vector DB → AI Agent

Not yet built. The team still needs to produce a functional spec and a design spec for it.

How the Data Repo Connects to Other Repos

Only 7 of 40+ repos are actively developed:

webapp (React) ←── api (API Gateway) ←── mobileapp
        │                  │
        ▼                  ▼
   volunteer          request
  (Java/Spring)    (Help requests)
        │                  │
        ▼                  ▼
      database (PostgreSQL)
              │
              ▼
    data (Python) ← YOU ARE HERE
              │
              ▼
       ai (Python/Flask — GenAI)

   devsecops — CI/CD, infra (all teams)

Cross-Team Dependencies

| Team | Repo | How We Interact |
| --- | --- | --- |
| GenAI / AI | ai | We invoke their Lambda. Future: we feed vectorized data to their agent. |
| Frontend | webapp | They consume our Lambda endpoints. #99 is a cross-team task. |
| Backend / API | api | They set up API Gateway routes to our Lambdas. |
| Database | database | We read from their DB. Coordinate for schema changes. |
| DevSecOps | devsecops | They manage AWS infra our Lambdas run on. |
| Product | prod | Defines what we build. MVP Pages wiki. |

Technical Details of Completed Work

Organization Aggregator Lambda (#98)

Lambda that accepts a help request (subject, description, location, category) and fetches matching orgs from:

Saayam DB — registered orgs (tagged "verified")
GenAI Lambda — AI-suggested orgs (tagged "genai")

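The Lambda merges the two result sets, deduplicates them (the DB record wins when an org appears in both), and returns a unified list. It degrades gracefully, so a failure in one source should not fail the whole request.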

Lambda: saayam-org-aggregator (us-east-1)

Input:

{
  "category": "Shelter",
  "subject": "Shelter",
  "description": "i need a place to stay",
  "location": "tampa"
}

Output:

{
  "statusCode": 200,
  "body": [
    {
      "name": "The Salvation Army Tampa",
      "location": "Tampa, FL",
      "contact": "(813) 223-1320",
      "email": "...",
      "web_url": "...",
      "mission": "...",
      "source": "..."
    }
  ]
}
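
The merge rule described above (DB results win over GenAI duplicates, and a failing source does not fail the request) can be sketched as follows. This is a minimal illustration, not the deployed handler: the names `aggregate_orgs`, `fetch_db`, and `fetch_genai` are hypothetical, and using the lowercased org name as the deduplication key is an assumption.

```python
from typing import Callable, Dict, List

Org = Dict[str, str]


def _dedup_key(org: Org) -> str:
    # Assumption: two records describe the same org when their names
    # match after trimming whitespace and lowercasing.
    return org.get("name", "").strip().lower()


def _safe_fetch(fetch: Callable[[dict], List[Org]], request: dict) -> List[Org]:
    # Graceful degradation: a failing source contributes no results
    # instead of failing the whole request.
    try:
        return fetch(request)
    except Exception:
        return []


def aggregate_orgs(
    request: dict,
    fetch_db: Callable[[dict], List[Org]],
    fetch_genai: Callable[[dict], List[Org]],
) -> List[Org]:
    merged: Dict[str, Org] = {}
    # DB results are added first, so they win over GenAI duplicates.
    for source, fetch in (("verified", fetch_db), ("genai", fetch_genai)):
        for org in _safe_fetch(fetch, request):
            key = _dedup_key(org)
            if key and key not in merged:
                merged[key] = {**org, "source": source}
    return list(merged.values())
```

Passing the two fetchers in as arguments keeps the merge logic testable locally, without AWS access or a database connection.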

Emergency Contact Data Pipeline

Scrapes emergency contact numbers from Wikipedia → cleans with pandas → inserts into PostgreSQL via SQLAlchemy.

Files: src/scrapers/emergency_contacts/ — scraper.py, cleaner.py, loader.py.
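
To make the clean and load stages concrete, here is a dependency-free sketch. It uses the standard library in place of pandas and SQLAlchemy, and an in-memory SQLite database in place of PostgreSQL, so it runs without any setup. The column names (`country`, `service`, `number`), the table name, and the cleaning rules are assumptions and may not match what `cleaner.py` and `loader.py` actually do.

```python
import re
import sqlite3
from typing import Dict, Iterable, List

Row = Dict[str, str]


def clean_rows(raw_rows: Iterable[Row]) -> List[Row]:
    """Strip footnote markers, drop incomplete rows, remove duplicates."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        country = row.get("country", "").strip()
        service = row.get("service", "").strip().lower()
        # Scraped Wikipedia cells often carry footnote markers such as "[1]".
        number = re.sub(r"\[[^\]]*\]", "", row.get("number", "")).strip()
        if not country or not number:
            continue
        key = (country, service, number)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"country": country, "service": service, "number": number})
    return cleaned


def load_rows(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Insert cleaned rows and return how many were newly written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emergency_contacts ("
        "country TEXT NOT NULL, service TEXT NOT NULL, number TEXT NOT NULL, "
        "UNIQUE (country, service, number))"
    )
    before = conn.total_changes
    # INSERT OR IGNORE makes re-running the loader safe (idempotent).
    conn.executemany(
        "INSERT OR IGNORE INTO emergency_contacts "
        "VALUES (:country, :service, :number)",
        list(rows),
    )
    conn.commit()
    return conn.total_changes - before
```

The unique constraint plus `INSERT OR IGNORE` means the load step can be re-run after a partial failure without creating duplicate rows.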

NGO Web Scrapers

Country-specific scrapers for nonprofit listings: src/scrapers/ngo/afghanistan.py, india.py, malaysia.py. They run independently and produce CSVs, and are not yet part of an automated pipeline.
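
If the scrapers are later moved into a pipeline, a shared CSV writer would keep their output columns consistent. The sketch below is hypothetical: the `write_orgs_csv` helper and the `FIELDS` column set (borrowed from the aggregator's output fields) are assumptions, not the scrapers' current schema.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Hypothetical column set, borrowed from the aggregator's output fields.
FIELDS = ["name", "location", "contact", "email", "web_url", "mission"]


def write_orgs_csv(records: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write scraped org records as CSV; return the number of rows written."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        extrasaction="ignore",  # drop keys that are not in FIELDS
        restval="",             # leave missing fields empty
    )
    writer.writeheader()
    count = 0
    for record in records:
        if not record.get("name", "").strip():
            continue  # skip rows the scraper could not name
        writer.writerow(record)
        count += 1
    return count


# In-memory demonstration with a placeholder record.
buffer = io.StringIO()
write_orgs_csv([{"name": "Example Org", "location": "Kabul"}], buffer)
```

When writing to disk, open the file with `newline=""` and `encoding="utf-8"` so the csv module controls line endings.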

Language Detection & Translation

src/translation/lang_detection.py — detects language with langdetect, translates to English with GoogleTranslator.
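
The detect-then-translate flow can be sketched as below. The detector and translator are passed in as plain callables so the logic runs offline; the `to_english` helper is hypothetical. The commented wiring assumes `GoogleTranslator` comes from the `deep-translator` package, which the source does not confirm.

```python
from typing import Callable


def to_english(
    text: str,
    detect: Callable[[str], str],
    translate: Callable[[str], str],
) -> str:
    """Return `text` in English, translating only when it is not already English."""
    stripped = text.strip()
    if not stripped:
        return ""  # detectors generally cannot classify empty input
    if detect(stripped) == "en":
        return stripped
    return translate(stripped)


# Wiring with the real libraries (needs `pip install langdetect deep-translator`
# and network access for the translation call):
#
#   from langdetect import detect
#   from deep_translator import GoogleTranslator
#
#   translator = GoogleTranslator(source="auto", target="en")
#   to_english("necesito un lugar para quedarme", detect, translator.translate)
```

Skipping the translation call for text that is already English avoids an unnecessary network round trip.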

Fraud Detection Model (Schema Only)

src/models/fraud_requests.py — SQLAlchemy model for fraud requests. Schema defined, no active detection logic built.

Earlier Pipeline Work (#55-#67)

ETL architecture design (#57), Aurora schema for nonprofits (#56), AWS architecture doc (#55), IRS S3 Lambda (#60), Charity Navigator scraper (#62), IRS nonprofit categorization (#67). These informed the current pipeline, but IRS data was later dropped (see Decision Log).

AWS Infrastructure

| Service | Purpose | Access |
| --- | --- | --- |
| Lambda | Serverless functions | Team leads only |
| S3 | Data lake, datasets | Team leads only |
| Aurora PostgreSQL | Primary database | Team leads only |
| API Gateway | Routes to Lambdas | API/DevSecOps team |

Invoking other Lambdas:

import boto3

client = boto3.client('lambda', region_name='us-east-1')
response = client.invoke(FunctionName='More_Org_GenAI_Py_v3126', ...)
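
A fuller sketch of a synchronous invocation, including decoding the result and surfacing function errors, might look like this. The `invoke_lambda` helper and `FakeLambdaClient` stub are hypothetical, and the payload fields in the commented usage assume the GenAI Lambda accepts the same request shape as the aggregator. The stub lets the helper be exercised locally, in line with the local-first policy.

```python
import io
import json
from typing import Any


def invoke_lambda(client: Any, function_name: str, payload: dict) -> Any:
    """Invoke a Lambda synchronously and return its decoded JSON result."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(payload).encode("utf-8"),
    )
    body = response["Payload"].read()
    # Lambda sets FunctionError when the invoked function itself raised.
    if "FunctionError" in response:
        raise RuntimeError(f"{function_name} failed: {body.decode('utf-8')}")
    return json.loads(body)


class FakeLambdaClient:
    """Stand-in for the boto3 client so the helper can be tried without AWS access."""

    def invoke(self, **kwargs: Any) -> dict:
        request = json.loads(kwargs["Payload"])
        reply = json.dumps({"echo": request}).encode("utf-8")
        return {"StatusCode": 200, "Payload": io.BytesIO(reply)}


# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("lambda", region_name="us-east-1")
#   orgs = invoke_lambda(client, "More_Org_GenAI_Py_v3126",
#                        {"subject": "Shelter", "location": "tampa"})
```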

Decision Log

| Date | Decision | Rationale |
| --- | --- | --- |
| Feb 2026 | Dropped IRS data from org-aggregator | Too noisy, not useful for matching. Aggregator uses Saayam DB + GenAI only. |
| Feb 2026 | #99 is cross-team (data + webapp) | Lambda done. Frontend work remains and needs a React-comfortable volunteer. |
| 2025 | Local-first development | AWS access cannot be given to all volunteers, for cost and security reasons. |
| 2025 | Pair programming mandate | 97% churn means no single person should own a task alone. |
| Apr 2025 | Aurora PostgreSQL as primary DB | PostgreSQL compatible, managed AWS, scalable. |

Known Issues & Technical Debt

No tests. No unit tests have been written yet.
No CI/CD. No GitHub Actions workflows exist for automated testing or linting.
Stale issues (#80-#90). These need triage: reassign or close each one.

Handoff Checklist

If you are leaving the team lead role:

Last updated: February 2026 · Maintained by: Data Engineering Team Leads
