This is your day-to-day reference for working in this codebase. Repo structure, setup, code standards, where to put files, and how to submit work — it's all here.
git clone -b main https://github.com/saayam-for-all/data.git
cd data
python -m venv venv
source venv/bin/activate # macOS/Linux — or venv\Scripts\activate on Windows
cd data-engineering
pip install -r requirements.txt
cp .env.example .env # Fill in your environment variables

You will not get AWS Lambda/S3 access. You develop and test locally with mock data. When your code works, let the team leads know; they handle AWS deployment. Structure your code so AWS calls can be easily mocked.
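One way to keep AWS calls mockable is to pass the client into your function instead of creating it inside the function. The sketch below is an illustration, not project code: `upload_cleaned_file`, the bucket name, and the key are hypothetical. `put_object` with `Bucket`, `Key`, and `Body` is the regular boto3 S3 client call, and a `MagicMock` stands in for the client locally.

```python
from unittest.mock import MagicMock


def upload_cleaned_file(s3_client, bucket: str, key: str, body: bytes) -> str:
    """Upload bytes to S3 and return the s3:// URI.

    The client is injected, so a test can pass a mock instead of
    a real boto3.client("s3").
    """
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"


# Local check with a mock client; no AWS credentials are needed.
fake_s3 = MagicMock()
uri = upload_cleaned_file(fake_s3, "example-bucket", "cleaned/ngo.csv", b"name,country\n")
fake_s3.put_object.assert_called_once_with(
    Bucket="example-bucket", Key="cleaned/ngo.csv", Body=b"name,country\n"
)
```

When the team leads deploy, the same function can be given a real client without changing its body.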
| Tech | Purpose |
|---|---|
| Python | Primary language |
| AWS Lambda | Serverless functions (team leads deploy — you don't get access) |
| AWS S3 | Data lake storage |
| PostgreSQL (Aurora) | Primary database |
| boto3 | AWS SDK for Python |
| SQLAlchemy | ORM for database interactions |
| pandas | Data cleaning and manipulation |
| Docker | Containerization |
data-engineering/
├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── KNOWLEDGE_TRANSFER.md
├── README.md
├── TASK_TRACKER.md
├── requirements.txt
│
├── datasets/
│ ├── cleaned/
│ └── raw/
│
├── infrastructure/
│ ├── deployment.yaml
│ ├── docker-compose.yml
│ ├── Dockerfile
│ └── service.yaml
│
├── notebooks/
│ └── analytics_sql_and_visualizations.ipynb
│
├── scripts/
│ └── deploy/
│ ├── deploy_aggregator.sh
│ └── deploy_categorizer.sh
│
├── src/
│ ├── __init__.py
│ ├── app.py
│ ├── config.py
│ ├── extensions.py
│ ├── main.py
│ │
│ ├── aggregator/
│ │ ├── __init__.py
│ │ ├── db_client.py
│ │ ├── genai_client.py
│ │ ├── handler.py
│ │ ├── merger.py
│ │ └── requirements.txt
│ │
│ ├── categorizer/
│ │ ├── __init__.py
│ │ ├── categories.py
│ │ ├── classifier.py
│ │ ├── handler.py
│ │ └── requirements.txt
│ │
│ ├── models/
│ │ ├── __init__.py
│ │ └── fraud_requests.py
│ │
│ ├── scrapers/
│ │ ├── __init__.py
│ │ ├── emergency_contacts/
│ │ │ ├── __init__.py
│ │ │ ├── cleaner.py
│ │ │ ├── loader.py
│ │ │ └── scraper.py
│ │ └── ngo/
│ │ ├── __init__.py
│ │ ├── afghanistan.py
│ │ ├── india.py
│ │ └── malaysia.py
│ │
│ ├── translation/
│ │ ├── __init__.py
│ │ └── lang_detection.py
│ │
│ └── utils/
│ └── __init__.py
│
└── tests/
data-analytics/
├── README.md
└── notebooks/
Do not self-assign tasks. Do not edit issue descriptions or user stories. Task assignment and issue management are the responsibility of team leads and project managers. If you want to work on something, let us know in the team meeting or WhatsApp group and we will assign it to you.
`<your_github_username>_<issue_number>_<brief_description>`
Example: `saquibb8_100_auto_categorize_requests`
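If you want to check a branch name before pushing, a small helper like the one below can do it. This is a sketch under assumptions: the regular expression is inferred from the naming format above (a GitHub username, a numeric issue number, and a lowercase description with underscores), and `parse_branch_name` is a hypothetical helper, not something that exists in the repo.

```python
import re
from typing import Optional

# Assumed pattern: <username>_<issue_number>_<brief_description>.
# GitHub usernames use letters, digits, and hyphens, so the first
# underscore marks the end of the username.
BRANCH_PATTERN = re.compile(
    r"^(?P<user>[A-Za-z0-9-]+)_(?P<issue>\d+)_(?P<desc>[a-z0-9_]+)$"
)


def parse_branch_name(name: str) -> Optional[dict]:
    """Return the parts of a branch name, or None if it does not match."""
    match = BRANCH_PATTERN.match(name)
    return match.groupdict() if match else None


parts = parse_branch_name("saquibb8_100_auto_categorize_requests")
```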
- Get assigned a task by a team lead or PM.
- Branch off `dev`: `git checkout -b <your_branch_name>`
- Develop and test locally with mock data.
- Commit with issue references: `git commit -m "#100: Add classification logic"`
- Push and create a PR targeting `dev` (never `main`). Assign reviewers.
- Address code review feedback. PRs need at least 2 reviews.
- Team lead merges after approval.
- ❌ Don't push directly to `main` or `dev`.
- ❌ Don't self-assign tasks or edit issue descriptions.
- ❌ Don't commit secrets, API keys, or AWS credentials.
- ❌ Don't disappear after being assigned a task.
- Python 3.10+, PEP 8, type hints where practical.
- `snake_case` for file names and functions, `PascalCase` for class names.
- Docstrings for all functions and classes.
- No credentials in code; use `.env` files.
- Never commit `__pycache__/`, `venv/`, `.env`, or IDE files.
- Update `requirements.txt` if you add dependencies.
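A short module that follows these standards might look like the sketch below. The variable names (`DB_HOST`, `DB_PORT`, `DB_NAME`) and the default values are hypothetical, and the code assumes the values from your `.env` file have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSettings:
    """Connection settings read from environment variables."""

    host: str
    port: int
    name: str


def load_database_settings() -> DatabaseSettings:
    """Build settings from the environment instead of hard-coding them.

    The variable names and defaults here are placeholders for illustration.
    """
    return DatabaseSettings(
        host=os.environ.get("DB_HOST", "localhost"),
        port=int(os.environ.get("DB_PORT", "5432")),
        name=os.environ.get("DB_NAME", "saayam_dev"),
    )
```

Because nothing sensitive is written in the file itself, it is safe to commit while `.env` stays out of git.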
All Lambda functions live under src/. Create a new folder for your Lambda:
src/
└── your_lambda_name/
├── __init__.py # Required - makes it a Python package
├── lambda_function.py # Entry point (must have lambda_handler function)
├── helpers.py # Supporting code (optional)
└── requirements.txt # Lambda-specific dependencies (lightweight only)
Steps:
- Create folder: `mkdir src/your_lambda_name`
- Add `__init__.py`: `touch src/your_lambda_name/__init__.py`
- Create `lambda_function.py` with a `lambda_handler(event, context)` function
- Add a `requirements.txt` with lightweight dependencies (heavy packages like pandas should be Lambda Layers)
- Push via PR → auto-deploys on merge to `main`
Reference: See src/saayam-org-aggregator/ for a complete working example.
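For orientation, a minimal `lambda_function.py` could look like the sketch below. Only the `lambda_handler(event, context)` signature comes from the steps above. The event field `name` and the `statusCode`/`body` response shape are assumptions for illustration, so check the reference Lambda for what your function should actually accept and return.

```python
import json
from typing import Any


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """Entry point invoked by AWS Lambda.

    The "name" field and the response shape are hypothetical.
    """
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }


if __name__ == "__main__":
    # Local run with a mock event; context is unused here, so None is fine.
    print(lambda_handler({"name": "Saayam"}, None))
```

Since the handler is a plain function, you can call it locally with a hand-written event and never touch AWS.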
Scrapers go under src/scrapers/. Choose the appropriate subfolder:
| Scraper Type | Location |
|---|---|
| Emergency contacts | src/scrapers/emergency_contacts/ |
| NGO data | src/scrapers/ngo/ |
| New category | Create src/scrapers/your_category/ |
Naming convention: Use lowercase with underscores (e.g., united_states.py, not UnitedStates.py)
Steps for a new scraper:
- Add your scraper file: `src/scrapers/ngo/country_name.py`
- If creating a new category: `mkdir src/scrapers/your_category`, then `touch src/scrapers/your_category/__init__.py`
- Output raw data to `datasets/raw/`
- Output cleaned data to `datasets/cleaned/`
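The output steps above can be sketched as follows. This is not the existing scraper code: the function names, the `name`/`country` columns, and the cleaning rule are all hypothetical, and the actual scraping (the HTTP requests and parsing) is left out so the example runs offline.

```python
import csv
from pathlib import Path

FIELDS = ["name", "country"]  # hypothetical columns for illustration


def clean_records(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Strip whitespace and drop rows that have no name."""
    cleaned = []
    for row in records:
        row = {key: value.strip() for key, value in row.items()}
        if row.get("name"):
            cleaned.append(row)
    return cleaned


def write_csv(records: list[dict[str, str]], path: Path) -> None:
    """Write records to path, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def run(project_root: Path, records: list[dict[str, str]]) -> None:
    """Save scraped rows to datasets/raw/ and cleaned rows to datasets/cleaned/."""
    write_csv(records, project_root / "datasets" / "raw" / "country_name.csv")
    write_csv(
        clean_records(records),
        project_root / "datasets" / "cleaned" / "country_name.csv",
    )
```

Passing `project_root` in as an argument keeps the paths relative to the project and makes the function easy to test against a temporary folder.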
Never commit large data files to git. The datasets/ folder is gitignored.
| Data State | Location |
|---|---|
| Raw/unprocessed | datasets/raw/ |
| Cleaned/processed | datasets/cleaned/ |
Use relative paths in your code:
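import os

# Two dirname() calls assume this file sits one folder below the project root;
# add or remove calls to match the file's actual depth.
PROJECT_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
raw_path = os.path.join(PROJECT_ROOT, 'datasets', 'raw', 'your_file.csv')

Database models go in src/models/: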
src/models/
├── __init__.py
├── fraud_requests.py # Existing model
└── your_model.py # Add new models here
Heads up: We're planning to migrate from PostgreSQL to a vector database (Redis, Pinecone, etc.) for certain use cases.
Current database code:
- Connection helpers: `src/utils/` (add `db_client.py` here)
- ORM models: `src/models/`
When adding database code:
- Don't scatter database connections across files
- Do use `src/utils/db_client.py` for all DB connections
- Do abstract your queries so they can be swapped out later
Example pattern:
# src/utils/db_client.py
class DatabaseClient:
    """Abstraction layer for database operations.

    When we migrate to Redis/vector store, we only need to
    change this file, not every file that uses the DB.
    """

    def __init__(self):
        # Currently PostgreSQL (get_postgres_connection is a placeholder, not defined here)
        self.conn = get_postgres_connection()

    def search_similar(self, query_vector):
        # TODO: Replace with vector store search
        pass

Reusable helpers (AWS clients, logging, config) go in src/utils/:
src/utils/
├── __init__.py
├── db_client.py # Database connections
├── aws_client.py # AWS SDK helpers
├── genai_client.py # OpenAI/Gemini wrappers
└── logger.py # Logging setup
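As an illustration of what a shared helper in this folder might contain, here is a minimal sketch of a `logger.py`. The function name `get_logger` and the log format are assumptions, not the project's existing code; only the standard library `logging` module is used.

```python
import logging


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with one stream handler and a consistent format."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = get_logger("scrapers.ngo")
log.info("Scraper started")
```

Keeping the setup in one place means scrapers and Lambdas log in the same format without each file configuring handlers itself.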
| Task | Create files in |
|---|---|
| New Lambda function | src/your_lambda/ |
| New scraper (existing category) | src/scrapers/category/ |
| New scraper category | src/scrapers/new_category/ |
| Database model | src/models/ |
| Shared utility | src/utils/ |
| Jupyter notebook | notebooks/ |
| Test file | tests/ |
| Deploy script | scripts/deploy/ |
| Docker/K8s config | infrastructure/ |
Last updated: February 2026