A production-grade Scope 3 carbon emissions intelligence platform processing 10M+ supplier shipment records, attributing carbon to the SKU level using EPA v1.4.0 emission factors, scoring supplier risk with LightGBM, and surfacing real-time anomalies via WebSocket.
╔══════════════════════════════════════════════════════════════════╗
║ INGESTION — 100 events/sec ║
║ ║
║ Shipment Producer ──▶ Kafka ──▶ PySpark Bronze ──▶ S3/MinIO ║
║ (US Census- market Structured Delta Lake ║
║ calibrated) .shipments Streaming raw layer ║
║ │ ║
║ PROCESSING ▼ ║
║ PySpark Silver ║
║ EPA v1.4.0 calc ║
║ emissions_kg_co2e ║
║ │ ║
║ WAREHOUSE + ORCHESTRATION ▼ ║
║ PostgreSQL (Neon) ║
║ Gold layer tables ║
║ │ ║
║ Airflow DAG ──▶ LightGBM ║
║ 15-min schedule risk scoring ║
║ ║
║ REAL-TIME ANOMALY DETECTION ║
║ ║
║ Kafka ──▶ Z-score detector ──▶ emissions_alerts ──▶ WebSocket ║
║ stream rolling 30-event PostgreSQL sub-second ║
║ window per supplier broadcast ║
║ ║
║ SERVING ║
║ ║
║ FastAPI ──▶ React + Mapbox ──▶ verdant.nikhilgiridharan.com ║
║ 15 REST interactive NL query · Scenarios ║
║ endpoints world map PDF ESG report ║
╚══════════════════════════════════════════════════════════════════╝
The EPA publishes Scope 3 factors in kg CO₂e per USD spent on transport services. Verdant stores physical shipment data (weight_kg, distance_km), requiring a derivation step:
Formula:
kg_co2e_per_tonne_km = epa_factor_per_usd × cost_usd_per_tonne_km
Industry-average cost rates (2024):
AIR: $0.85/tonne-km → 0.644 × 0.85 = 0.5474 kg CO₂e/tonne-km
OCEAN: $0.04/tonne-km → 0.583 × 0.04 = 0.0233 kg CO₂e/tonne-km
TRUCK: $0.12/tonne-km → 0.767 × 0.12 = 0.0920 kg CO₂e/tonne-km
RAIL: $0.05/tonne-km → 0.154 × 0.05 = 0.0077 kg CO₂e/tonne-km
Emissions per shipment:
emissions_kg = (weight_kg / 1000) × distance_km × kg_co2e_per_tonne_km
Every fact_emissions row carries emissions_factor_version = 'v1.4.0' for full audit trail. When EPA releases v1.5.0, a single seed file update recalibrates all calculations.
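A minimal sketch of the derivation, with the constants above hard-coded; the function and field names are illustrative rather than the actual Silver-layer transform:

```python
# Illustrative only: derive kg CO2e per tonne-km from EPA's USD-based factors,
# then compute per-shipment emissions, mirroring the formulas above.
EPA_FACTOR_PER_USD = {"AIR": 0.644, "OCEAN": 0.583, "TRUCK": 0.767, "RAIL": 0.154}
COST_USD_PER_TONNE_KM = {"AIR": 0.85, "OCEAN": 0.04, "TRUCK": 0.12, "RAIL": 0.05}

def kg_co2e_per_tonne_km(mode: str) -> float:
    # (kg CO2e / USD) x (USD / tonne-km) = kg CO2e / tonne-km
    return EPA_FACTOR_PER_USD[mode] * COST_USD_PER_TONNE_KM[mode]

def shipment_emissions_kg(weight_kg: float, distance_km: float, mode: str) -> float:
    return (weight_kg / 1000) * distance_km * kg_co2e_per_tonne_km(mode)

# Example: 500 kg flown 8,000 km -> 0.5 t x 8,000 km x 0.5474 = ~2,190 kg CO2e
row = {
    "emissions_kg": round(shipment_emissions_kg(500, 8000, "AIR"), 1),
    "emissions_factor_version": "v1.4.0",  # audit-trail column on fact_emissions
}
```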
A dedicated Kafka consumer (anomaly_stream.py) reads from the same shipment-events topic as the batch pipeline but operates independently. For each event:
For each supplier, maintain a deque(maxlen=30) of recent emissions values.
When a new event arrives:
z_score = (emissions_kg - mean(window)) / stdev(window)
if z_score > 2.5: flag as SPIKE (CRITICAL if z_score > 4.0)
if emissions_kg > mean × 3: flag as THRESHOLD_BREACH
if intensity > category_avg × 2: flag as ANOMALY
On detection:
1. INSERT into emissions_alerts (PostgreSQL) — persisted
2. Broadcast to all WebSocket clients — sub-second delivery
The consumer never blocks the batch pipeline — two independent consumers, same topic, different consumer groups.
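A minimal sketch of that detector loop using the kafka-python client; the event field names, warm-up guard, and consumer settings are assumptions, and the real anomaly_stream.py also persists alerts to Postgres and broadcasts them over WebSocket:

```python
# Sketch of the per-supplier rolling z-score check described above.
# Assumes JSON events carrying supplier_id and emissions_kg fields.
import json
from collections import defaultdict, deque
from statistics import mean, stdev

from kafka import KafkaConsumer  # kafka-python

windows = defaultdict(lambda: deque(maxlen=30))  # rolling 30-event window per supplier

consumer = KafkaConsumer(
    "shipment-events",
    bootstrap_servers="localhost:29092",
    group_id="anomaly-detector",      # separate consumer group from the batch pipeline
    value_deserializer=lambda b: json.loads(b),
)

for msg in consumer:
    event = msg.value
    window = windows[event["supplier_id"]]
    if len(window) >= 5 and stdev(window) > 0:   # wait for some history before scoring
        z = (event["emissions_kg"] - mean(window)) / stdev(window)
        if z > 2.5:
            critical = z > 4.0
            # here the real consumer INSERTs into emissions_alerts and notifies WebSocket clients
            print(f"SPIKE supplier={event['supplier_id']} z={z:.1f} critical={critical}")
    window.append(event["emissions_kg"])
```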
No public database of real supplier-to-SKU shipment records exists. The producer generates synthetic events calibrated to 2024 US Census Bureau international trade data (api.census.gov — no login required):
Key calibration findings vs naive assumptions:
Mexico: 16.2% of US imports (#2 source, was assumed 10%)
Canada: 9.8% of US imports (#3 source, was missing entirely)
Ocean: 72.8% of volume (was underweighted)
Electronics by air: 35% (high value/weight ratio — was assumed 20%)
Seasonal factors from Census monthly data:
November: 1.20× baseline (pre-holiday inventory)
February: 0.78× baseline (Chinese New Year disruption)
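For illustration, a minimal sketch of how the producer could apply these calibrated shares and seasonal factors; the structure, function names, and the even split among non-ocean modes are placeholders, not the actual trade_calibration.py:

```python
# Illustrative calibration hooks using only the figures quoted above;
# the full country/mode/seasonal tables live in trade_calibration.py.
import random

MODE_SHARE = {"OCEAN": 0.728}           # ocean carries 72.8% of volume
SEASONAL_FACTOR = {11: 1.20, 2: 0.78}   # November surge, February dip; other months default to 1.0

def monthly_rate(base_events_per_sec: float, month: int) -> float:
    """Scale PRODUCER_RATE by the Census-derived seasonal factor."""
    return base_events_per_sec * SEASONAL_FACTOR.get(month, 1.0)

def pick_mode(rng: random.Random) -> str:
    """Draw a transport mode; the even split among the rest is a placeholder."""
    if rng.random() < MODE_SHARE["OCEAN"]:
        return "OCEAN"
    return rng.choice(["AIR", "TRUCK", "RAIL"])
```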
Features engineered from shipment_silver_summary per supplier:
features = [
'emissions_30d_kg', # total last 30 days
'emissions_90d_kg', # total last 90 days
'air_pct', # % of shipments by air
'ocean_pct', # % of shipments by ocean
'truck_pct', # % of shipments by truck
'rail_pct', # % of shipments by rail
'avg_carbon_intensity', # kg CO₂e per kg shipped
'weight_volatility', # std dev of shipment weight
'shipment_count_30d', # volume proxy
]
# Labels from emissions percentiles:
# CRITICAL: >90th   HIGH: 75-90th   MEDIUM: 50-75th   LOW: <50th

LightGBM was chosen over XGBoost for native categorical support (country, transport mode) and 3× faster training on tabular data. Model artifacts are versioned: every supplier_risk_scores row carries model_version = 'lgbm-1.0'. Risk scores refresh every 15 minutes via Airflow.
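A condensed sketch of the training step, assuming a pandas feature frame; the hyperparameters and the added categorical columns (country, primary_mode) are illustrative rather than the actual train_risk_model.py, which also logs runs to MLflow:

```python
# Sketch: percentile-based risk labels plus LightGBM with native categorical features.
import lightgbm as lgb
import pandas as pd

def make_labels(df: pd.DataFrame) -> pd.Series:
    """Map 30-day emissions to risk tiers by percentile, as described above."""
    pct = df["emissions_30d_kg"].rank(pct=True)
    return pd.cut(pct, bins=[0, 0.5, 0.75, 0.9, 1.0],
                  labels=["LOW", "MEDIUM", "HIGH", "CRITICAL"],
                  include_lowest=True).astype(str)

def train(df: pd.DataFrame) -> lgb.LGBMClassifier:
    features = ["emissions_30d_kg", "emissions_90d_kg", "air_pct", "ocean_pct",
                "truck_pct", "rail_pct", "avg_carbon_intensity",
                "weight_volatility", "shipment_count_30d",
                "country", "primary_mode"]            # categorical columns are assumed names
    df = df.copy()
    for col in ("country", "primary_mode"):           # LightGBM handles these natively
        df[col] = df[col].astype("category")
    model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
    model.fit(df[features], make_labels(df),
              categorical_feature=["country", "primary_mode"])
    return model
```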
Users ask questions in plain English. Claude API converts them to SQL, executes against Neon, and generates a one-sentence insight:
User: "Which suppliers in China are getting worse this month?"
Claude API → SQL:
SELECT s.name, r.risk_tier,
ROUND(r.emissions_30d_kg, 0) as emissions_30d_kg,
r.emissions_trend
FROM supplier_risk_scores r
JOIN suppliers s ON r.supplier_id = s.supplier_id
WHERE s.country = 'CN'
AND r.emissions_trend = 'WORSENING'
ORDER BY r.emissions_30d_kg DESC NULLS LAST
LIMIT 10
Claude API → Insight:
"8 Chinese suppliers are worsening, led by Foxconn Vietnam
with 2,847 kg CO₂e in the last 30 days — 34% above their
90-day baseline."
Safety: only SELECT queries are permitted; any generated SQL that is not a single SELECT statement is rejected before execution.
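A sketch of how that guard can be enforced; the helper names and the injected generate_sql/execute callables are stand-ins for the Claude and Neon calls, not the actual nl_query.py:

```python
# Sketch of the SELECT-only guard applied before any generated SQL is executed.
import re

def is_safe_select(sql: str) -> bool:
    """Reject anything that is not a single read-only SELECT statement."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                                    # no statement chaining
        return False
    if not re.match(r"(?is)^\s*select\b", stmt):
        return False
    banned = ("insert", "update", "delete", "drop", "alter", "create", "truncate", "grant")
    return not any(re.search(rf"(?is)\b{kw}\b", stmt) for kw in banned)

def run_nl_query(question: str, generate_sql, execute) -> dict:
    """generate_sql: NL -> SQL via Claude; execute: run SQL against Neon (both injected)."""
    sql = generate_sql(question)
    if not is_safe_select(sql):
        raise ValueError("Only SELECT queries are permitted")
    return {"sql": sql, "rows": execute(sql)}
```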
Given a target reduction percentage, the API calculates the minimum set of transport mode switches needed:
# Greedy approach: sort switches by savings_kg descending,
# select until cumulative savings >= target
pathway = []
cumulative = 0
for supplier in air_suppliers.order_by('-mode_emissions'):
savings = calc_switch_savings(supplier, 'AIR', 'OCEAN')
pathway.append({**supplier, 'savings_kg': savings})
cumulative += savings
if cumulative >= target_savings_kg:
break
# Returns ordered action list with cumulative impact at each step

| Metric | Result |
|---|---|
| Pipeline throughput | 100 events/sec (configurable via PRODUCER_RATE) |
| Local pipeline scale | 10M+ shipment records |
| Live database | 50,000 pre-aggregated records (Neon free tier) |
| Anomaly detection latency | Sub-second (Kafka → WebSocket) |
| API cold start (Render free) | ~30 seconds after 15-min idle |
| API response time (warm) | <200ms p99 |
| Risk scoring cycle | Every 15 minutes via Airflow |
| Suppliers tracked | 500 across 9 countries |
| SKUs attributed | 2,000 across 10 categories |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/v1/emissions/summary` | Total CO₂ YTD, MTD, supplier count |
| GET | `/api/v1/emissions/timeseries` | Daily/weekly/monthly trend data |
| GET | `/api/v1/emissions/by-transport-mode` | Split by AIR/OCEAN/TRUCK/RAIL |
| GET | `/api/v1/emissions/by-country` | Aggregated by supplier origin country |
| GET | `/api/v1/emissions/decarbonization-pathway` | Greedy pathway to reduction target |
| GET | `/api/v1/suppliers` | Paginated supplier list with risk scores |
| GET | `/api/v1/suppliers/map-data` | Optimized for Mapbox GL rendering |
| GET | `/api/v1/suppliers/{id}` | Full supplier profile |
| GET | `/api/v1/suppliers/benchmarks` | Intensity vs category average |
| GET | `/api/v1/skus` | SKU list with emissions summary |
| GET | `/api/v1/skus/{id}/sankey` | Sankey diagram data (supplier → mode → SKU) |
| GET | `/api/v1/forecasts/supplier/{id}` | 30/60/90-day emissions forecast |
| POST | `/api/v1/nl/query` | Natural language → SQL → result |
| GET | `/api/v1/pipeline/status` | All component health + Kafka lag |
| GET | `/api/v1/report/generate` | Download Scope 3 PDF report |
| WS | `/ws/alerts` | Real-time anomaly alert stream |
| WS | `/ws/pipeline` | Pipeline status stream |
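As a quick smoke test against a locally running API (see the quick start later in this README); the response fields noted in the comment are per the table above, everything else is standard requests usage:

```python
# Minimal REST smoke test; the local API serves on http://localhost:8000 per the quick start.
import requests

BASE_URL = "http://localhost:8000"

resp = requests.get(f"{BASE_URL}/api/v1/emissions/summary", timeout=30)
resp.raise_for_status()
print(resp.json())  # YTD/MTD totals and supplier count
```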
| Layer | Technology | Why |
|---|---|---|
| Ingestion | Apache Kafka (MSK-compatible) | Decouples producers from consumers, enables replay |
| Processing | PySpark Structured Streaming | Glue-compatible, handles 10M+ records |
| Storage | PostgreSQL (Neon serverless) | Zero-cost, scale-to-zero, production Postgres |
| Orchestration | Apache Airflow | Industry standard for scheduled, dependency-aware batch pipelines |
| ML — risk scoring | LightGBM | Native categorical support, 3× faster than XGBoost on tabular |
| ML — forecasting | XGBoost | Point forecast + prediction interval |
| ML — tracking | MLflow | Experiment versioning, artifact storage |
| NL Query | Claude API (claude-sonnet) | SQL generation + insight summarization |
| Emission factors | EPA v1.4.0 NAICS-6 | Official government source, October 2025 |
| API | FastAPI | Async, auto OpenAPI, Pydantic validation |
| Dashboard | React + Vite + Mapbox GL JS | Interactive globe, supplier nodes |
| Charts | Recharts + D3 Sankey | Trend lines, SKU attribution diagram |
| Containerization | Docker Compose | One-command local stack |
| Deployment | Neon + Render + Vercel | $0/month, no credit card |
| Source | Description | Access |
|---|---|---|
| EPA v1.4.0 | Supply Chain GHG Emission Factors (Oct 2025) | Free download |
| US Census Bureau | 2024 international trade flow calibration | Free public API |
| Synthetic shipments | 10M+ records, US Census-calibrated distributions | Generated by producer |
| Real supplier names | Publicly disclosed supplier lists (Apple, auto OEMs) | Public sustainability reports |
Visit verdant.nikhilgiridharan.com — no setup required.
Note: The Render API may take ~30 seconds to wake from idle on first request (free tier). The dashboard will show data once the API responds.
Prerequisites:
- Docker + Docker Compose
- Python 3.11+
- Node.js 18+
- Mapbox account (free token at mapbox.com)
# Clone
git clone https://github.com/nikhilgiridharan/verdant.git
cd verdant
# Configure environment
cp .env.production.example .env
# Required: DATABASE_URL, VITE_MAPBOX_TOKEN
# Optional: ANTHROPIC_API_KEY (NL query), RESEND_API_KEY (email digest)
# Start full local stack (Kafka, Spark, Airflow, MLflow, API)
make up
# Seed the database (creates schema + 500 suppliers + 50k shipments)
DATABASE_URL=<your_neon_url> python3 scripts/seed_neon.py
# Start dashboard
cd dashboard && npm install && npm run dev
# Dashboard: http://localhost:3000
# API docs: http://localhost:8000/docs
# Airflow: http://localhost:8080
# MLflow: http://localhost:5000

make up          # Start all services
make down        # Stop all services
make logs        # Tail all logs
make suppliers   # Regenerate supplier coordinates
make dbt-run     # Run dbt models (if Snowflake configured)
make test        # Run test suite
make clean       # Stop + remove volumes

| Variable | Description | Required |
|---|---|---|
| `DATABASE_URL` | Neon PostgreSQL connection string | Yes |
| `VITE_API_BASE_URL` | FastAPI backend URL | Yes |
| `VITE_MAPBOX_TOKEN` | Mapbox GL JS public token | Yes |
| `KAFKA_BOOTSTRAP_SERVERS` | Kafka broker (local: `localhost:29092`) | Local only |
| `ANTHROPIC_API_KEY` | Claude API for NL query layer | Optional |
| `RESEND_API_KEY` | Resend for weekly email digest | Optional |
| `PRODUCER_RATE` | Events/second (default: 100) | Optional |
See .env.production.example for full reference.
verdant/
├── ingestion/
│ └── producer/
│ ├── shipment_producer.py # Kafka event producer, 100 events/sec
│ ├── generate_supplier_coords.py # City-anchored supplier coordinates
│ └── trade_calibration.py # US Census 2024 distributions
├── processing/
│ ├── spark_processor.py # PySpark Bronze: Kafka → Delta Lake
│ └── silver_transformer.py # PySpark Silver: EPA emissions calc
├── ml/
│ ├── train_risk_model.py # LightGBM supplier risk classifier
│ ├── anomaly_stream.py # Real-time Kafka anomaly detector
│ └── models/ # Trained model artifacts
├── api/
│ ├── main.py # FastAPI app, WebSocket endpoints
│ ├── routers/
│ │ ├── emissions.py # Emissions endpoints + pathway
│ │ ├── suppliers.py # Supplier endpoints + benchmarks
│ │ ├── skus.py # SKU attribution + Sankey
│ │ ├── nl_query.py # Claude API NL → SQL
│ │ ├── report.py # PDF ESG report generation
│ │ └── pipeline.py # Health + alert endpoints
│ └── db/
│ └── connection.py # Neon connection pool
├── dashboard/
│ └── src/
│ ├── views/ # Overview, Suppliers, SKU Trace,
│ │ # Network, Scenarios, Forecast,
│ │ # AskVerdant, Introduction
│ └── components/
│ ├── map/ # Mapbox GL, supplier nodes, heatmap
│ ├── panels/ # Supplier intel, alert feed
│ └── charts/ # Recharts, D3 Sankey
├── infrastructure/
│ └── airflow/dags/
│ ├── carbonpulse_full_pipeline.py # 15-min orchestration DAG
│ └── score_suppliers.py # LightGBM scoring job
├── data/
│ └── seeds/
│ ├── suppliers.csv # 500 suppliers, city-anchored
│ └── epa_emission_factors.csv # EPA v1.4.0 NAICS-6 factors
├── scripts/
│ └── seed_neon.py # Database seeder
├── docs/
│ └── decisions/ # Architecture Decision Records
├── docker-compose.yml # Full local stack
└── Makefile # make up · make test · make clean
| ADR | Decision |
|---|---|
| ADR-001 | DuckDB for local dev, Neon PostgreSQL for production |
| ADR-002 | Kafka + Spark Structured Streaming for sub-minute freshness |
| ADR-003 | EPA v1.4.0 cost-to-tonne-km derivation methodology |
Built by three collaborators with deliberate ownership separation — mirroring how ESG platforms are built in production:
- Data Engineer — pipeline infrastructure, emissions calculations, API, anomaly detection, deployment
- Data Scientist — LightGBM risk scoring, XGBoost forecasting, MLflow experiment tracking
- Supply Chain Analyst — EPA factor validation, KPI framework, domain validation
Zero-cost production deployment:
| Service | Purpose | Cost |
|---|---|---|
| Neon | PostgreSQL (serverless, scale-to-zero) | Free forever |
| Render | FastAPI backend (750 hrs/month) | Free tier |
| Vercel | React dashboard (static hosting, Hobby tier) | Free forever |
Total monthly cost: $0.00
The Render free tier spins down after 15 minutes of inactivity (~30s cold start). UptimeRobot pings /health every 5 minutes to keep the API warm.
MIT