Notebook: 00_data_preparation.ipynb
Status: ✅ Complete
Output Dataset: aadhaar_demographics_cleaned.csv
This phase establishes the foundational dataset for the entire ADIEWS framework by cleaning, transforming, and enriching raw Aadhaar demographic update records. The preparation ensures data quality, consistency, and readiness for downstream analysis.
- Source Files: Multiple CSV files in the `DemographicData/` folder
- Initial Records: 2,375,882 raw entries
- Date Range: March 2025 - January 2026 (10 months)
- Geographic Scope: 1,056 districts across India
Columns:
- date: Update date (YYYY-MM-DD format)
- pincode: 6-digit area code
- district: District name
- state: State/UT name
- child_updates: Updates for ages 5-17
- adult_updates: Updates for ages 17+
- Approach: Complete case analysis (remove rows with any null values)
- Rationale: Geographic identifiers (pincode, district, state) are critical for spatial analysis
- Result: No rows were dropped; all 2,375,882 records were already complete (100% retained)
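The complete-case rule above can be sketched with `DataFrame.dropna`. This is a minimal illustration on a toy frame, not the notebook's actual code; the sample rows and the `retained_pct` name are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the raw records; the values are illustrative only.
raw = pd.DataFrame({
    "date": ["2025-03-01", "2025-03-01", None],
    "pincode": ["110001", None, "560001"],
    "district": ["New Delhi", "Pune", "Bengaluru"],
    "state": ["Delhi", "Maharashtra", "Karnataka"],
    "child_updates": [2, 0, 1],
    "adult_updates": [10, 5, 7],
})

# Complete case analysis: drop every row that has at least one null value.
cleaned = raw.dropna(how="any").reset_index(drop=True)

# Share of rows that survived the filter (hypothetical helper variable).
retained_pct = 100 * len(cleaned) / len(raw)
print(f"Retained {len(cleaned)} of {len(raw)} rows ({retained_pct:.1f}%)")
```

On the real dataset the same call would be expected to leave the row count unchanged, since the quality metrics report zero missing values.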
- Conversion: String dates → `datetime64[ns]` type
- Validation: Ensure all dates fall within expected range
- Temporal Features: Extract month, year for aggregation
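A hedged sketch of the date steps listed above, assuming the `date` column arrives as `YYYY-MM-DD` strings. The range bounds are inferred from the stated March 2025 to January 2026 coverage and are an assumption, as are the sample values.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2025-03-15", "2025-12-01", "2026-01-20"]})

# Strict parsing: raises if any value is not in YYYY-MM-DD form.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Assumed bounds, taken from the documented coverage window.
start, end = pd.Timestamp("2025-03-01"), pd.Timestamp("2026-01-31")
in_range = df["date"].between(start, end)
if not in_range.all():
    raise ValueError(f"{(~in_range).sum()} dates fall outside the expected range")

# Temporal features used later for aggregation.
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
```

Passing an explicit `format` keeps malformed strings from being silently reinterpreted, which matters when several source CSV files are concatenated.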
- Pincode Format: Verify 6-digit numeric codes
- District Names: Standardize spelling and formatting
- State Mapping: Validate against known state list
- Deduplication: Remove duplicate pincode-district-date combinations
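The geographic checks above might look like the following. This is a sketch under assumptions: `KNOWN_STATES` is a hypothetical, truncated reference set, and title-casing is only one possible standardization rule (the notebook's actual rule is not documented here).

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-03-01", "2025-03-01", "2025-03-02", "2025-03-02"],
    "pincode": ["110001", "110001", "56001", "400001"],
    "district": [" new delhi", "New Delhi", "Bengaluru", "Mumbai"],
    "state": ["Delhi", "Delhi", "Karnataka", "Maharashtra"],
})

# Hypothetical reference list; a real one would cover every state and UT.
KNOWN_STATES = {"Delhi", "Karnataka", "Maharashtra"}

# Pincode format: exactly six digits, kept as strings.
df["pincode"] = df["pincode"].astype(str)
valid_pin = df["pincode"].str.fullmatch(r"\d{6}")

# District names: trim whitespace and normalize case (assumed rule).
df["district"] = df["district"].str.strip().str.title()

# State mapping: keep only states found in the reference list.
valid_state = df["state"].isin(KNOWN_STATES)

# Filter invalid rows, then drop duplicate pincode-district-date combinations.
df = df[valid_pin & valid_state]
df = df.drop_duplicates(subset=["pincode", "district", "date"]).reset_index(drop=True)
```

Standardizing district names before deduplicating matters: the first two toy rows only collapse into one record once their spellings agree.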
- Non-Negative Constraint: Ensure child_updates ≥ 0, adult_updates ≥ 0
- Outlier Detection: Flag extreme values (>99.9th percentile)
- Data Type: Convert to `int64` for update counts
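The numeric checks above can be sketched as follows. The `*_outlier` flag columns are names invented for this example, and computing the 99.9th percentile per column is an assumption; the notebook may compute it differently.

```python
import pandas as pd

df = pd.DataFrame({
    "child_updates": [0.0, 2.0, 1.0, 3.0],
    "adult_updates": [5.0, 10.0, 7.0, 900.0],
})
count_cols = ["child_updates", "adult_updates"]

# Non-negative constraint: fail fast if any count is below zero.
assert (df[count_cols] >= 0).all().all(), "negative update counts found"

# Data type: store counts as 64-bit integers.
df = df.astype({col: "int64" for col in count_cols})

# Outlier detection: flag (do not drop) values above the 99.9th percentile.
for col in count_cols:
    threshold = df[col].quantile(0.999)
    df[f"{col}_outlier"] = df[col] > threshold
```

Flagging rather than removing keeps genuine surges in the data for later layers to inspect. Note that on a frame this small the maximum of each column is always flagged.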
- `total_updates = child_updates + adult_updates`: Represents total demographic activity per record.
- `child_share_pct = (child_updates / total_updates) * 100`: Indicates proportion of updates involving children (ages 5-17).
- `age_ratio = adult_updates / (child_updates + 1)`: Measures adult-to-child update intensity; the `+1` avoids division by zero.
- `month = date.month`, `year = date.year`, `month_name = date.strftime('%Y-%m')`: Enable time-series and seasonal analysis.
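The derived features above translate to vectorized pandas operations roughly as shown below. One point the formulas leave open is what happens when `total_updates` is zero; this sketch assumes such rows should get `NaN` for `child_share_pct`, which is a choice made for the example rather than documented behavior.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-15", "2025-12-01", "2026-01-20"]),
    "child_updates": [2, 0, 0],
    "adult_updates": [8, 5, 0],
})

df["total_updates"] = df["child_updates"] + df["adult_updates"]

# Assumption: a zero total yields NaN instead of a division error or inf.
df["child_share_pct"] = (
    df["child_updates"] / df["total_updates"].replace(0, np.nan) * 100
)

# +1 in the denominator avoids division by zero when child_updates is 0.
df["age_ratio"] = df["adult_updates"] / (df["child_updates"] + 1)

# Temporal features; month_name holds a year-month key such as "2025-03".
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["month_name"] = df["date"].dt.strftime("%Y-%m")
```

With `NaN` in place, a later `mean` over `child_share_pct` skips the undefined rows by default instead of being skewed by them.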
Grouping: [district, state, month]
Aggregations:
- child_updates: sum, mean, std
- adult_updates: sum, mean, std
- total_updates: sum
- child_share_pct: mean
- pincode_count: unique count

Records: 10,560 (1,056 districts × 10 months)
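The district-level grouping above maps naturally onto pandas named aggregation. The sketch below uses toy rows, and the output column names (`child_updates_sum` and so on) are assumptions; only the grouping keys and aggregation functions come from the specification.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["A", "A", "A", "B"],
    "state": ["S1", "S1", "S1", "S2"],
    "month": [3, 3, 4, 3],
    "pincode": ["110001", "110002", "110001", "560001"],
    "child_updates": [2, 0, 1, 3],
    "adult_updates": [8, 5, 7, 9],
})
df["total_updates"] = df["child_updates"] + df["adult_updates"]
df["child_share_pct"] = df["child_updates"] / df["total_updates"] * 100

df_monthly_district = (
    df.groupby(["district", "state", "month"], as_index=False)
      .agg(
          child_updates_sum=("child_updates", "sum"),
          child_updates_mean=("child_updates", "mean"),
          child_updates_std=("child_updates", "std"),
          adult_updates_sum=("adult_updates", "sum"),
          adult_updates_mean=("adult_updates", "mean"),
          adult_updates_std=("adult_updates", "std"),
          total_updates_sum=("total_updates", "sum"),
          child_share_pct_mean=("child_share_pct", "mean"),
          pincode_count=("pincode", "nunique"),
      )
)
```

The state-level and pincode-level tables follow the same pattern with different grouping keys. Groups containing a single row get `NaN` for the `std` columns, because the sample standard deviation is undefined for one observation.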
Grouping: [state, month]
Aggregations:
- child_updates: sum
- adult_updates: sum
- district_count: unique count

Records: 630 (63 states × 10 months)
Grouping: [pincode, district, state, month]
Aggregations:
- child_updates: sum
- adult_updates: sum
- child_share_pct: mean

Records: ~200,000 pincode-month combinations
| Metric | Value |
|---|---|
| Total Records | 2,375,882 |
| Complete Records | 2,375,882 (100%) |
| Missing Values | 0 |
| Duplicate Records | 0 |
Child Updates (ages 5-17):
Mean: 1.9 per record
Median: 0 (44% are zero)
Max: 2,690
Total: 4.5M updates
Adult Updates (ages 17+):
Mean: 19.1 per record
Median: 5
Max: 16,166
Total: 45.4M updates
| File | Format | Records | Description |
|---|---|---|---|
| `aadhaar_demographics_cleaned.csv` | CSV | 2.4M | Cleaned daily pincode-level data |
| `df_monthly_district.csv` | CSV | 10,560 | District monthly aggregates |
| `df_monthly_state.csv` | CSV | 630 | State monthly aggregates |
| `df_monthly_pincode.csv` | CSV | ~200K | Pincode monthly aggregates |
| `district_summary.csv` | CSV | 1,056 | District-level summary statistics |
| `pincode_summary.csv` | CSV | ~20K | Pincode-level summary statistics |
- 1,056 districts across 63 states/UTs
- ~20,000 unique pincodes
- Comprehensive national coverage
- Peak Month: December 2025 (10.51M updates)
- Lowest Month: January 2026 (583K updates)
- 18× variation between peak and trough
- Child Updates: 9.07% of total
- Adult Updates: 90.93% of total
- Zero-Child Records: 44% of all records
- ✅ No negative update counts
- ✅ All dates within valid range
- ✅ All pincodes 6-digit numeric
- ✅ All districts mapped to valid states
- ✅ No orphaned geographic records
- ✅ Total updates = child + adult (all records)
- ✅ Child share ≤ 100% (all records)
- ✅ Age ratio ≥ 0 (all records)
- ✅ Monthly aggregates sum to raw totals
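Several of the checks above can be expressed as a small table of boolean tests. This is an illustrative sketch on toy data; the `checks` dictionary and its key names are assumptions, not the notebook's actual validation code.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [3, 3, 4],
    "child_updates": [2, 0, 1],
    "adult_updates": [8, 5, 7],
})
df["total_updates"] = df["child_updates"] + df["adult_updates"]
df["child_share_pct"] = df["child_updates"] / df["total_updates"] * 100
df["age_ratio"] = df["adult_updates"] / (df["child_updates"] + 1)

# Monthly aggregate used for the reconciliation check.
monthly = df.groupby("month", as_index=False)["total_updates"].sum()

checks = {
    "no_negative_counts": bool((df[["child_updates", "adult_updates"]] >= 0).all().all()),
    "total_is_child_plus_adult": bool(
        (df["total_updates"] == df["child_updates"] + df["adult_updates"]).all()
    ),
    "child_share_le_100": bool((df["child_share_pct"].dropna() <= 100).all()),
    "age_ratio_non_negative": bool((df["age_ratio"] >= 0).all()),
    "monthly_sums_match_raw": bool(monthly["total_updates"].sum() == df["total_updates"].sum()),
}

failed = [name for name, ok in checks.items() if not ok]
print("All checks passed" if not failed else f"Failed checks: {failed}")
```

Collecting results in a dictionary makes it easy to report every failing check at once rather than stopping at the first one.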
- Univariate Analysis → Explore distributions
- Bivariate Analysis → Identify relationships
- Trivariate Analysis → Uncover complex patterns
- Layer 1: Migration Radar → Detect population movements
- Layer 2: Child Risk Map → Identify documentation gaps
- Layer 3: System Intelligence → Assess stability
- Layer 4: Early Warning → Generate alerts
- Data Types: Optimized int32/int64 for memory efficiency
- Indexing: Date and district columns indexed for faster queries
- Pickle Serialization: `.pkl` files for faster loading in subsequent notebooks
- Random Seed: Not applicable (deterministic processing)
- Environment: Python 3.8+, pandas 1.3+, numpy 1.21+
- Execution Time: ~5 minutes on standard hardware
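The dtype, indexing, and pickle notes above could be combined roughly as follows. This is a sketch under assumptions: the `.pkl` file name is hypothetical, and downcasting to `int32` only when the column maximum fits is one possible policy, not necessarily the one the notebook uses.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2025-03-02", "2025-03-01"]),
    "district": ["B", "A"],
    "child_updates": np.array([0, 2], dtype="int64"),
    "adult_updates": np.array([5, 8], dtype="int64"),
})

# Downcast non-negative counts to int32 when the largest value fits.
for col in ["child_updates", "adult_updates"]:
    if df[col].max() <= np.iinfo(np.int32).max:
        df[col] = df[col].astype("int32")

# Index on date and district, sorted so label-based lookups are efficient.
indexed = df.set_index(["date", "district"]).sort_index()

# Round-trip through pickle; the file name here is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "aadhaar_demographics_cleaned.pkl"
    indexed.to_pickle(path)
    restored = pd.read_pickle(path)
```

Unlike CSV, pickle preserves dtypes and the index, so downstream notebooks do not need to re-parse dates or re-cast counts. Pickle files should only be loaded from trusted sources.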
Last Updated: January 2026
Maintainer: ADIEWS Project Team