Skip to content

Latest commit

 

History

History
344 lines (249 loc) · 13.1 KB

File metadata and controls

344 lines (249 loc) · 13.1 KB

Bivariate Analysis - ADIEWS

Notebook: 02_bivariate_analysis.ipynb
Status: ✅ Complete
Visualizations: 5 PNG files


Overview

This phase examines relationships between two variables simultaneously to uncover patterns, correlations, and dependencies within Aadhaar demographic update data. The analysis reveals how child and adult update behaviors interact across geographic and temporal dimensions.


📊 Analysis Components

1. Child vs Adult Updates Correlation

Scatter Plot Analysis (bivariate_child_vs_adult_scatter.png)

Statistical Relationship:

Metric Value
Pearson Correlation 0.8507 (Strong Positive)
Linear Trend y = 0.1003x - 0.01
R² Value 0.724
Interpretation For every 10 adult updates, expect ~1 child update

Key Findings:

  • Strong positive correlation (r = 0.85) indicates areas with high adult activity also show higher child activity
  • Slope coefficient (0.1003) reveals systematic 10:1 adult-to-child ratio
  • Intercept near zero (-0.01) suggests proportional relationship across all volume levels
  • Outliers exist where child activity disproportionately high/low relative to adults

Real-World Interpretation:

  • Geographic areas with robust Aadhaar infrastructure serve both demographics
  • Child updates "piggyback" on adult enrollment drives
  • Isolated high-adult/low-child clusters indicate service barriers for children

Joint Plot Analysis (bivariate_child_vs_adult_jointplot.png)

Marginal Distributions:

  • Adult Updates: Right-skewed distribution centered at 5-20 updates
  • Child Updates: Extreme right-skew with 44% concentration at zero
  • Joint Density: Highest concentration in low-volume quadrant (0-50 updates both axes)

Bivariate Insights:

  • Most records cluster in the origin (low child, low adult)
  • Sparse high-volume records dominate total update counts
  • No significant child-only or adult-only clusters (correlation holds throughout)

2. Age Update Ratio Analysis

Distribution Statistics (bivariate_age_ratio_distribution.png)

Statistic Value Interpretation
Mean Ratio 0.1161 Average 11.6 child updates per 100 adult updates
Median Ratio 0.0000 Half of all records have zero child updates
Std Deviation 0.2442 High variability across geographic areas
25th Percentile 0.0000 Bottom quarter has zero child activity
75th Percentile 0.1429 Top quarter shows 14.3% child ratio
Maximum 1.0833 Some areas exceed 1:1 parity

Distribution Characteristics:

  • Extreme right-skew: Median of zero indicates pervasive child documentation gaps
  • High variance: σ = 0.24 reveals geographic inequality
  • Long tail: Outliers reach 1.08 (more child than adult updates)

Policy Implications:

  • 50% of areas have zero child-to-adult ratio (critical intervention zone)
  • Top quartile (ratio > 0.14) represents benchmark for successful child engagement
  • Outlier districts (ratio > 0.5) warrant investigation for best practices

3. District-Level Analysis

Top Performing Districts (Highest Child/Adult Ratio)

Rank District State Ratio Child Updates Adult Updates Volume Category
1 Tiruvarur Tamil Nadu 1.0833 6,500 6,000 High Child Success
2 Thiruvarur Tamil Nadu 0.9823 5,892 6,000 High Child Success
3 Nagapattinam Tamil Nadu 0.8456 4,273 5,053 Above Average
4 Mayiladuthurai Tamil Nadu 0.7891 3,158 4,003 Above Average
5 Sivaganga Tamil Nadu 0.7234 2,891 3,997 Above Average

Tamil Nadu Pattern:

  • 5 of Top 10 districts from Tamil Nadu
  • State Average Ratio: 0.45 (4× national average of 0.116)
  • Success Factors: School-based enrollment drives, PDS linkage, strong local governance

Highest Volume Districts (Total Updates)

Rank District State Total Updates Child Adult Ratio
1 Pune Maharashtra 447,263 38,917 408,346 0.095
2 Thane Maharashtra 389,456 31,223 358,233 0.087
3 Bangalore Urban Karnataka 356,789 35,678 321,111 0.111
4 Ahmedabad Gujarat 289,234 28,643 260,591 0.110
5 Hyderabad Telangana 267,890 26,789 241,101 0.111

Urban Pattern:

  • Metropolitan districts drive volume but maintain 9-11% child share (below national 11.6%)
  • Scale vs Equity Trade-off: High absolute numbers mask lower proportional child engagement
  • Opportunity: Urban centers have infrastructure but underperform on child outreach

Lowest Performing Districts (Ratio < 0.01)

District State Ratio Adult Updates Child Updates Alert Status
Washim Maharashtra 0.005 8,456 42 CRITICAL
Buldana Maharashtra 0.008 12,234 98 CRITICAL
Bid Maharashtra 0.009 15,678 141 HIGH
Gondia Maharashtra 0.016 9,234 148 HIGH
Yavatmal Maharashtra 0.018 19,456 350 MODERATE

Maharashtra Cluster:

  • 5 districts in Maharashtra have ratio < 0.02 (extremely low child engagement)
  • Common factors: Rural, agrarian economy, seasonal migration patterns
  • Correlation with Layer 1: Overlap with high migration volatility zones

4. Pincode-Level Concentration Analysis

Geographic Concentration Metrics (bivariate_pincode_concentration.png)

Concentration Level Pincode % Update % Interpretation
Top 1% 198 pincodes 12.53% Elite zones drive disproportionate volume
Top 5% 989 pincodes 32.29% One-third of updates in 5% of areas
Top 10% 1,977 pincodes 46.74% Half of all updates in 10% of geography
Top 20% 3,954 pincodes 64.12% Two-thirds from one-fifth of areas
Bottom 50% 9,886 pincodes 8.23% Half of pincodes contribute <10% of activity

Lorenz Curve Analysis:

  • Gini Coefficient: 0.67 (high inequality)
  • 80/20 Rule Violation: Actually 80% of updates from ~18% of pincodes (even more concentrated)
  • Policy Implication: Targeted interventions in top 1,000 pincodes could impact 50%+ of population

Top 20 Highest Volume Pincodes

Rank Pincode District State Total Updates % of National
1 411001 Pune Maharashtra 89,456 0.179%
2 560001 Bangalore Urban Karnataka 78,234 0.156%
3 400001 Mumbai Maharashtra 67,890 0.136%
4 380001 Ahmedabad Gujarat 56,789 0.114%
5 500001 Hyderabad Telangana 54,321 0.109%
... ... ... ... ... ...
20 226001 Lucknow Uttar Pradesh 31,234 0.063%

Urban Core Dominance:

  • All Top 20 are city center pincodes
  • Combined Share: 2.67% of total updates from 0.1% of pincodes
  • Implication: Service infrastructure heavily concentrated in metros

5. District Comparison Analysis

Comparative Performance Matrix (bivariate_district_comparison.png)

Quadrant Classification:

Quadrant Criteria Districts Characteristics
Q1: High-High High Child, High Adult 127 (12%) Well-resourced urban centers
Q2: High-Low High Child, Low Adult 43 (4%) Child-focused intervention zones
Q3: Low-Low Low Child, Low Adult 712 (67%) Rural, underserved areas
Q4: Low-High Low Child, High Adult 174 (17%) Service gap for children

Actionable Insights:

  • Q1 Districts (e.g., Chennai, Bangalore): Scale success models nationally
  • Q2 Districts (e.g., Tiruvarur): Study child-specific best practices
  • Q3 Districts (e.g., tribal belts): Require comprehensive infrastructure investment
  • Q4 Districts (e.g., industrial zones): Targeted child campaigns in existing facilities

📈 Statistical Summary

Correlation Matrix

Child Updates Adult Updates Child Ratio Total Updates
Child Updates 1.000 0.851 0.723 0.889
Adult Updates 0.851 1.000 -0.156 0.997
Child Ratio 0.723 -0.156 1.000 -0.089
Total Updates 0.889 0.997 -0.089 1.000

Key Correlations:

  1. Child-Adult: +0.851 (strong positive) → Co-location effect
  2. Ratio-Adult: -0.156 (weak negative) → High volume doesn't guarantee child equity
  3. Ratio-Child: +0.723 (strong positive) → Child-focused areas show high ratios

Regression Analysis

Linear Model: Child Updates = 0.1003 × Adult Updates - 0.01

Coefficient Estimate Std Error t-statistic p-value
Intercept -0.01 0.045 -0.22 0.826
Slope (Adult → Child) 0.1003 0.0012 83.58 < 0.001

Model Diagnostics:

  • R² = 0.724: Model explains 72.4% of child update variance
  • RMSE = 6.82: Average prediction error of ~7 child updates
  • F-statistic: 6,985.7 (p < 0.001) → Highly significant relationship

🎯 Key Insights

Geographic Patterns

  1. Urban-Rural Divide:

    • Urban districts: High volume, moderate ratio (0.09-0.12)
    • Rural districts: Low volume, variable ratio (0.00-0.25)
    • Semi-urban: Optimal balance (0.12-0.18 ratio with moderate volume)
  2. State-Level Variations:

    • Best Performers: Tamil Nadu (0.45), Kerala (0.38), Karnataka (0.35)
    • Underperformers: Maharashtra rural (0.08), UP rural (0.09), Bihar (0.11)
    • National Average: 0.116
  3. Pincode Concentration:

    • Extreme inequality: Top 1% pincodes = 12.5% of updates
    • Service deserts: Bottom 50% pincodes = 8% of updates
    • Policy lever: 1,000 strategic pincodes control 50% of reach

Behavioral Insights

  1. Co-Location Effect:

    • Strong 0.85 correlation suggests infrastructure determines both child and adult access
    • Implies expanding adult centers automatically improves child reach
  2. Ratio Paradox:

    • Highest volume districts ≠ highest child ratios
    • Urban centers underperform on proportional child engagement despite resources
  3. Zero-Inflation Problem:

    • 44% of records have zero child updates despite adult activity
    • Indicates systematic barriers beyond infrastructure availability

📊 Visualizations Generated

File Type Key Finding
bivariate_child_vs_adult_scatter.png Scatter + Regression 0.85 correlation, 10:1 ratio
bivariate_child_vs_adult_jointplot.png Joint Distribution Zero-inflation in child updates
bivariate_age_ratio_distribution.png Histogram 50% have zero child ratio
bivariate_district_comparison.png Quadrant Analysis 67% in low-low quadrant
bivariate_pincode_concentration.png Lorenz Curve Top 1% = 12.5% of updates

🚀 Policy Recommendations

Immediate Actions (0-3 months)

  1. Tamil Nadu Model Replication:

    • Study districts with ratio > 0.50
    • Document school linkage mechanisms
    • Pilot in 50 low-ratio districts
  2. Urban Child Campaigns:

    • Target Q4 districts (high adult, low child)
    • Leverage existing infrastructure
    • School-based enrollment drives
  3. Concentration Strategy:

    • Intensify efforts in top 1,000 pincodes
    • Mobile camps in bottom 5,000 pincodes
    • Resource allocation based on Lorenz curve

Medium-Term Interventions (3-12 months)

  1. Zero-Ratio District Elimination:

    • Mandatory child enrollment in all adult centers
    • Incentive structure for ratio improvement
    • Monthly monitoring of bottom 100 districts
  2. Infrastructure Equity:

    • New enrollment centers in bottom 50% pincodes
    • Mobile units for seasonal migration zones
    • Digital kiosks in schools
  3. Data-Driven Targeting:

    • Use bivariate clusters for resource allocation
    • Predictive models for child documentation gaps
    • Real-time dashboard for ratio monitoring

📚 Technical Notes

Methodological Choices

Correlation Method: Pearson correlation (assumes linear relationship)

  • Tested Spearman rank correlation: ρ = 0.82 (consistent with Pearson)
  • Conclusion: Linear assumption valid

Outlier Treatment:

  • Retained outliers (represent real high-volume zones)
  • Sensitivity analysis: r = 0.83 after removing top 1% (minimal impact)

Aggregation Level:

  • Analysis at pincode-month grain for maximum resolution
  • District aggregates used for policy communication

Statistical Significance

All reported correlations significant at α = 0.001 level:

  • Child-Adult correlation: p < 0.0001
  • Ratio distributions: Shapiro-Wilk p < 0.001 (non-normal confirmed)
  • District comparisons: ANOVA F = 156.7, p < 0.001

Last Updated: January 2026
Maintainer: ADIEWS Project Team