nuemaan/aadhaar-biometric-forecast

Project Report: Aadhaar Biometric Forecast

Predicting Mandatory Update Demand Among Youth (Age 15)

1. Executive Summary

This project developed a machine learning model to forecast the volume of mandatory biometric updates for 15-year-olds across Indian states. By analyzing historical enrolment and update trends, we built a Gradient Boosting Regressor that reaches an R² of about 0.70 on held-out data. The model flags likely bottlenecks in advance (e.g., a predicted surge in Uttar Pradesh in Jan 2026), so policymakers can allocate resources proactively rather than reactively.


2. Problem Statement

  • The Challenge: Every Indian resident must update their biometrics (fingerprint, iris, photo) upon turning 15. Failure to do so leads to authentication failures.
  • The Gap: Current administrative planning is often reactive. There is no predictive tool to estimate when and where teenagers will show up for updates.
  • The Goal: Build a time-series forecasting model to predict monthly biometric update volume by state for the next quarter.

3. Methodology: How We Did It

A. Data Strategy

We utilized official Open Data from UIDAI, specifically:

  1. Biometric Update Dataset: The target variable (specifically bio_age_5_17).
  2. Aadhaar Enrolment Dataset: The predictor signal (population density of age_5_17).
  3. Demographic Update Dataset: Tested as a leading indicator (discarded later due to noise).
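To make the pairing of target and predictor concrete, here is a minimal sketch of how the two datasets might be aligned on a shared key. The column names `bio_age_5_17` and `age_5_17` come from the report; the key columns `state`, `year`, `month` and all the numbers are assumptions made up for illustration, not the real UIDAI schema or data.

```python
import pandas as pd

# Toy stand-ins for the UIDAI files; the values are invented for illustration.
biometric = pd.DataFrame({
    "state": ["Uttar Pradesh", "Uttar Pradesh", "Kerala"],
    "year": [2025, 2025, 2025],
    "month": [1, 2, 1],
    "bio_age_5_17": [1200, 1500, 300],   # target: biometric updates, ages 5-17
})
enrolment = pd.DataFrame({
    "state": ["Uttar Pradesh", "Uttar Pradesh", "Kerala"],
    "year": [2025, 2025, 2025],
    "month": [1, 2, 1],
    "age_5_17": [5000, 5200, 900],       # predictor: enrolments, ages 5-17
})

# A left join keeps every target row even if the predictor is missing for it.
# validate="one_to_one" raises if either side has duplicate keys.
keys = ["state", "year", "month"]
panel = biometric.merge(enrolment, on=keys, how="left", validate="one_to_one")
```

The `validate` argument is a cheap guard: if aggregation left duplicate state-month rows in either table, the merge fails loudly instead of silently multiplying rows.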

B. Data Preparation

  • Aggregation: We aggregated millions of raw rows into a structured time-series format: [State, Month, Year, Count].
  • Cleaning: Handled missing dates and aligned inconsistent time formats.
  • Feature Engineering: This was the critical success factor. We created "Lag Features" (Memory), teaching the model that updates last month are the strongest predictor of updates this month.
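The aggregation and lag steps above can be sketched with pandas as follows. This is an illustration under assumptions, not the project's actual pipeline: the raw rows, the `date` and `state` column names, and the state codes are invented; only `bio_age_5_17` and `updates_last_month` are names used in the report.

```python
import pandas as pd

# Invented raw rows; real files would have many rows per state and month.
raw = pd.DataFrame({
    "state": ["UP", "UP", "UP", "UP", "KL", "KL", "KL"],
    "date": ["2025-01-05", "2025-01-20", "2025-02-11", "2025-03-02",
             "2025-01-09", "2025-02-14", "2025-03-30"],
    "bio_age_5_17": [100, 150, 300, 280, 40, 55, 60],
})
raw["date"] = pd.to_datetime(raw["date"])
raw["period"] = raw["date"].dt.to_period("M")

# Aggregate to one row per state and month.
monthly = (
    raw.groupby(["state", "period"], as_index=False)["bio_age_5_17"].sum()
       .sort_values(["state", "period"])
       .reset_index(drop=True)
)

# Lag feature: the previous row's count within the same state.
# This equals "last month" only if no month is missing for that state.
monthly["updates_last_month"] = monthly.groupby("state")["bio_age_5_17"].shift(1)

# The first month of each state has no history, so drop it before training.
train_ready = monthly.dropna(subset=["updates_last_month"])
```

Grouping by state before shifting matters: a plain `shift(1)` on the whole column would leak the last month of one state into the first month of the next.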

4. The Experiment Log: What Worked vs. What Failed

We followed a rigorous scientific process, iterating through 9 different versions (V1–V9).

| Experiment | Approach | Result (R²) | Verdict |
|---|---|---|---|
| V1–V2 | Baseline Random Forest. Used only Month & Population counts. | ~0.45 | Failed. The model couldn't predict "spikes" because it had no memory of recent events. |
| V3 | Lag Features (Memory). Added updates_last_month as a feature. | ~0.69 | Success. Huge jump in accuracy. The model learned "momentum." |
| V4 | Rolling Averages. Used 3-month average instead of 1-month lag. | ~0.62 | Failed. Averaging "smoothed out" the data too much, missing the sharp spikes. |
| V5 | Gradient Boosting. Switched from Random Forest to Histogram Gradient Boosting. | ~0.70 | Winner. The new engine squeezed out maximum accuracy. |
| V7 | Demographic Data. Added demographic updates as a signal. | ~0.67 | Failed. Added more noise than signal. Complexity reduced accuracy. |
| V8 | District-Level Granularity. Modeled ~750 districts instead of ~30 states. | ~0.64 | Failed. Local data was too volatile/noisy to predict reliably. |
| V9 | Hyperparameter Tuning. Grid Search for optimal settings. | 0.65 | Failed. Confirmed that our V5 baseline settings were already optimal. |
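The V3 versus V4 contrast can be illustrated with a small worked example. The sketch below uses plain Python and invented counts. It assumes the 3-month average was taken over the three months before the one being predicted, which the report does not specify.

```python
def lag_feature(series, k=1):
    """Value k steps earlier; None where there is not enough history."""
    return [series[i - k] if i >= k else None for i in range(len(series))]


def rolling_mean_feature(series, window=3):
    """Mean of the `window` values before each step (current value excluded)."""
    out = []
    for i in range(len(series)):
        if i < window:
            out.append(None)
        else:
            out.append(sum(series[i - window:i]) / window)
    return out


# Invented monthly counts: a spike starts at index 4 and persists at index 5.
counts = [100, 100, 100, 100, 400, 420, 110]

lag1 = lag_feature(counts)
roll3 = rolling_mean_feature(counts)
```

At index 5 the actual count is 420. The 1-month lag feature reads 400, while the 3-month mean reads 200, because two pre-spike months dilute it. That is the "smoothing out" effect described for V4.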


5. The Solution

We selected Model V5 as the final production model.

  • Algorithm: HistGradientBoostingRegressor (scikit-learn's histogram-based gradient boosting estimator, inspired by LightGBM).

  • Key Features:
      • updates_last_month: The primary driver (Momentum).
      • age_5_17: The population base (Capacity).
      • month: Seasonality (School holidays/Exam cycles).

  • Validation Strategy: Time-series split (Training on past, Testing on "future" unseen data).

  • Final Score: R² ≈ 0.70 on the held-out period. We consider this a strong result for forecasting human behavior, which is inherently noisy.


6. Key Findings & Actionable Insights

A. Bottleneck Alert (Jan 2026)

The model forecasts a massive surge in update demand for Q1 2026.

  • Hotspot: Uttar Pradesh.
  • Projected Volume: >600,000 updates required in January alone.
  • Recommendation: Deploy 30% of mobile enrolment units to UP districts immediately to prevent overcrowding.
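A quarterly forecast built on a one-month lag feature needs some way to supply `updates_last_month` for the second and third months, which have not happened yet. The report does not say how this was handled. One common option is recursive forecasting, sketched below. The function `forecast_quarter`, the stand-in `toy_model`, and all numbers are hypothetical.

```python
def forecast_quarter(predict_one, last_count, population, start_month, horizon=3):
    """Recursively forecast `horizon` months, feeding each prediction back as the lag."""
    forecasts = []
    lag = last_count
    month = start_month
    for _ in range(horizon):
        pred = predict_one(lag, population, month)
        forecasts.append((month, pred))
        lag = pred                  # the prediction becomes next month's lag
        month = month % 12 + 1      # wrap December (12) back to January (1)
    return forecasts


# Hypothetical stand-in for a fitted model: 90% of last month's count,
# with a 20% bump in January. It ignores the population argument.
def toy_model(updates_last_month, age_5_17, month):
    bump = 1.2 if month == 1 else 1.0
    return 0.9 * updates_last_month * bump


quarter = forecast_quarter(toy_model, last_count=1000.0,
                           population=50_000, start_month=12)
```

With a real fitted model, `predict_one` would wrap a call to `model.predict` on a single-row feature frame. Errors compound under this scheme, so the third month is usually less reliable than the first.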

7. Conclusion

This project demonstrates that machine learning can effectively guide Aadhaar administrative planning. By shifting from reactive counting to proactive forecasting, UIDAI can ensure smoother service delivery for millions of Indian teenagers.

While we explored granular district-level modeling (R² ≈ 0.64), we concluded that state-level forecasting (R² ≈ 0.70) offers the best balance of accuracy and reliability for national resource allocation.
