Skip to content

shouviksarkar123/Healthcare_Readmission_AI

Repository files navigation

Predicting 30-day Hospital Readmission Risk using Databricks

Sponsor: Databricks Organisers: Indian Data Club (IDC) & Codebasics

Hashtags:

ResumeProjectChallenge #DatabricksWithIDC #Codebasics

#๐Ÿ“ŒOverview

Hospital readmissions within 30 days significantly increase healthcare costs and indicate gaps in post-discharge care. This project builds an end-to-end AI-powered decision support system that predicts whether a diabetic patient is at high risk of hospital readmission within 30 days at the time of discharge. The solution is designed and implemented entirely on Databricks using Medallion Architecture, Delta Lake, MLflow, and SQL analytics.

#๐Ÿ“ฝ๏ธ Project Walkthrough

โ–ถ๏ธ Video (10 mins, unlisted): https://youtu.be/H9m4L_l98YE

#๐Ÿ“Š Presentation Deck

๐Ÿš€ Live Dashboard Access

๐Ÿ”— Healthcare Readmission AI โ€“ Interactive Dashboard:
๐Ÿ‘‰ https://healthcare-readmission-app-fvdrqj4vgjnvdp2qxfjje7.streamlit.app/#healthcare-readmission-dashboard-analysis LIVE DASHBOARD Live Dashboard view


An executive-ready dashboard built on Databricks that visualizes patient readmission risk, utilization patterns, and clinical drivers to support proactive healthcare decisions.


#Problem Statement

Hospitals struggle to identify which patients are likely to be readmitted within 30 days after discharge.

The challenge is to:

  • Predict 30-day readmission risk accurately
  • Avoid data leakage
  • Translate predictions into actionable risk categories
  • Enable proactive care decisions instead of reactive treatment #Why AI (Instead of Rule-Based Systems)

Rule-based systems fail to capture complex patient behavior and interactions between multiple clinical and utilization factors.

Machine Learning is required to:

  • Learn patterns from historical patient data
  • Generalize risk beyond fixed thresholds
  • Adapt to complex, non-linear relationships #Dataset Description

README: Diabetes 130-US Hospitals (1999โ€“2008)

๐Ÿ“Œ Dataset Overview

This dataset represents ten years of clinical care for patients diagnosed with diabetes. Each record corresponds to a hospital admission (inpatient encounter) where laboratory tests and medications were administered, with a stay between 1โ€“14 days.


๐ŸŽฏ Objective

The primary goal is to predict early readmission within 30 days of discharge.
This is important because:

  • Poor diabetes management increases hospital costs.
  • Readmissions affect patient morbidity and mortality.
  • Proper glycemic control and preventive care can reduce complications.

๐Ÿ“‚ Meta Data

  • diabetic_data.csv (18.3 MB) โ†’ Main dataset
  • IDS_mapping.csv (2.5 KB) โ†’ Mapping of categorical IDs to descriptive values

๐Ÿงพ Data Characteristics

  • Type: Multivariate
  • Tasks: Classification, Clustering
  • Sensitive Attributes: Age, Gender, Race

Example Features

Feature Type Description
encounter_id ID Unique identifier of encounter
patient_nbr ID Unique identifier of patient
race Categorical Caucasian, Asian, African American, Hispanic, Other
gender Categorical Male, Female, Unknown
age Categorical Grouped in 10-year intervals
admission_type_id Categorical Emergency, urgent, elective, newborn, etc.
discharge_disposition_id Categorical 29 values (home, expired, etc.)
time_in_hospital Integer Number of days admitted
num_lab_procedures Integer Number of lab tests performed
HbA1c_result Categorical Glycated hemoglobin test result
num_medications Integer Number of medications prescribed

โš ๏ธ Missing Values

  • Some attributes (e.g., race, weight) contain missing values.
  • Preprocessing is required before applying ML models.

๐Ÿ“– Reference Paper

  • Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records
    Authors: Beata Strack, Jonathan DeShazo, Chris Gennings, Juan Olmo, Sebastian Ventura, Krzysztof Cios, John Clore
    Published: BioMed Research International, 2014

โœ… Usage Notes

  • Recommended splits: Train/Test or Train/Validation/Test.
  • Suitable for tasks like predictive modeling, risk stratification, and healthcare analytics.
  • Handle sensitive attributes responsibly (age, race, gender).

Solution Architecture (Medallion Design)


Bronze Layer

  • Raw, unchanged dataset ingestion
  • Source of truth

Silver Layer

  • Data cleaning and preprocessing
  • Leakage removal
  • Feature engineering
  • ML-ready dataset

Gold Layer

  • Business-focused features
  • Risk segmentation (High / Medium / Low)
  • Decision-ready tables

ML Layer

  • Logistic Regression model
  • MLflow experiment tracking

Analytics Layer

  • Databricks SQL dashboards
  • Risk and utilization insights

Technology Stack


  • Databricks Community Edition
  • Delta Lake (ACID, versioning)
  • PySpark
  • MLflow
  • Databricks SQL
  • Streamlit (Web deployment)

Feature Engineering Highlights

Key engineered features include:

  • Age buckets for interpretability
  • Hospital utilization score combining visit counts
  • Treatment change flag indicating instability
  • Leakage-free target label (readmit_30d)

Machine Learning Approach

  • Problem Type: Binary Classification
  • Model Used: Logistic Regression
  • Reason: Explainability, reliability, healthcare suitability
  • Evaluation Metric: AUC (handles class imbalance)

Experiment Tracking with MLflow

MLflow is used to:

  • Track model parameters
  • Log evaluation metrics
  • Maintain reproducibility
  • Enable model comparison

Analytics & Dashboard

Databricks SQL dashboards provide:

  • Risk distribution across patients
  • Risk by age group
  • Relationship between utilization and readmission
  • Executive-level summary insights WhatsApp Image 2026-01-31 at 20 33 33 (1)

Buisness Impact

This system enables hospitals to:

  • Identify high-risk patients early
  • Prioritize post-discharge care
  • Reduce avoidable readmissions
  • Optimize resource planning

#๐Ÿ“‚ Project Structure Healthcare_Readmission_AI โ”‚

โ”œโ”€โ”€ 00_DOCS

โ”œโ”€โ”€ 01_bronze_ingestion

โ”œโ”€โ”€ 02_silver_processing

โ”œโ”€โ”€ 03_gold_features

โ”œโ”€โ”€ 04_ml_mlflow

โ”œโ”€โ”€ 05_sql_analytics

โ”œโ”€โ”€ 06_jobs_orchestration

โ”œโ”€โ”€ Dashboard

โ””โ”€โ”€ README.md Floder Structure


Model Evaluation

The moderate AUC score reflects the inherent complexity and noise in real-world healthcare data, reinforcing the importance of combining ML outputs with domain-driven decision rules. Experiments

Results:

  • ROC-AUC: ~0.60
  • Accuracy: ~0.89 All experiments, parameters, and metrics were tracked using MLflow to ensure reproducibility and auditability.

AI Decision System (Beyond Just Prediction)

Instead of stopping at raw prediction probabilities, this project implements a decision layer on top of the ML model.

Model outputs are transformed into actionable risk categories:

  • Low Risk: No immediate intervention required
  • Medium Risk: Follow-up monitoring recommended
  • High Risk: Proactive care planning suggested

Risk buckets are derived using quantile-based thresholds on predicted readmission probability, ensuring stable and interpretable segmentation.

This approach bridges the gap between machine learning outputs and real-world clinical decision-making, enabling healthcare teams to prioritize patients effectively rather than interpreting raw model scores.


#Orchestration

The pipeline is automated using Databricks Jobs with task dependencies: Bronze โ†’ Silver โ†’ Gold โ†’ ML Training jobs  pipeline


#Challenge Requirements Compliance

This project was developed as part of the Codebasics Resume Project Challenge, sponsored by Databricks and organised by Indian Data Club (IDC) and Codebasics.

This project satisfies all challenge requirements by:

  • Framing a real-world healthcare decision problem
  • Designing a Medallion-based data architecture
  • Implementing ML with experiment tracking
  • Translating predictions into business actions
  • Delivering dashboards and documentation

๐Ÿ”ฎ Future Improvements

This project establishes a strong end-to-end foundation for hospital readmission risk prediction.
Future enhancements can further improve accuracy, scalability, and real-world adoption:

  • Advanced Models: Experiment with tree-based models (Random Forest, XGBoost) or ensemble methods to capture non-linear patient risk patterns.

  • Feature Expansion: Incorporate additional clinical signals such as lab trends, diagnosis history, and medication sequences for richer patient profiling.

  • Time-Aware Modeling: Move from static features to temporal models that account for patient history across multiple encounters.

  • Threshold Optimization: Tune decision thresholds based on hospital capacity, cost sensitivity, or patient risk tolerance.

  • Model Explainability: Integrate SHAP or feature attribution techniques to provide clinician-friendly explanations for predictions.

  • Continuous Training: Enable scheduled retraining using Databricks Jobs to keep the model updated with new data.

  • Governance & Monitoring: Add model drift detection, data quality checks, and audit logging for production-grade reliability.

  • Clinical Workflow Integration: Embed predictions into hospital systems or care management tools for real-time discharge decision support.


๐Ÿ‘ค Author & Project Owner

Shouvik Sarkar
Aspiring Data Engineer | Data Scientist
Specialization: Healthcare Readmission AI, Databricks, ML Pipelines

๐Ÿ”— LinkedIn Profile:
www.linkedin.com/in/shouvik-sarkar-619782279

๐Ÿ“ฉ Open to collaboration, feedback, and opportunities in Data Engineering & AI-driven Healthcare Analytics.


About

AI-Driven Patient Readmission Risk Platform

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors