Analyze, visualize, and build predictive models on the Diabetes dataset using NumPy, Pandas, Matplotlib, Seaborn, and Scikit-Learn.

Akshat8510/EDA-on-Diabetes_Dataset

🩺 Exploratory Data Analysis (EDA) on Diabetes Dataset

Python Pandas Data Viz

📌 Project Overview

This project performs an in-depth Exploratory Data Analysis (EDA) on the Pima Indians Diabetes Dataset. The goal is to investigate the relationship between various health metrics (like Glucose, BMI, and Age) and the onset of diabetes.

By analyzing the data distribution and correlations, we identify which factors are the strongest predictors of the disease.

📊 Dataset Features

The dataset includes several medical predictor variables and one target variable (Outcome):

  • Pregnancies: Number of times pregnant.
  • Glucose: Plasma glucose concentration.
  • BloodPressure: Diastolic blood pressure (mm Hg).
  • SkinThickness: Triceps skin fold thickness (mm).
  • Insulin: 2-hour serum insulin (mu U/ml).
  • BMI: Body mass index (weight in kg/(height in m)^2).
  • DiabetesPedigreeFunction: Diabetes likelihood based on family history.
  • Age: Age in years.
  • Outcome: Class variable (0 = Non-diabetic, 1 = Diabetic).
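Before any analysis, it can help to confirm that a loaded file actually contains the columns listed above. The following is a minimal sketch, not the notebook's own code: the helper name `missing_columns` and the one-row sample frame are hypothetical, and the sample stands in for `pd.read_csv("diabetes.csv")`.

```python
import pandas as pd

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def missing_columns(df):
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up single row standing in for pd.read_csv("diabetes.csv").
sample = pd.DataFrame(
    [[1, 120, 70, 20, 80, 28.5, 0.35, 33, 0]],
    columns=EXPECTED_COLUMNS,
)

print(missing_columns(sample))
```

An empty list means every expected column is present; any names returned point to a schema mismatch worth resolving before plotting.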

🚀 Key Analysis Steps

  1. Data Cleaning: Identifying and handling missing values, which appear as zeros in columns such as Glucose, Insulin, and BloodPressure.
  2. Descriptive Statistics: Summary of mean, median, and variance across health metrics.
  3. Distribution Analysis: Using Histograms and KDE plots to see the spread of the data.
  4. Outlier Detection: Using Boxplots to identify extreme health readings.
  5. Correlation Mapping: Using Heatmaps to see how features like BMI and Glucose relate to the Outcome.
  6. Class Balance: Checking the ratio of Diabetic vs. Non-diabetic cases.
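The cleaning and class-balance steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's actual code: the helper names, the choice of median imputation, and the four-row toy frame are all hypothetical, and the list of zero-as-missing columns is taken from step 1.

```python
import numpy as np
import pandas as pd

# Columns where a zero is treated as a missing reading (per step 1).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "Insulin"]


def mark_zeros_missing(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with zeros in the given columns replaced by NaN."""
    out = df.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


def impute_median(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with NaNs filled by each column's median.

    Median imputation is one assumed strategy among several possible ones.
    """
    out = df.copy()
    out[columns] = out[columns].fillna(out[columns].median())
    return out


# Made-up rows for illustration only; not taken from diabetes.csv.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "Insulin": [0, 0, 94, 168],
    "Outcome": [1, 0, 1, 0],
})

cleaned = mark_zeros_missing(toy)
print(cleaned.isna().sum())            # missing values per column

imputed = impute_median(cleaned)
print(imputed["Outcome"].value_counts(normalize=True))  # class balance
```

Converting zeros to NaN first keeps them out of the median, so the imputed value reflects only real readings.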

🛠️ Tech Stack

  • Language: Python
  • Libraries:
    • Pandas (Data Cleaning)
    • NumPy (Mathematical Operations)
    • Matplotlib & Seaborn (Visualizations)

📈 Key Insights (Sample)

  • Glucose & BMI: Show the strongest positive correlations with a diabetic Outcome (Outcome = 1).
  • Age Factor: Older individuals in this dataset show a higher frequency of being diabetic.
  • Insulin Levels: A significant number of missing values (zeros) were found in the Insulin column, requiring specific data imputation strategies.
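One way to check a claim like the Glucose and BMI insight is to rank features by their Pearson correlation with Outcome. The sketch below assumes all columns are numeric; the function name `rank_by_outcome_correlation` and the six-row toy frame are hypothetical and do not reproduce the project's results.

```python
import pandas as pd


def rank_by_outcome_correlation(df, target="Outcome"):
    """Return each feature's Pearson correlation with target, highest first."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up rows for illustration only; not taken from diabetes.csv.
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 95, 170],
    "Age": [25, 50, 30, 45, 35, 40],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

ranked = rank_by_outcome_correlation(toy)
print(ranked)
```

The same correlation matrix can be drawn as a heatmap with `seaborn.heatmap(df.corr(), annot=True)`. Correlation here measures linear association only, so it is a starting point for the analysis, not evidence of causation.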

📂 Project Structure

├── diabetes.csv         # Raw dataset
├── EDA_Diabetes.ipynb   # Main Jupyter Notebook
├── requirements.txt     # List of dependencies
└── README.md            # Project documentation

⚙️ Installation

  1. Clone the repo:
    git clone https://github.com/Akshat8510/EDA-on-Diabetes_Dataset.git
  2. Install libraries:
    pip install pandas seaborn matplotlib numpy

Developed by Akshat as part of a Data Science portfolio.
