This project performs an in-depth Exploratory Data Analysis (EDA) on the Pima Indians Diabetes Dataset. The goal is to investigate the relationship between various health metrics (like Glucose, BMI, and Age) and the onset of diabetes.
By analyzing the data distribution and correlations, we identify which factors are the strongest predictors of the disease.
The dataset includes several medical predictor variables and one target variable (Outcome):
- Pregnancies: Number of times pregnant.
- Glucose: Plasma glucose concentration 2 hours into an oral glucose tolerance test.
- BloodPressure: Diastolic blood pressure (mm Hg).
- SkinThickness: Triceps skin fold thickness (mm).
- Insulin: 2-hour serum insulin (μU/mL).
- BMI: Body mass index (weight in kg/(height in m)^2).
- DiabetesPedigreeFunction: Diabetes likelihood based on family history.
- Age: Age in years.
- Outcome: Class variable (0 = Non-diabetic, 1 = Diabetic).
- Data Cleaning: Identifying and handling missing values, which are recorded as zeros in columns such as Glucose, Insulin, and BloodPressure.
- Descriptive Statistics: Summary of mean, median, and variance across health metrics.
- Distribution Analysis: Using Histograms and KDE plots to see the spread of the data.
- Outlier Detection: Using Boxplots to identify extreme health readings.
- Correlation Mapping: Using Heatmaps to see how features like BMI and Glucose relate to the Outcome.
- Class Balance: Checking the ratio of Diabetic vs. Non-diabetic cases.
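
The cleaning, outlier, and class-balance steps above can be sketched roughly as follows. This is a minimal illustration and not necessarily what the notebook does: the helper names (`clean_zeros`, `iqr_outliers`, `class_balance`), the list of columns treated as zero-means-missing, and the choice of median imputation are all assumptions, and the rows in `toy` are made up for the example rather than taken from `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Assumption: a zero in these columns is physiologically implausible,
# so it is treated as a missing value rather than a real reading.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean_zeros(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with implausible zeros set to NaN, then filled with the column median."""
    out = df.copy()
    cols = [c for c in ZERO_AS_MISSING if c in out.columns]
    out[cols] = out[cols].astype(float).replace(0.0, np.nan)
    out[cols] = out[cols].fillna(out[cols].median())
    return out


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def class_balance(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Share of rows in each class, indexed by class label."""
    return df[target].value_counts(normalize=True).sort_index()


# Toy rows, invented for illustration only (not real dataset values).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

cleaned = clean_zeros(toy)      # zeros in Glucose/Insulin replaced by medians
balance = class_balance(toy)    # fraction of rows per Outcome class
```

Median imputation is used here only because it is simple and robust to extreme readings. For a column with as many zeros as Insulin, a different strategy may be more appropriate.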
- Language: Python
- Libraries:
  - Pandas (data cleaning)
  - NumPy (mathematical operations)
  - Matplotlib & Seaborn (visualizations)
- Glucose & BMI: These two features show the strongest positive correlations with the Outcome variable (a diabetes diagnosis).
- Age Factor: Older individuals in this dataset show a higher frequency of being diabetic.
- Insulin Levels: A significant number of missing values (zeros) were found in the Insulin column, requiring specific data imputation strategies.
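
One way to check findings like these is to rank features by their correlation with `Outcome` and to measure how many zeros a column contains. The sketch below assumes Pearson correlation via `DataFrame.corr()`. The helper names `rank_by_outcome_correlation` and `zero_share` are hypothetical, and the `toy` rows are invented, so the numbers it produces do not reproduce the results above.

```python
import pandas as pd


def rank_by_outcome_correlation(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest positive first."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


def zero_share(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows where the column is exactly zero (treated as missing)."""
    return float((df[column] == 0).mean())


# Toy rows, invented for illustration only (not real dataset values).
toy = pd.DataFrame({
    "Glucose": [80, 90, 150, 160],
    "Age": [30, 50, 25, 45],
    "Insulin": [0, 0, 94, 0],
    "Outcome": [0, 0, 1, 1],
})

ranked = rank_by_outcome_correlation(toy)    # Series indexed by feature name
insulin_zeros = zero_share(toy, "Insulin")   # share of zero readings in Insulin
```

On the real data, the same correlation matrix can be passed to `seaborn.heatmap(df.corr(), annot=True)` to draw the heatmap view.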
├── diabetes.csv          # Raw dataset
├── EDA_Diabetes.ipynb    # Main Jupyter Notebook
├── requirements.txt      # List of dependencies
└── README.md             # Project documentation
- Clone the repo:
git clone https://github.com/Akshat8510/EDA-on-Diabetes_Dataset.git
- Install libraries:
pip install pandas seaborn matplotlib numpy
Developed by Akshat as part of a Data Science portfolio.