This project implements a Decision Tree Classifier to predict the presence of heart disease from patient medical data. It covers the complete machine learning workflow: exploratory data analysis (EDA), preprocessing, model building, evaluation, and hyperparameter tuning.
- Apply Decision Tree Classification on a real-world dataset
- Perform data preprocessing and feature engineering
- Evaluate model performance using multiple metrics
- Optimize model using hyperparameter tuning
- Interpret results using visualization and feature importance
- File: `heart_disease.xlsx`
- Contains features such as:
  - Age, Sex
  - Chest pain type (`cp`)
  - Resting ECG (`restecg`)
  - Cholesterol, Blood Pressure
  - Maximum heart rate
  - Target variable (`num`) → Heart disease presence
- Checked dataset structure using `.info()` and `.describe()`
- Identified missing values and duplicates
- Detected outliers using:
  - Boxplots
  - IQR (Interquartile Range) method
- Visualizations performed:
  - Histograms for feature distribution
  - Pairplot for relationships
  - Correlation heatmap
  - Target class distribution
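The checks listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame and its values are made up to stand in for `heart_disease.xlsx`, and `iqr_outlier_mask` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are invented for illustration).
df = pd.DataFrame({
    "age": [63, 45, 54, 54, 39, 70],
    "chol": [233.0, 250.0, np.nan, 240.0, 220.0, 564.0],
})
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)  # append one duplicate row

df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns

missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


chol_outliers = int(iqr_outlier_mask(df["chol"]).sum())
print(missing_per_column, duplicate_rows, chol_outliers, sep="\n")
```

The multiplier `k = 1.5` is the conventional choice for the IQR rule; the project may have used a different threshold.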
- Removed outliers using IQR method
- Applied capping (winsorization) to limit extreme values
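One way to cap extreme values is to clip each numeric column to its IQR fences. The sketch below assumes that approach; `cap_with_iqr` is a hypothetical helper and the cholesterol values are invented. Classical winsorization clips at fixed percentiles instead, so the project's exact method may differ.

```python
import pandas as pd


def cap_with_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Invented cholesterol readings with one extreme value.
chol = pd.Series([220.0, 233.0, 240.0, 250.0, 250.0, 564.0], name="chol")
chol_capped = cap_with_iqr(chol)
print(chol_capped.tolist())
```

Capping keeps every row, which matters on a small medical dataset where dropping records can remove a noticeable share of the data.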
- Label Encoding for categorical variables
- One-Hot Encoding using `pd.get_dummies()`
- Applied Standard Scaling using `StandardScaler`
- Converted target variable into binary:
  - 0 → No Disease
  - 1 → Disease
- Split dataset into:
  - 80% Training
  - 20% Testing
- Trained using `DecisionTreeClassifier`
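The preprocessing and training steps above might be wired together as in the sketch below. It runs on synthetic data, and several details are assumptions rather than facts from the notebook: that `num` is 0 for no disease and positive otherwise, the category names for `cp`, the choice of `random_state`, and the use of a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for heart_disease.xlsx (all values are random).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(30, 80, size=n).astype(float),
    "chol": rng.normal(240, 40, size=n),
    "cp": rng.choice(["typical", "atypical", "non-anginal", "asymptomatic"], size=n),
    "num": rng.integers(0, 5, size=n),
})

# Binary target: 0 = no disease, 1 = disease (assumes num > 0 means disease).
y = (df["num"] > 0).astype(int)

# One-hot encode the nominal column.
X = pd.get_dummies(df.drop(columns="num"), columns=["cp"])

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training split only, then reuse it on the test split.
numeric_cols = ["age", "chol"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

Fitting `StandardScaler` on the training split alone avoids leaking test-set statistics into the model. Decision trees split on thresholds, so scaling does not change their predictions; it is kept here only because the workflow lists it.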
Performance evaluated using:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
- ROC-AUC Score
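The metrics listed above are all available in `sklearn.metrics`. The following sketch computes them for a tree trained on synthetic data from `make_classification`; the data, `max_depth=4`, and the `random_state` values are placeholders, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the heart disease set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),  # needs scores, not hard labels
}
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

For a screening task such as heart disease prediction, recall on the positive class is often weighted more heavily than accuracy, because a missed case usually costs more than a false alarm.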
Visualizations:
- ROC Curve
- Decision Tree Structure
Used GridSearchCV to optimize:
- `criterion` (gini / entropy)
- `max_depth`
- `min_samples_split`
- `min_samples_leaf`

→ Improved model performance after tuning
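A grid search over these four parameters could look like the sketch below. The candidate values, `cv=5`, and `scoring="f1"` are assumptions chosen for illustration, and the data is synthetic; the notebook's actual grid is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed heart disease features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical candidate values for the four tuned parameters.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation on the training split
    scoring="f1",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
print("F1 on the held-out test split:", round(search.score(X_test, y_test), 3))
```

Comparing the tuned model against the default tree on the same held-out split is what would back up a claim of improvement after tuning.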
- Visualized decision tree using `plot_tree()`
- Extracted feature importance to identify key predictors
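Both interpretation steps can be sketched as follows. The feature names and data are synthetic placeholders, the class labels are borrowed from the target description, and the `Agg` backend is used only so the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Placeholder feature names and synthetic data.
feature_names = [f"feature_{i}" for i in range(5)]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Draw the fitted tree; a shallow max_depth keeps the plot readable.
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(
    clf,
    feature_names=feature_names,
    class_names=["No Disease", "Disease"],
    filled=True,
    ax=ax,
)
# fig.savefig("decision_tree.png") would write the figure to disk.
plt.close(fig)

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

`feature_importances_` is impurity-based and sums to 1 across features; it reflects how much each feature reduced impurity in this particular tree, not a causal effect.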
- Understanding how Decision Trees split data using Gini and Entropy
- Importance of hyperparameter tuning to avoid overfitting
- Handling outliers improves model stability
- Difference between Label Encoding and One-Hot Encoding
- Model interpretability using tree visualization and feature importance
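To make the Gini and entropy criteria concrete, here is a small worked example in plain Python. The labels and the candidate split are invented; the function names are illustrative and not taken from the notebook.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# A balanced parent node and one candidate split (made-up labels).
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

gini_gain = gini(parent) - weighted_impurity(left, right, gini)
info_gain = entropy(parent) - weighted_impurity(left, right, entropy)

print(f"Gini: parent={gini(parent):.3f}, gain={gini_gain:.3f}")
print(f"Entropy: parent={entropy(parent):.3f}, gain={info_gain:.3f}")
```

At each node the tree picks the split with the largest impurity reduction under the chosen `criterion`. The two measures usually rank splits similarly, which is why switching between them often changes the tree only slightly.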
- `max_depth` → Controls tree depth (overfitting vs underfitting)
- `min_samples_split` → Minimum samples to split a node
- `min_samples_leaf` → Minimum samples in a leaf node
- `criterion` → Split quality (gini / entropy)

Proper tuning balances bias and variance.
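The effect of `max_depth` on the bias-variance trade-off can be seen by comparing a shallow tree with an unrestricted one. This sketch uses synthetic data with deliberately noisy labels (`flip_y=0.2`); the sizes and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: 20% of labels are randomly reassigned.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

results = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[depth] = (
        clf.get_depth(),
        clf.score(X_train, y_train),
        clf.score(X_test, y_test),
    )

for depth, (actual_depth, train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: depth={actual_depth}, "
          f"train={train_acc:.3f}, test={test_acc:.3f}")
```

The unrestricted tree memorizes the training split, so a large gap between its training and test accuracy is the typical sign of overfitting that depth and leaf-size limits are meant to reduce.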
| Feature | Label Encoding | One-Hot Encoding |
|---|---|---|
| Output | Integer values | Binary columns |
| Use Case | Ordinal data | Nominal data |
| Risk | Introduces false order | High dimensionality |
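The difference in the table can be seen on a single nominal column. The category names below are illustrative values for `cp`, not necessarily those in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative chest pain categories (assumed values).
cp = pd.Series(["typical", "atypical", "non-anginal", "typical"], name="cp")

# Label encoding: one integer per category, assigned in sorted order.
label_encoded = LabelEncoder().fit_transform(cp)
print(label_encoded.tolist())  # [2, 0, 1, 2]

# One-hot encoding: one binary column per category, no implied order.
one_hot = pd.get_dummies(cp, prefix="cp")
print(one_hot.columns.tolist())  # ['cp_atypical', 'cp_non-anginal', 'cp_typical']
```

The integers 0, 1, 2 suggest an ordering that chest pain types do not have, which is the "false order" risk in the table. scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the equivalent meant for feature columns.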
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
```bash
git clone https://github.com/MeghanaCVarghese/Decision-Tree-Classifier
pip install -r requirements.txt
jupyter notebook
```
Meghana C Varghese




