This repository contains the implementation of Project 2 for the course Introduction to Artificial Intelligence (CSC14003) at the University of Science, Faculty of Information Technology. The project focuses on building and evaluating Decision Tree classifiers using the scikit-learn library on three datasets:
- UCI Heart Disease Dataset: A binary classification dataset to predict the presence of heart disease (303 samples).
- Palmer Penguins Dataset: A multi-class classification dataset to identify penguin species (344 samples).
- Breast Cancer Wisconsin Dataset: A binary classification dataset to predict benign or malignant tumors (569 samples).
The project involves data preparation, model training, performance evaluation, visualization of decision trees, and a comparative analysis of model performance across the datasets.
project2_decision_tree/
│
├── notebooks/
│ ├── heart_disease.ipynb # Analysis for UCI Heart Disease dataset
│ ├── palmer_penguins.ipynb # Analysis for Palmer Penguins dataset
│ ├── breast_cancer.ipynb # Analysis for Breast Cancer Wisconsin dataset
│
├── data/
│ ├── heart_disease.csv # UCI Heart Disease dataset (optional)
│ ├── penguins.csv # Palmer Penguins dataset (optional)
│ ├── breast_cancer.csv # Breast Cancer Wisconsin dataset (optional)
│
├── report/
│ ├── project_report.pdf # Final report with analysis and insights
│
├── README.md # This file
└── requirements.txt # Dependencies for pip
The project requires Python 3.8+ and the following libraries:
- scikit-learn
- pandas
- matplotlib
- seaborn
- graphviz
- python-graphviz
- jupyter
- nbconvert
- Install Python:
  - Download Python 3.8+ from python.org.
  - Verify:

        python --version

- Create a virtual environment (recommended):

      python -m venv decision_tree_env
      source decision_tree_env/bin/activate # On Windows: decision_tree_env\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

  Or manually:

      pip install scikit-learn pandas matplotlib seaborn jupyter graphviz nbconvert

- Install Graphviz:
  - Download from graphviz.org.
  - Add Graphviz to PATH (e.g., C:\Program Files\Graphviz\bin on Windows).
  - Verify:

        dot -V

- Verify libraries:

      import sklearn, pandas, matplotlib, seaborn, graphviz
      print("Libraries installed successfully!")
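
The checks above can also be gathered into one script. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it only reports what it finds rather than installing anything. It treats the Graphviz `dot` binary separately from the Python `graphviz` package, since either can be present without the other.

```python
import importlib
import shutil
import sys

# Import names for the libraries listed in the requirements.
# Note: the pip package "scikit-learn" is imported as "sklearn".
REQUIRED_MODULES = ["sklearn", "pandas", "matplotlib", "seaborn", "graphviz"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each check to a version/path, or None if missing."""
    report = {"python": sys.version.split()[0]}
    for name in modules:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None
    # The Graphviz system binary is separate from the Python "graphviz" package.
    report["dot"] = shutil.which("dot")
    return report


if __name__ == "__main__":
    for item, status in check_environment().items():
        print(f"{item:12s} {status if status is not None else 'MISSING'}")
```

If `dot` is reported as missing while the `graphviz` module imports fine, the likely cause is that the Graphviz `bin` directory is not on PATH.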
- Clone the repository:

      git clone https://github.com/Low-rain-falls/CSC14003_Decision_Tree.git
      cd CSC14003_Decision_Tree

- Activate the environment:
  - pip:

        python3 -m venv venv
        source venv/bin/activate

- Launch Jupyter Notebook:

      jupyter notebook

- Run notebooks:
  - Open notebooks/heart_disease.ipynb, palmer_penguins.ipynb, or breast_cancer.ipynb in the Jupyter interface.
  - Execute all cells to perform the analysis.
- Export to PDF (optional):

      jupyter nbconvert --to pdf notebooks/[notebook_name].ipynb
- UCI Heart Disease:
  - Source: UCI Machine Learning Repository
  - Description: 303 samples, 13 features, binary classification (0: no heart disease, 1: heart disease).
  - Access: Loaded via URL or data/heart_disease.csv.
- Palmer Penguins:
  - Source: seaborn or Palmer Penguins
  - Description: 344 samples, 6 features, 3 classes (Adelie, Chinstrap, Gentoo).
  - Access: Loaded via seaborn.load_dataset('penguins') or data/penguins.csv.
- Breast Cancer Wisconsin:
  - Source: UCI Machine Learning Repository
  - Description: 569 samples, 30 features, binary classification (benign or malignant).
  - Access: Loaded via sklearn.datasets.load_breast_cancer() or data/breast_cancer.csv.
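
As an illustration of the "local CSV or library loader" access pattern, here is a minimal sketch for the Breast Cancer Wisconsin dataset. The helper name `load_breast_cancer_frame` and the CSV label column name `target` are assumptions made for this example, not taken from the notebooks; adjust them to the actual file layout.

```python
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_breast_cancer


def load_breast_cancer_frame(csv_path="data/breast_cancer.csv", target_col="target"):
    """Load features X and labels y, preferring the optional local CSV.

    The label column name "target" is an assumption about the CSV layout.
    """
    path = Path(csv_path)
    if path.exists():
        df = pd.read_csv(path)
        return df.drop(columns=[target_col]), df[target_col]
    # Fall back to the copy bundled with scikit-learn (no download needed).
    bunch = load_breast_cancer(as_frame=True)
    return bunch.data, bunch.target


X, y = load_breast_cancer_frame()
print(X.shape)           # (569, 30) when the bundled copy is used
print(y.value_counts())  # in scikit-learn's copy, 0 = malignant and 1 = benign
```

The same pattern should carry over to the other two datasets, with one caveat: seaborn.load_dataset('penguins') fetches the file over the network on first use, so the local CSV is the offline option there.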
- Notebooks:
  - Data preparation with stratified train/test splits (40/60, 60/40, 80/20, 90/10).
  - Decision tree training and visualization using Graphviz.
  - Evaluation with classification reports and confusion matrices.
  - Analysis of the impact of max_depth on accuracy (80/20 split).
- Report (report/project_report.pdf):
  - Visualizations and statistical results.
  - Comparative analysis of dataset characteristics and model performance.
  - Insights on decision tree behavior across datasets.
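
The notebook workflow listed above can be sketched roughly as follows, using the Breast Cancer Wisconsin data bundled with scikit-learn. This is an illustrative outline under stated assumptions, not the notebooks' actual code: each split pair is read as train/test, the seed 42 and the list of depths are arbitrary choices, and the classifier uses scikit-learn's default settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = load_breast_cancer()
X, y = data.data, data.target

SPLITS = [(0.4, 0.6), (0.6, 0.4), (0.8, 0.2), (0.9, 0.1)]  # (train, test)
RANDOM_STATE = 42  # assumption: any fixed seed works, 42 is arbitrary

# 1. Stratified splits, training, and evaluation for each proportion.
split_accuracy = {}
for train_frac, test_frac in SPLITS:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_frac, stratify=y, random_state=RANDOM_STATE
    )
    clf = DecisionTreeClassifier(random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    split_accuracy[(train_frac, test_frac)] = accuracy_score(y_test, y_pred)
    print(f"--- {train_frac:.0%}/{test_frac:.0%} split ---")
    print(classification_report(y_test, y_pred, target_names=data.target_names))
    print(confusion_matrix(y_test, y_pred))

# 2. Effect of max_depth on test accuracy, using the 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)
depth_accuracy = {}
for depth in [None, 2, 3, 4, 5, 6, 7]:  # assumption: illustrative depths
    clf = DecisionTreeClassifier(max_depth=depth, random_state=RANDOM_STATE)
    clf.fit(X_train, y_train)
    depth_accuracy[depth] = accuracy_score(y_test, clf.predict(X_test))
print(depth_accuracy)

# 3. Export a shallow tree as Graphviz DOT text (no dot binary needed here).
shallow = DecisionTreeClassifier(max_depth=3, random_state=RANDOM_STATE)
shallow.fit(X_train, y_train)
dot_source = export_graphviz(
    shallow,
    out_file=None,  # return the DOT source as a string
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
    rounded=True,
)
# With the graphviz package and the dot binary installed, the tree can be drawn:
# import graphviz; graphviz.Source(dot_source).render("tree", format="png")
```

Passing `stratify=y` keeps the class proportions of the full dataset in both the training and test sets, which matters most for the small test sets in the 90/10 split.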
This project is for educational purposes only. Datasets are used under their respective public licenses.