Project 2: Decision Tree Classifier for Real-World Datasets

Overview

This repository contains the implementation of Project 2 for the course Introduction to Artificial Intelligence (CSC14003) at the University of Science, Faculty of Information Technology. The project builds and evaluates Decision Tree classifiers with the scikit-learn library on three datasets:

- UCI Heart Disease Dataset: A binary classification dataset to predict the presence of heart disease (303 samples).
- Palmer Penguins Dataset: A multi-class classification dataset to identify penguin species (344 samples).
- Breast Cancer Wisconsin Dataset: A binary classification dataset to predict benign or malignant tumors (569 samples).

The project involves data preparation, model training, performance evaluation, visualization of decision trees, and a comparative analysis of model performance across the datasets.

Project Structure

```
project2_decision_tree/
│
├── notebooks/
│   ├── heart_disease.ipynb      # Analysis for UCI Heart Disease dataset
│   ├── palmer_penguins.ipynb    # Analysis for Palmer Penguins dataset
│   └── breast_cancer.ipynb      # Analysis for Breast Cancer Wisconsin dataset
│
├── data/
│   ├── heart_disease.csv        # UCI Heart Disease dataset (optional)
│   ├── penguins.csv             # Palmer Penguins dataset (optional)
│   └── breast_cancer.csv        # Breast Cancer Wisconsin dataset (optional)
│
├── report/
│   └── project_report.pdf       # Final report with analysis and insights
│
├── README.md                    # This file
└── requirements.txt             # Dependencies for pip
```

Requirements

The project requires Python 3.8+ and the following libraries:

scikit-learn
pandas
matplotlib
seaborn
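graphviz (Python bindings; the Graphviz system binaries are installed separately, see Setup Instructions)
python-graphviz (the conda package name for the same bindings; with pip, graphviz is sufficient)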
jupyter
nbconvert

Setup Instructions

Using pip

Install Python:
- Download Python 3.8+ from python.org.
- Verify:
```
python --version
```

Create a virtual environment (recommended):

```
python -m venv decision_tree_env
source decision_tree_env/bin/activate  # On Windows: decision_tree_env\Scripts\activate
```

Install dependencies:

```
pip install -r requirements.txt
```

Or manually:

```
pip install scikit-learn pandas matplotlib seaborn jupyter graphviz nbconvert
```

Install Graphviz:
- Download from graphviz.org.
- Add Graphviz to PATH (e.g., C:\Program Files\Graphviz\bin on Windows).
- Verify:
```
dot -V
```

Verify libraries:

```
import sklearn, pandas, matplotlib, seaborn, graphviz
print("Libraries installed successfully!")
```
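The import check above confirms only that the packages load. A slightly fuller smoke test, sketched here under assumptions, also looks for the `dot` executable and exports a small tree to DOT source. The dataset choice, the `max_depth=2` setting, and the helper name `render_tree` are illustrative and not taken from the notebooks.

```python
import shutil

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Is the Graphviz `dot` executable reachable on PATH?
dot_path = shutil.which("dot")
print("dot executable:", dot_path or "not found on PATH")

# Fit a tiny tree on a dataset bundled with scikit-learn (no download needed).
data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(data.data, data.target)

# Export to DOT source. This step needs only scikit-learn, not the dot binary.
dot_source = export_graphviz(
    clf,
    out_file=None,  # return the DOT text instead of writing a file
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
)
print(dot_source.splitlines()[0])


def render_tree(dot_text, filename="smoke_test_tree"):
    """Render DOT text to a PNG file (hypothetical helper).

    Needs both the graphviz Python package and the dot binary.
    """
    import graphviz  # imported lazily so the checks above run without it

    return graphviz.Source(dot_text).render(filename, format="png", cleanup=True)
```

If `dot` was found, calling `render_tree(dot_source)` should write `smoke_test_tree.png` to the current directory.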

How to Run

Clone the repository:

```
git clone https://github.com/Low-rain-falls/CSC14003_Decision_Tree.git
cd CSC14003_Decision_Tree
```

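Activate the virtual environment created in Setup Instructions (create it first if you have not already):

```
source decision_tree_env/bin/activate  # On Windows: decision_tree_env\Scripts\activate
```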

Launch Jupyter Notebook:
```
jupyter notebook
```
Run notebooks:
- Open notebooks/heart_disease.ipynb, palmer_penguins.ipynb, or breast_cancer.ipynb in the Jupyter interface.
- Execute all cells to perform the analysis.

Export to PDF (optional; nbconvert's PDF export relies on Pandoc and a LaTeX installation that provides XeLaTeX):

```
jupyter nbconvert --to pdf notebooks/[notebook_name].ipynb
```

Datasets

UCI Heart Disease:
- Source: UCI Machine Learning Repository
- Description: 303 samples, 13 features, binary classification (0: no heart disease, 1: heart disease).
- Access: Loaded via URL or data/heart_disease.csv.
Palmer Penguins:
- Source: seaborn (bundled dataset) or the Palmer Penguins project
- Description: 344 samples, 6 features, 3 classes (Adelie, Chinstrap, Gentoo).
- Access: Loaded via seaborn.load_dataset('penguins') or data/penguins.csv.
Breast Cancer Wisconsin:
- Source: UCI Machine Learning Repository
- Description: 569 samples, 30 features, binary classification (benign or malignant).
- Access: Loaded via sklearn.datasets.load_breast_cancer() or data/breast_cancer.csv.
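Of the three datasets, Palmer Penguins needs the most preparation before a scikit-learn tree can use it, because it contains missing values and string-valued columns. The sketch below shows one possible approach, assuming the column layout of `seaborn.load_dataset('penguins')`. The helper name `prepare_penguins`, the choice to drop incomplete rows, and the one-hot encoding are assumptions, not necessarily what the notebook does.

```python
import pandas as pd


def prepare_penguins(df: pd.DataFrame):
    """Return (X, y) suitable for a scikit-learn decision tree.

    Assumes the seaborn column layout: 'species' is the target and the
    string-valued features are 'island' and 'sex'.
    """
    # Dropping incomplete rows is the simplest way to avoid NaN handling.
    clean = df.dropna().reset_index(drop=True)
    y = clean["species"]
    # One-hot encode the string-valued columns; numeric ones pass through.
    X = pd.get_dummies(clean.drop(columns="species"), columns=["island", "sex"])
    return X, y


# Tiny hand-made frame in the seaborn layout, so the sketch runs offline.
demo = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Torgersen"],
    "bill_length_mm": [39.1, 46.1, 46.5, None],
    "bill_depth_mm": [18.7, 13.2, 17.9, None],
    "flipper_length_mm": [181.0, 211.0, 192.0, None],
    "body_mass_g": [3750.0, 4500.0, 3500.0, None],
    "sex": ["Male", "Female", "Female", None],
})
X, y = prepare_penguins(demo)
print(X.shape, list(y))
```

With the real data, the same helper could be called as `prepare_penguins(seaborn.load_dataset('penguins'))` or on `pandas.read_csv('data/penguins.csv')`, provided the CSV uses the same column names.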

Results

Notebooks:
- Data preparation with stratified train/test splits (40/60, 60/40, 80/20, 90/10).
- Decision tree training and visualization using Graphviz.
- Evaluation with classification reports and confusion matrices.
- Analysis of max_depth impact on accuracy (80/20 split).
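The split, train, and evaluate steps listed above can be sketched as follows for the Breast Cancer Wisconsin dataset. This is a minimal illustration rather than the notebooks' actual code: the random seed, the default tree hyperparameters, and the variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Train fractions for the 40/60, 60/40, 80/20 and 90/10 train/test splits.
train_fractions = [0.4, 0.6, 0.8, 0.9]
results = {}

for frac in train_fractions:
    # stratify=y keeps the class proportions equal in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=frac, stratify=y, shuffle=True, random_state=42
    )
    clf = DecisionTreeClassifier(random_state=42)  # default hyperparameters
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    results[frac] = confusion_matrix(y_test, y_pred)
    print(f"--- train/test = {frac:.0%}/{1 - frac:.0%} ---")
    print(classification_report(y_test, y_pred, target_names=data.target_names))
    print(results[frac])
```

Each confusion matrix has true classes as rows and predicted classes as columns, in the order given by `data.target_names`.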
Report (report/project_report.pdf):
- Visualizations and statistical results.
- Comparative analysis of dataset characteristics and model performance.
- Insights on decision tree behavior across datasets.
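The max_depth analysis on the 80/20 split, listed among the notebook results, might look roughly like the sketch below. The depth grid, the seed, and the use of the Breast Cancer Wisconsin dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Single stratified 80/20 train/test split, reused for every depth.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

# None means the tree grows until its leaves are pure (assumed grid).
depths = [None, 2, 3, 4, 5, 6, 7]
accuracy_by_depth = {}

for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    accuracy_by_depth[depth] = accuracy_score(y_test, clf.predict(X_test))

for depth, acc in accuracy_by_depth.items():
    print(f"max_depth={str(depth):>4}: test accuracy={acc:.3f}")
```

Because every depth is scored on the same test set, differences between rows reflect the depth limit rather than the split.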

License

This project is for educational purposes only. Datasets are used under their respective public licenses.
