Project 2: Decision Tree Classifier for Real-World Datasets

Overview

This repository contains the implementation of Project 2 for the course Introduction to Artificial Intelligence (CSC14003), offered by the Faculty of Information Technology, University of Science. The project builds and evaluates Decision Tree classifiers with the scikit-learn library on three datasets:

  • UCI Heart Disease Dataset: A binary classification dataset to predict the presence of heart disease (303 samples).
  • Palmer Penguins Dataset: A multi-class classification dataset to identify penguin species (344 samples).
  • Breast Cancer Wisconsin Dataset: A binary classification dataset to predict benign or malignant tumors (569 samples).

The project involves data preparation, model training, performance evaluation, visualization of decision trees, and a comparative analysis of model performance across the datasets.

Table of Contents

  1. Project Structure
  2. Requirements
  3. Setup Instructions
  4. How to Run
  5. Datasets
  6. Results
  7. License

Project Structure

project2_decision_tree/
│
├── notebooks/
│   ├── heart_disease.ipynb       # Analysis for UCI Heart Disease dataset
│   ├── palmer_penguins.ipynb    # Analysis for Palmer Penguins dataset
│   ├── breast_cancer.ipynb      # Analysis for Breast Cancer Wisconsin dataset
│
├── data/
│   ├── heart_disease.csv        # UCI Heart Disease dataset (optional)
│   ├── penguins.csv             # Palmer Penguins dataset (optional)
│   ├── breast_cancer.csv        # Breast Cancer Wisconsin dataset (optional)
│
├── report/
│   ├── project_report.pdf       # Final report with analysis and insights
│
├── README.md                    # This file
└── requirements.txt             # Dependencies for pip

Requirements

The project requires Python 3.8+ and the following libraries:

  • scikit-learn
  • pandas
  • matplotlib
  • seaborn
  • graphviz (Python bindings for Graphviz; the package is named graphviz on pip and python-graphviz on conda)
  • jupyter
  • nbconvert

Setup Instructions

Using pip

  1. Install Python:

    • Download Python 3.8+ from python.org.

    • Verify:

      python --version
  2. Create a virtual environment (recommended):

    python -m venv decision_tree_env
    source decision_tree_env/bin/activate  # On Windows: decision_tree_env\Scripts\activate
  3. Install dependencies (run from the repository root, where requirements.txt is located):

    pip install -r requirements.txt

    Or manually:

    pip install scikit-learn pandas matplotlib seaborn jupyter graphviz nbconvert
  4. Install Graphviz:

    • Download from graphviz.org.

    • Add Graphviz to PATH (e.g., C:\Program Files\Graphviz\bin on Windows).

    • Verify:

      dot -V
  5. Verify libraries (run in a Python interpreter):

    import sklearn, pandas, matplotlib, seaborn, graphviz
    print("Libraries installed successfully!")
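A slightly more informative check is sketched below. It is only an illustration, not part of the repository: the function name check_environment and the "dot executable" key are hypothetical, and the module list simply mirrors the import line above. It reports each package's version and whether the Graphviz dot executable is reachable on PATH, since the Python graphviz package needs that executable to render trees.

```python
import importlib
import shutil

# Import names of the packages listed under Requirements (not pip names).
REQUIRED = ["sklearn", "pandas", "matplotlib", "seaborn", "graphviz"]


def check_environment(modules=REQUIRED):
    """Return {module name: version string or None}, plus the path to `dot`."""
    report = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
            # Some modules do not expose __version__; fall back to a marker.
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None  # not installed in this environment
    # The Python graphviz package only wraps the Graphviz `dot` executable,
    # so check separately that the executable is on PATH.
    report["dot executable"] = shutil.which("dot")
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name:15s} {value if value else 'MISSING'}")
```

Any entry printed as MISSING points to a step above that should be repeated (pip install for a package, the PATH setup for dot).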

How to Run

  1. Clone the repository:

    git clone https://github.com/Low-rain-falls/CSC14003_Decision_Tree.git
    cd CSC14003_Decision_Tree
  2. Activate the virtual environment created in Setup Instructions:

    source decision_tree_env/bin/activate  # On Windows: decision_tree_env\Scripts\activate
  3. Launch Jupyter Notebook:

    jupyter notebook
  4. Run notebooks:

    • Open notebooks/heart_disease.ipynb, palmer_penguins.ipynb, or breast_cancer.ipynb in the Jupyter interface.
    • Execute all cells to perform the analysis.
  5. Export to PDF (optional; nbconvert's PDF export also needs Pandoc and a TeX distribution that provides XeLaTeX):

    jupyter nbconvert --to pdf notebooks/[notebook_name].ipynb

Datasets

  1. UCI Heart Disease:

    • Source: UCI Machine Learning Repository
    • Description: 303 samples, 13 features, binary classification (0: no heart disease, 1: heart disease).
    • Access: Loaded via URL or data/heart_disease.csv.
  2. Palmer Penguins:

    • Source: the seaborn sample datasets or the Palmer Penguins project
    • Description: 344 samples, 6 features, 3 classes (Adelie, Chinstrap, Gentoo).
    • Access: Loaded via seaborn.load_dataset('penguins') or data/penguins.csv.
  3. Breast Cancer Wisconsin:

    • Source: UCI Machine Learning Repository
    • Description: 569 samples, 30 features, binary classification (benign or malignant).
    • Access: Loaded via sklearn.datasets.load_breast_cancer() or data/breast_cancer.csv.
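The two library-based access paths listed above can be sketched as follows. This is a minimal illustration rather than the notebooks' actual code: the helper names are hypothetical, and dropping incomplete rows and one-hot encoding the categorical penguin columns are assumed preprocessing choices. The Heart Disease dataset is omitted because its URL is not stated here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def load_breast_cancer_frame():
    """Load Breast Cancer Wisconsin as (features DataFrame, target Series)."""
    data = load_breast_cancer(as_frame=True)
    return data.data, data.target


def load_penguins_frame():
    """Load Palmer Penguins via seaborn (downloads the data on first use)."""
    import seaborn as sns  # imported lazily: only this dataset needs it

    # Assumption: rows with missing values are dropped before training.
    penguins = sns.load_dataset("penguins").dropna()
    # Assumption: categorical columns (island, sex) are one-hot encoded.
    X = pd.get_dummies(penguins.drop(columns="species"))
    y = penguins["species"]
    return X, y


X, y = load_breast_cancer_frame()
print(X.shape)  # (569, 30)
```

Keeping each loader in its own function lets a notebook fall back to the CSV files under data/ (for example with pandas.read_csv) without changing the rest of the analysis.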

Results

  • Notebooks:
    • Data preparation with stratified train/test splits (40/60, 60/40, 80/20, 90/10).
    • Decision tree training and visualization using Graphviz.
    • Evaluation with classification reports and confusion matrices.
    • Analysis of max_depth impact on accuracy (80/20 split).
  • Report (report/project_report.pdf):
    • Visualizations and statistical results.
    • Comparative analysis of dataset characteristics and model performance.
    • Insights on decision tree behavior across datasets.
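The notebook workflow listed above can be sketched on the Breast Cancer Wisconsin dataset as follows. This is a hedged illustration, not the notebooks' code: the seed, the depth values, the helper names, and the use of scikit-learn's default split criterion are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = load_breast_cancer()
X, y = data.data, data.target

TRAIN_FRACTIONS = [0.4, 0.6, 0.8, 0.9]  # the 40/60, 60/40, 80/20, 90/10 splits
SEED = 42  # assumption: any fixed seed works, it only makes runs repeatable


def split(train_fraction):
    """Shuffled, stratified split that preserves the class proportions."""
    return train_test_split(
        X, y, train_size=train_fraction, stratify=y, shuffle=True, random_state=SEED
    )


def fit_tree(X_train, y_train, max_depth=None):
    """Fit a decision tree; max_depth=None lets the tree grow fully."""
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=SEED)
    return clf.fit(X_train, y_train)


# 1. Train and evaluate one tree per split proportion.
for fraction in TRAIN_FRACTIONS:
    X_train, X_test, y_train, y_test = split(fraction)
    y_pred = fit_tree(X_train, y_train).predict(X_test)
    print(f"train/test = {fraction:.0%}/{1 - fraction:.0%}")
    print(classification_report(y_test, y_pred, target_names=list(data.target_names)))
    print(confusion_matrix(y_test, y_pred))

# 2. Effect of max_depth on test accuracy, using the 80/20 split.
X_train, X_test, y_train, y_test = split(0.8)
depth_accuracy = {}
for depth in [None, 2, 3, 4, 5, 6, 7]:  # assumption: illustrative depth values
    clf = fit_tree(X_train, y_train, max_depth=depth)
    depth_accuracy[depth] = accuracy_score(y_test, clf.predict(X_test))
print(depth_accuracy)

# 3. Export the last fitted tree as Graphviz DOT source (a plain string).
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
    rounded=True,
)
```

With the Graphviz binaries on PATH, graphviz.Source(dot_source).render("tree", format="png") writes the tree to an image file; inside a notebook, displaying graphviz.Source(dot_source) shows it inline.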

License

This project is for educational purposes only. Datasets are used under their respective public licenses.
