- Project Overview
- Dataset
- Methodology
- Tabular Data
- Image Data
- Ensemble Models
- Evaluation
- Results
This repository contains the implementation for our machine learning project, Skin Cancer Detection with 3D-TBP, developed for the ISIC 2024 Challenge. The goal is to build machine learning models that accurately distinguish malignant from benign skin lesions using both images and metadata, even in low-quality, smartphone-like photographs.
- ISIC 2024 Challenge Dataset
- Over 400,000 images of individual skin lesions.
- Metadata with 54 features, including patient demographics, lesion characteristics, and diagnostic labels.
- External Datasets
- Malignant cases are significantly underrepresented.
- Preprocessing Techniques
- Handling Missing Values: Imputation (e.g., mean filling) for numerical features.
- Categorical Features: One-hot encoding for variables like sex and lesion location.
- Normalization: Scaling features to ensure balanced contributions.
- Feature Selection: Correlation matrix and PCA for dimensionality reduction.
- Balancing Classes: Applied SMOTE and ADASYN to oversample malignant cases.
- Models Explored
- Random Forest, Extra Trees, XGBoost, and LightGBM.
- LightGBM and XGBoost achieved the best results.
- Optimization
- Cross-validation (5-fold stratified).
- Bayesian optimization for hyperparameter tuning.
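
The 5-fold stratified cross-validation mentioned above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function name `stratified_kfold`, the seed, and the toy labels are assumptions made for the example. In practice a library splitter such as scikit-learn's `StratifiedKFold` would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that keep class ratios similar per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into folds
    # so every fold receives its share of the rare class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, val_idx


# Toy labels (hypothetical): 10 malignant (1) among 100 lesions.
labels = [1] * 10 + [0] * 90
splits = list(stratified_kfold(labels, n_splits=5, seed=42))
for train_idx, val_idx in splits:
    n_malignant = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), n_malignant)
```

Stratifying matters here because malignant cases are rare: a plain random split could leave a validation fold with almost no positives, which makes the fold score unreliable.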
- Preprocessing Techniques
- Hair Removal: Used the DullRazor Algorithm to remove hair artifacts.
- Image Resizing: All images resized to 224x224 pixels to ensure uniform input.
- Data Augmentation: Random horizontal and vertical flips; random resized cropping.
- Normalization: Applied mean and standard deviation values to match pre-trained model requirements.
- Models Explored
- Cross-validation
Incorporated stratified cross-validation to improve robustness and minimize overfitting.
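
The flip augmentation and normalization steps from the image preprocessing list can be illustrated with a small dependency-free sketch. It assumes the pre-trained models expect ImageNet channel statistics, which the text above does not state explicitly, and it represents an image as nested lists of `(r, g, b)` tuples purely for illustration. All function names are hypothetical; a real pipeline would more likely use an image library's transforms.

```python
import random

# Assumption: ImageNet per-channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]


def random_flips(image, rng, p=0.5):
    """Apply each flip independently with probability p."""
    if rng.random() < p:
        image = hflip(image)
    if rng.random() < p:
        image = vflip(image)
    return image


def normalize(image, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale 0-255 pixels to [0, 1], then standardize each channel."""
    return [
        [
            tuple((v / 255.0 - m) / s for v, m, s in zip(pixel, mean, std))
            for pixel in row
        ]
        for row in image
    ]


# Tiny 2x2 RGB example image (hypothetical values).
sample = [
    [(0, 0, 0), (255, 255, 255)],
    [(10, 20, 30), (40, 50, 60)],
]
augmented = random_flips(sample, random.Random(0))
normalized = normalize(augmented)
print(normalized[0][0])
```

Flips are applied only at training time; normalization is applied to every image so that inputs match the statistics the pre-trained weights were fitted on.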
Combined predictions from tabular and image models for better overall performance.
- Arithmetic Mean
- Geometric Mean
- Soft Voting
- Stacking
- Bagging
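
The two simplest combination rules in the list above, the arithmetic and geometric mean, can be sketched as follows. This is an illustrative example under assumptions: the names `tabular_probs` and `image_probs` and their values are hypothetical, not outputs of the project's models. With equal weights, soft voting over predicted probabilities reduces to the arithmetic mean; stacking and bagging need trained meta-models or resampling and are not shown.

```python
import math


def arithmetic_mean(predictions):
    """Average the probabilities of several models, sample by sample."""
    return [sum(column) / len(column) for column in zip(*predictions)]


def geometric_mean(predictions, eps=1e-12):
    """Geometric mean per sample; eps guards against log(0)."""
    return [
        math.exp(sum(math.log(max(p, eps)) for p in column) / len(column))
        for column in zip(*predictions)
    ]


# Hypothetical malignancy probabilities for three lesions.
tabular_probs = [0.90, 0.10, 0.40]
image_probs = [0.40, 0.10, 0.90]

blend_arith = arithmetic_mean([tabular_probs, image_probs])
blend_geo = geometric_mean([tabular_probs, image_probs])
print(blend_arith)
print(blend_geo)
```

The geometric mean is never larger than the arithmetic mean, so it pulls the blended score down when the models disagree and rewards lesions that both models rate highly.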
- Metric: Partial area under the ROC curve (pAUC) above an 80% true positive rate (TPR). Because only the region with TPR of at least 0.8 is counted, scores range from 0.0 to 0.2.
- Implementation: ISIC pAUC above TPR
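
To make the metric concrete, here is a minimal dependency-free sketch that integrates the part of the ROC curve lying above the 80% TPR line. It is not the official ISIC scoring code; the function names `roc_points` and `pauc_above_tpr` are made up for this example, and tied scores are handled by grouping them into a single ROC step.

```python
def roc_points(y_true, y_score):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    pairs = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all samples sharing this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def pauc_above_tpr(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr."""
    points = roc_points(y_true, y_score)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue  # Segment lies entirely at or below the threshold.
        if y0 < min_tpr:
            # Clip the segment to start where it crosses the threshold.
            t = (min_tpr - y0) / (y1 - y0)
            x0 = x0 + t * (x1 - x0)
            y0 = min_tpr
        # Trapezoid rule on the height above the threshold.
        area += (x1 - x0) * ((y0 - min_tpr) + (y1 - min_tpr)) / 2.0
    return area


# Hypothetical labels and scores: a perfectly separating model.
y_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]
print(pauc_above_tpr(y_true, perfect_scores))
```

A perfect ranking reaches the maximum of about 0.2, while a model that gives every lesion the same score lands near 0.02, which is the area of the small triangle above the 0.8 line on the diagonal.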
- Best Model: XGBoost
- Results on Kaggle Submission



