This repository contains MATLAB code for colorectal cancer (CRC) tissue classification using wavelet scattering features combined with a Support Vector Machine (SVM) classifier. The code is designed for reproducibility and to document our research observations.
Main_CRC_Classification.m - Main classification pipeline that performs:
- Dataset loading and exploration
- Train/test splitting
- Wavelet scattering feature extraction
- Multi-class SVM classifier training
- Cross-validation and performance evaluation
- Results visualization and saving
Scattering_Path_Analysis_FINAL.m - Wavelet scattering path analysis:
- Documents the scattering network structure
- Retrieves and tabulates scattering paths
- Provides frequency/scale/rotation metadata
- Saves path information for reference
Vizualize_Wavelets.m - Wavelet filter visualization:
- Generates 2D Morlet wavelet filters
- Displays colorized complex wavelets (phase and magnitude)
- Shows scaling function (low-pass filter)
- publication_scattering_disks/ - Standalone implementation for generating publication-quality scattering disk visualizations from H&E stained histopathology images. Includes a complete self-contained toolbox with wavelet filters, scattering transforms, and visualization helpers.
- figures/ - Contains generated visualizations (confusion matrices, performance plots, sample images, wavelet visualizations)
- wavelet_scattering_paths/ - Contains saved scattering path metadata (CSV and MAT files)
- CRC_Classification_Results.mat - Saved classification results including trained model, features, and metrics
- MATLAB R2019b or later recommended
- Wavelet Toolbox - For wavelet scattering transform
- Statistics and Machine Learning Toolbox - For SVM classifier
- Parallel Computing Toolbox (optional) - For faster processing
- Image Processing Toolbox - For image handling
- Kather Texture 2016 Image Tiles (5000 histology images)
- Download from: https://zenodo.org/record/53169
- Dataset contains 8 tissue classes:
- Tumor epithelium
- Simple stroma
- Complex stroma
- Immune cells
- Debris
- Normal mucosal glands
- Adipose tissue
- Background
Download the Kather texture dataset from Zenodo and extract to a known location on your system.
Open Main_CRC_Classification.m and modify line 14:
DATA_PATH = "YOUR/PATH/TO/Kather_texture_2016_image_tiles_5000";
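Before running the full pipeline, it can help to confirm that the path resolves and that the class folders are picked up. The following is a minimal sketch, assuming the extracted dataset has one subfolder per tissue class; the variable names `imds` and `classCounts` are illustrative and not taken from the repository.

```matlab
% DATA_PATH is the placeholder from Main_CRC_Classification.m; replace it with your path.
DATA_PATH = "YOUR/PATH/TO/Kather_texture_2016_image_tiles_5000";

if isfolder(DATA_PATH)
    % Assumption: one subfolder per tissue class, so folder names serve as labels.
    imds = imageDatastore(DATA_PATH, ...
        'IncludeSubfolders', true, ...
        'LabelSource', 'foldernames');
    classCounts = countEachLabel(imds);   % table with Label and Count columns
    disp(classCounts);
    fprintf('%d images in %d classes\n', numel(imds.Files), height(classCounts));
else
    warning('Dataset folder not found: %s', DATA_PATH);
end
```

With the dataset described above, the printout should report 5000 images in 8 classes; anything else suggests an incomplete download or a wrong path.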
Check if required toolboxes are installed in MATLAB:
ver wavelet
ver stats
ver images
ver parallel
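The same check can be done programmatically. This is a small sketch, assuming the product names reported by `ver` match the toolbox names listed in this README; `required` and `optional` are illustrative variable names.

```matlab
% List installed products and compare against the toolboxes named in this README.
installed = ver;                         % struct array, one entry per installed product
installedNames = {installed.Name};

required = {'Wavelet Toolbox', ...
            'Statistics and Machine Learning Toolbox', ...
            'Image Processing Toolbox'};
optional = {'Parallel Computing Toolbox'};

isPresent = ismember(required, installedNames);
for k = 1:numel(required)
    if isPresent(k)
        fprintf('[ok]      %s\n', required{k});
    else
        fprintf('[MISSING] %s\n', required{k});
    end
end
for k = 1:numel(optional)
    if ismember(optional{k}, installedNames)
        fprintf('[ok]      %s (optional)\n', optional{k});
    else
        fprintf('[absent]  %s (optional)\n', optional{k});
    end
end
```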
Run the main classification script:
Main_CRC_Classification
The script will:
- Load and explore the data
- Set up parallel computing (if available)
- Split data into train/test sets (80/20 by default)
- Extract wavelet scattering features
- Train SVM classifier with polynomial kernel
- Perform 10-fold cross-validation
- Test on held-out data
- Generate visualizations
- Save results to CRC_Classification_Results.mat
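The core of these steps can be sketched on synthetic data. This is not the repository's code: the two synthetic texture classes, the spatial averaging of scattering coefficients, and the use of `fitcecoc` as the multi-class wrapper are assumptions made for illustration. Only the parameter values (seed 100, 150x150 images, invariance scale 20, cubic polynomial kernel, 80/20 split, 10 folds) come from this README.

```matlab
rng(100);                                    % seed stated in this README

% --- Synthetic stand-in data (assumption: NOT the Kather tiles) ---
imageSize   = [150 150];
numPerClass = 20;
numImages   = 2 * numPerClass;
images = zeros([imageSize, numImages]);
for i = 1:numPerClass
    images(:, :, i)               = conv2(randn(imageSize), ones(7) / 49, 'same'); % smooth texture
    images(:, :, numPerClass + i) = randn(imageSize);                              % rough texture
end
labels = categorical([repmat("smooth", numPerClass, 1); repmat("rough", numPerClass, 1)]);

% --- Scattering network with the default CONFIG values ---
sn = waveletScattering2('ImageSize', imageSize, ...
                        'InvarianceScale', 20, ...
                        'QualityFactors', [1 1], ...
                        'NumRotations', [6 6]);

% --- One feature vector per image: spatial mean of every scattering path ---
% (assumption: the repository may pool the coefficients differently)
first = featureMatrix(sn, images(:, :, 1));
features = zeros(numImages, size(first, 1));
for i = 1:numImages
    S = featureMatrix(sn, images(:, :, i));          % paths-by-rows-by-cols
    features(i, :) = reshape(mean(S, [2 3]), 1, []);
end

% --- 80/20 stratified hold-out split ---
cv = cvpartition(labels, 'HoldOut', 0.2);
Xtrain = features(training(cv), :);   Ytrain = labels(training(cv));
Xtest  = features(test(cv), :);       Ytest  = labels(test(cv));

% --- z-score normalization using training statistics only ---
mu    = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
sigma(sigma == 0) = 1;                               % guard against constant features
Xtrain = (Xtrain - mu) ./ sigma;
Xtest  = (Xtest  - mu) ./ sigma;

% --- Multi-class SVM with a cubic polynomial kernel ---
learner = templateSVM('KernelFunction', 'polynomial', ...
                      'PolynomialOrder', 3, ...
                      'KernelScale', 'auto', ...
                      'BoxConstraint', 1);
model = fitcecoc(Xtrain, Ytrain, 'Learners', learner);

% --- 10-fold cross-validation on the training set, then held-out test ---
cvModel      = crossval(model, 'KFold', 10);
cvAccuracy   = 1 - kfoldLoss(cvModel);
predicted    = predict(model, Xtest);
testAccuracy = mean(predicted == Ytest);
fprintf('CV accuracy: %.3f, test accuracy: %.3f\n', cvAccuracy, testAccuracy);
```

The normalization statistics are computed on the training split only and then reused for the test split, which keeps test information out of training.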
After running the main classification, analyze the feature space:
Feature_Space_Analysis
This generates t-SNE visualizations showing how well different tissue classes separate in the feature space.
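For orientation, a t-SNE embedding of a feature matrix can be produced roughly as follows. The sketch uses synthetic Gaussian clusters in place of the saved scattering features, and the perplexity value is simply the `tsne` default; the settings used in Feature_Space_Analysis.m may differ.

```matlab
rng(100);                                   % t-SNE is stochastic, so fix the seed

% Synthetic stand-in features: three Gaussian clusters in 50 dimensions
numPerClass = 60;
numDims     = 50;
X = [randn(numPerClass, numDims); ...
     randn(numPerClass, numDims) + 4; ...
     randn(numPerClass, numDims) - 4];
classes = categorical(repelem(["class A"; "class B"; "class C"], numPerClass));

% Embed into two dimensions
Y = tsne(X, 'NumDimensions', 2, 'Perplexity', 30);

% Scatter plot colored by class
figure;
gscatter(Y(:, 1), Y(:, 2), classes);
xlabel('t-SNE dimension 1');
ylabel('t-SNE dimension 2');
title('t-SNE embedding (synthetic example)');
```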
To understand the wavelet scattering network structure:
Scattering_Path_Analysis_FINAL
This documents all scattering paths and saves detailed metadata about the wavelet filters.
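The kind of information involved can be inspected directly on a `waveletScattering2` object. This is a brief sketch using the default CONFIG values from this README; it only prints path counts and is not a substitute for the analysis script.

```matlab
% Scattering network with the default CONFIG values from this README
sn = waveletScattering2('ImageSize', [150 150], ...
                        'InvarianceScale', 20, ...
                        'QualityFactors', [1 1], ...
                        'NumRotations', [6 6]);

% spaths: one table per scattering order; npaths: number of paths per order
[spaths, npaths] = paths(sn);
for order = 0:numel(npaths) - 1
    fprintf('Order %d: %d paths\n', order, npaths(order + 1));
end
fprintf('Total paths: %d\n', sum(npaths));

% Spatial size of the coefficient map produced for each path
coeffSize = coefficientSize(sn);
fprintf('Coefficient map size per path: %d x %d\n', coeffSize(1), coeffSize(2));

% First few first-order paths
disp(head(spaths{2}));
```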
To visualize the Morlet wavelet filters:
Vizualize_Wavelets
This creates plots showing the 2D wavelets at different scales and orientations.
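As an illustration of what such a plot contains, the sketch below builds a single 2D Morlet wavelet by hand and colorizes it with hue for phase and brightness for magnitude. The grid size, envelope width `sigma`, center frequency `xi`, and orientation `theta` are arbitrary illustrative values, and the filters used by Vizualize_Wavelets.m may be constructed differently.

```matlab
N     = 64;          % grid size in pixels (illustrative)
sigma = 6;           % Gaussian envelope width in pixels (illustrative)
xi    = 0.5;         % center frequency in radians per pixel (illustrative)
theta = pi / 6;      % orientation (illustrative)

[x, y] = meshgrid(-N/2 : N/2 - 1);
u = x * cos(theta) + y * sin(theta);            % coordinate along the oscillation direction

envelope = exp(-(x.^2 + y.^2) / (2 * sigma^2)); % Gaussian window
carrier  = exp(1i * xi * u);                    % complex plane wave

% Morlet correction term: subtract a constant so the wavelet sums to zero
beta = sum(envelope(:) .* carrier(:)) / sum(envelope(:));
psi  = envelope .* (carrier - beta);

% Colorize: hue encodes phase, brightness encodes magnitude
H = (angle(psi) + pi) / (2 * pi);
S = ones(size(psi));
V = abs(psi) / max(abs(psi(:)));
rgb = hsv2rgb(cat(3, H, S, V));

figure;
image(rgb);
axis image off;
title(sprintf('Morlet wavelet, \\theta = %.0f degrees', theta * 180 / pi));
```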
The main classification script uses a CONFIG structure with the following default parameters:
- train_ratio: 0.8 (80% training, 20% testing)
- k_fold: 10 (10-fold cross-validation)
- image_size: [150 150] (input image dimensions)
- invariance_scale: 20 (scattering invariance parameter)
- quality_factors: [1 1] (wavelet quality factors)
- num_rotations: [6 6] (number of rotation angles per scale)
- svm_kernel: 'polynomial' (SVM kernel type)
- svm_poly_order: 3 (polynomial kernel order)
- svm_kernel_scale: 'auto' (kernel scale parameter)
- svm_box_constraint: 1 (SVM regularization parameter)
To modify parameters, edit the buildConfig() function in Main_CRC_Classification.m before running.
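For reference, the defaults listed above correspond to a structure along these lines. The field names and values are taken from the list; how buildConfig() actually assembles them is an assumption, so treat this as a sketch rather than a copy of the function.

```matlab
% Default parameters as listed in this README (layout is illustrative)
CONFIG = struct();
CONFIG.train_ratio        = 0.8;           % 80% training, 20% testing
CONFIG.k_fold             = 10;            % 10-fold cross-validation
CONFIG.image_size         = [150 150];     % input image dimensions
CONFIG.invariance_scale   = 20;            % scattering invariance parameter
CONFIG.quality_factors    = [1 1];         % wavelet quality factors
CONFIG.num_rotations      = [6 6];         % rotation angles per scale
CONFIG.svm_kernel         = 'polynomial';  % SVM kernel type
CONFIG.svm_poly_order     = 3;             % polynomial kernel order
CONFIG.svm_kernel_scale   = 'auto';        % kernel scale parameter
CONFIG.svm_box_constraint = 1;             % SVM regularization parameter

disp(CONFIG);
```

Changing a value, for example setting `CONFIG.k_fold = 5`, changes the results, so record any deviation from the defaults alongside the saved output.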
The script provides detailed progress messages for each processing step including:
- Total images loaded and class distribution
- Training/testing set sizes
- Feature extraction progress and dimensions
- Training time
- Cross-validation accuracy
- Test set accuracy
- Per-class performance metrics
Generated figures (saved to figures/ directory):
- Sample_Images.png - Random sample of 20 images from the dataset
- Confusion_Matrix.png - Shows classification performance with row and column normalization
- PerClass_Performance.png - Bar chart of precision, recall, and F1-score for each tissue class
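A confusion matrix with row and column normalization summaries can be drawn with `confusionchart` from the Statistics and Machine Learning Toolbox. The sketch below uses randomly generated labels for three illustrative classes, not results from this project, and the plotting code in the main script may differ.

```matlab
rng(100);

% Synthetic labels for three illustrative classes (not project results)
classNames = ["Tumor", "Stroma", "Immune"];
numSamples = 200;
trueIdx = randi(3, numSamples, 1);
predIdx = trueIdx;
flip = rand(numSamples, 1) < 0.2;            % re-draw roughly 20% of the predictions
predIdx(flip) = randi(3, nnz(flip), 1);

actual    = categorical(trueIdx, 1:3, classNames);
predicted = categorical(predIdx, 1:3, classNames);

% Rows are true classes, columns are predicted classes
figure;
cm = confusionchart(actual, predicted, ...
    'RowSummary', 'row-normalized', ...
    'ColumnSummary', 'column-normalized', ...
    'Title', 'Confusion matrix (synthetic example)');
```

The row summary reports per-class recall and the column summary reports per-class precision.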
Additional visualizations from other scripts:
- t-SNE plots showing feature space separability (from Feature_Space_Analysis.m)
- Wavelet filter visualizations (from Vizualize_Wavelets.m)
CRC_Classification_Results.mat contains:
- config: All configuration parameters
- random_seed: Random seed used (100)
- trainfeatures and testfeatures: Extracted feature matrices
- normalization: Mean and standard deviation used for z-score normalization
- classifier: Trained SVM model
- predicted_labels and actual_labels: Predictions and ground truth
- accuracy: Test set accuracy
- crossval_accuracy: Cross-validation accuracy
- confusion_matrix: Confusion matrix
- precision, recall, f1_score: Per-class performance metrics
- extraction_time, training_time, cv_time, prediction_time: Timing information
To load and use saved results:
load('CRC_Classification_Results.mat');
fprintf('Test Accuracy: %.2f%%\n', results.accuracy);
% new_features: scattering features of new images, extracted and normalized as during training
new_predictions = predict(results.classifier, new_features);
The code is designed for reproducibility with the following features:
- Fixed random seed (RANDOM_SEED = 100) applied to all random operations
- Results saved to file for later analysis
- Configuration parameters centralized in CONFIG structure
- Deterministic train/test splitting
To reproduce exact results:
- Use the same random seed (already set to 100)
- Use MATLAB R2019b or later (recommended)
- Keep default CONFIG parameters
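The effect of the fixed seed can be checked with a short experiment. This sketch resets the generator twice with the seed from this README and confirms that both the random draws and a hold-out partition repeat exactly; the variable names are illustrative.

```matlab
% First run
rng(100);
firstDraw  = rand(1, 5);
firstSplit = test(cvpartition(100, 'HoldOut', 0.2));   % logical mask of test samples

% Second run with the same seed and the same sequence of calls
rng(100);
secondDraw  = rand(1, 5);
secondSplit = test(cvpartition(100, 'HoldOut', 0.2));

fprintf('Draws identical:  %d\n', isequal(firstDraw, secondDraw));
fprintf('Splits identical: %d\n', isequal(firstSplit, secondSplit));
```

Identical results depend on the random calls happening in the same order after the seed is set, so inserting extra random operations before the split changes the partition.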
- Accuracy: Percentage of correctly classified samples
- Precision: Of all samples predicted as class X, how many were actually class X
- Recall: Of all actual class X samples, how many were correctly identified
- F1-Score: Harmonic mean of precision and recall, useful for balanced evaluation
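These definitions translate directly into operations on a confusion matrix. The sketch below uses a small hand-made matrix with rows as actual classes and columns as predicted classes (the convention used by `confusionmat`); the numbers are illustrative and are not results from this project.

```matlab
% Illustrative 3-class confusion matrix: rows = actual, columns = predicted
C = [50  3  2;
      4 40  6;
      1  5 45];

truePositives = diag(C);

accuracy  = sum(truePositives) / sum(C(:));        % correct / total
precision = truePositives ./ sum(C, 1)';           % correct / all predicted as class
recall    = truePositives ./ sum(C, 2);            % correct / all actually in class
f1_score  = 2 * (precision .* recall) ./ (precision + recall);

fprintf('Accuracy: %.4f\n', accuracy);
disp(table(precision, recall, f1_score, ...
    'RowNames', {'class 1', 'class 2', 'class 3'}));
```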
- Feature extraction: 2.42 minutes (all train and test images)
- Training: 23.82 seconds
- Prediction: 0.04 seconds
I would like to thank Prof. Murray Loew and Dr. Elliot Levy for their guidance and mentorship throughout this research project.