This project was done as part of the course CS:736 Medical Image Computing at IIT Bombay. It reproduces results from the paper *HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification*.
Eight different configurations were tried and tested, covering architecture changes, loss modifications, and empirical analysis.
The experiments primarily focus on understanding the behavior of hybrid CNN + Transformer distillation under limited data settings.
For more details, please refer to the following presentation: Presentation link

The eight configurations explored:
- Distillation weighting
- Feature distillation
- Weighted loss function
- Reverse + Forward KL divergence
- Transformer teacher
- Multi-distill tokens
- Swin Transformer replacement
- Cross-dataset distillation
The original HDKD framework proposes:
- A CNN teacher
- A Hybrid CNN + Transformer student
- Shared CNN blocks between teacher and student
- A combination of (see the loss sketch below):
  - Logit Distillation
  - Feature Distillation
The student learns:
- Local representations through CNN blocks
- Global representations through the Transformer block
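The following is a minimal sketch of how such a combined objective could look, assuming temperature-scaled KL for the logit term and MSE for the feature term; `alpha`, `beta`, and `T` are illustrative hyperparameters, not the paper's exact settings:

```python
import torch.nn.functional as F

def hdkd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
              targets, alpha=0.5, beta=0.5, T=4.0):
    """Combined supervised + logit-KD + feature-KD objective (sketch).

    alpha, beta, and T are illustrative, not the paper's exact values.
    """
    # Supervised cross-entropy against ground-truth labels
    ce = F.cross_entropy(student_logits, targets)
    # Logit distillation: KL between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Feature distillation: MSE between (spatially aligned) feature maps
    fd = F.mse_loss(student_feat, teacher_feat)
    return ce + alpha * kd + beta * fd
```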
Dataset details for the main experiments (the cross-dataset experiments additionally use BCN20000, as described below):
- Number of classes: 7
A study of weighting combinations between the CLS-token and distill-token logits.
- Best performance was obtained with a distill-token weight of 0.7
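A minimal sketch of how the two heads can be fused at inference time; `fuse_heads` is a hypothetical helper name, with the 0.7 weight taken from the best setting above:

```python
import torch

def fuse_heads(cls_logits: torch.Tensor, distill_logits: torch.Tensor,
               w_dist: float = 0.7) -> torch.Tensor:
    # Convex combination of the CLS-token and distill-token logits;
    # w_dist = 0.7 is the best-performing weight reported above.
    return (1.0 - w_dist) * cls_logits + w_dist * distill_logits
```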
Data-efficiency experiments were performed with training subsets of the following sizes (a subsampling sketch follows the list):
- 350 samples
- 700 samples
- 2833 samples
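One plausible way to draw such class-balanced subsets (a hypothetical helper; the actual sampling strategy may have differed):

```python
import numpy as np
from torch.utils.data import Subset

def stratified_subset(dataset, labels, n_samples, seed=0):
    """Draw a roughly class-balanced subset of n_samples images.

    Hypothetical helper; `labels` is an array of integer class ids
    aligned with `dataset`.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    per_class = n_samples // len(classes)
    idx = np.concatenate([
        rng.choice(np.where(labels == c)[0], size=per_class, replace=False)
        for c in classes
    ])
    return Subset(dataset, idx.tolist())
```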
Feature maps were visualized every 25 epochs to observe:
- Teacher–student alignment
- Reduction in feature MSE over training
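A sketch of how the teacher-student feature MSE could be tracked, assuming both models expose a `forward_features` method that returns spatially aligned maps (an assumption about the model API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_mse(student, teacher, loader, device="cuda"):
    """Average teacher-student feature MSE over a data loader."""
    student.eval()
    teacher.eval()
    total, n = 0.0, 0
    for images, _ in loader:
        images = images.to(device)
        s_feat = student.forward_features(images)
        t_feat = teacher.forward_features(images)
        total += F.mse_loss(s_feat, t_feat).item() * images.size(0)
        n += images.size(0)
    return total / n
```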
Key observations:
- Distillation consistently improves student performance
- Student features progressively align with teacher features
Instead of using SMOTE for imbalance handling:
- Weighted Cross Entropy Loss was used
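A common way to construct such a loss is inverse-frequency weighting; this is an assumption, as the exact weighting scheme is not specified here:

```python
import torch
import torch.nn as nn

def weighted_ce(class_counts: torch.Tensor) -> nn.CrossEntropyLoss:
    # Inverse-frequency weighting, normalized so the weights average to 1;
    # one common scheme, and an assumption about the exact recipe used here.
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    return nn.CrossEntropyLoss(weight=weights)
```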
Findings:
- Weighted loss outperformed SMOTE
- Teacher accuracy improved by ~3.5%
A weighted combination of (sketched below):
- Forward KL (mean-seeking)
- Reverse KL (mode-seeking)
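A sketch of the combined objective on temperature-softened distributions; `lam` and `T` are illustrative hyperparameters:

```python
import torch.nn.functional as F

def fwd_rev_kl(student_logits, teacher_logits, lam=0.5, T=4.0):
    """Weighted forward + reverse KL on temperature-softened logits."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=1)
    # Forward KL(teacher || student): mean-seeking
    fwd = F.kl_div(log_p_s, log_p_t.exp(), reduction="batchmean")
    # Reverse KL(student || teacher): mode-seeking
    rev = F.kl_div(log_p_t, log_p_s.exp(), reduction="batchmean")
    return (lam * fwd + (1.0 - lam) * rev) * (T * T)
```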
Findings:
- Slight variations in performance
- No significant improvement over standard KD
The teacher was modified to include Transformer blocks.
- The Transformer teacher performed worse than the CNN teacher
- Likely due to:
  - Increased parameter count
  - Overfitting under limited data
Additional distill tokens were added to learn (see the sketch after this list):
- Stage-2 features
- Stage-3 features
- Teacher logits
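A structural sketch of how the extra tokens could be prepended to the token sequence; the class name, token count, and layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiDistillTokens(nn.Module):
    """Prepend a CLS token and several distill tokens to the patch sequence.

    Structural sketch only: the class name, token count, and sequence
    layout are illustrative assumptions, not the exact implementation.
    """
    def __init__(self, embed_dim: int, n_distill: int = 3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.distill_tokens = nn.Parameter(torch.zeros(1, n_distill, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.distill_tokens.expand(b, -1, -1)
        # Sequence layout: [CLS, distill_1, distill_2, distill_3, patches]
        return torch.cat([cls, dist, patch_tokens], dim=1)
```

Each distill token would then get its own head and loss, e.g. regression against pooled stage-2/stage-3 features and KD against the teacher logits.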
Findings:
- Combining Feature Distillation, Logit Distillation, and Token Distillation produced the best results
The DFLT block was replaced with a Swin-style hierarchical transformer.
- Swin performs comparably or slightly better in some settings
- Better handling of locality and hierarchical features
Teacher:
- Pre-trained / fine-tuned on BCN20000
Student:
- Evaluated under limited-data setup
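A minimal sketch of the teacher-side setup, assuming a standard PyTorch checkpoint; the helper name and path handling are placeholders:

```python
import torch
import torch.nn as nn

def load_frozen_teacher(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Load a teacher checkpoint (e.g. fine-tuned on BCN20000) and freeze it.

    The checkpoint path and loading details are placeholders for the
    actual training setup.
    """
    model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # only the student is updated during KD
    return model
```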
Findings:
- Distillation from the fine-tuned teacher performs better
- The student often outperforms the teacher
- Distillation acts as a regularizer and improves generalization