| marp | true |
|---|---|
| size | 58140 |
| style | img { display: block; margin: 0 auto; } /* Darker footer text (and subtle background for readability) */ footer { color: #111; font-weight: 600; background: rgba(255, 255, 255, 0.85); padding: 4px 16px; } |
The provided validation set is too small (only 16 samples).
Combine the train and validation sets, then re-split them at a ratio of 0.8:0.2.
| Dataset | Total | Normal | Pneumonia | Normal % | Pneumonia % | Ratio |
|---|---|---|---|---|---|---|
| Train | 4185 | 1079 | 3106 | 25.8% | 74.2% | 1:2.88 |
| Validation | 1047 | 270 | 777 | 25.8% | 74.2% | 1:2.88 |
| Test | 624 | 234 | 390 | 37.5% | 62.5% | 1:1.67 |
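The re-split can be sketched as below. This is a minimal illustration that assumes the split was stratified by class, which the identical class percentages in the table suggest but the slides do not state. The helper name `stratified_split` and the seed are hypothetical; the pooled counts (1349 normal, 3883 pneumonia) are the sums of the Train and Validation rows.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split indices so each class keeps the same train/validation ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Pooled train + validation counts from the table: 0 = normal, 1 = pneumonia.
labels = [0] * 1349 + [1] * 3883
train_idx, val_idx = stratified_split(labels)

# A simple index-level leakage check: no sample may appear in both splits.
overlap = set(train_idx) & set(val_idx)
print(len(train_idx), len(val_idx), len(overlap))
```

With these counts the sketch yields 4185 training and 1047 validation samples, matching the table. A real leakage check would likely also compare file hashes or patient identifiers, which this sketch does not cover.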
Leakage check passed.
The classes are imbalanced, so we oversample during training.
Most differences are in the lung regions, where pneumonia shows higher opacity.
- Base Model: ResNet-18
- Loss Function: BCEWithLogitsLoss
- Optimizer: Adam
- Learning Rate Schedule: ReduceLROnPlateau
- Early Stopping: stop training after 7 epochs without improvement.
- Oversampling: use WeightedRandomSampler for the training set.
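The oversampling idea can be illustrated without PyTorch. The sketch below assumes inverse-class-frequency weights, a common choice that the slides do not confirm; the helper name `inverse_frequency_weights` is hypothetical. In PyTorch, the same per-sample weights would typically be passed to `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Training-set counts from the table: 0 = normal, 1 = pneumonia.
labels = [0] * 1079 + [1] * 3106
weights = inverse_frequency_weights(labels)

# Draw one epoch of indices with replacement, as a weighted sampler would.
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=len(labels))
normal_share = sum(labels[i] == 0 for i in sampled) / len(sampled)
print(f"share of normal samples after oversampling: {normal_share:.3f}")
```

Because each class receives the same total weight, the sampled epoch is roughly balanced even though the underlying data is about 1:2.88.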
Run 5-fold cross-validation to pick the best augmentation hyperparameters.

| Configuration | Accuracy | Recall | Specificity | F1 Score | AUC |
|---|---|---|---|---|---|
| Baseline (jitter=0.1, No crop) | 0.8606 | 0.9949 | 0.6368 | 0.8992 | 0.9662 |
| jitter=0.3, crop=0.08-1.0 | 0.9583 | 0.9872 | 0.9103 | 0.9673 | 0.9923 |
| jitter=0.2, crop=0.5-1.0 | 0.9519 | 0.9923 | 0.8846 | 0.9627 | 0.9890 |
Key Observations:
- All configurations achieve high recall (> 0.98), detecting nearly all pneumonia cases.
- The main improvement is in specificity (normal-class detection), which rises from 0.6368 for the baseline to 0.9103 with jitter=0.3, crop=0.08-1.0.
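The metrics in the table follow from a binary confusion matrix with pneumonia as the positive class. The confusion counts below are not given in the slides; they are back-calculated under the assumption that the table reports results on the 624-image test set (234 normal, 390 pneumonia). The function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary metrics with pneumonia as the positive class."""
    recall = tp / (tp + fn)          # sensitivity: pneumonia cases found
    specificity = tn / (tn + fp)     # normal cases correctly cleared
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(accuracy, 4),
        "recall": round(recall, 4),
        "specificity": round(specificity, 4),
        "f1": round(f1, 4),
    }


# Inferred counts (assumption): baseline and the jitter=0.3, crop=0.08-1.0 run.
baseline = binary_metrics(tp=388, fn=2, tn=149, fp=85)
best = binary_metrics(tp=385, fn=5, tn=213, fp=21)
print(baseline)
print(best)
```

Under these inferred counts, the baseline misclassifies 85 of 234 normal images as pneumonia, while the stronger augmentation reduces that to 21 at the cost of three more missed pneumonia cases.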
- Compute Gradients: Calculate the gradient of the score $y^c$ for class $c$ with respect to the feature map activations $A^k$ of a convolutional layer:
  $$ \frac{\partial y^c}{\partial A^k} $$
- Global Average Pooling (Neuron Importance): These gradients are global-average-pooled to obtain the neuron importance weights $\alpha_k^c$:
  $$ \alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} $$
  This weight $\alpha_k^c$ captures the "importance" of feature map $k$ for a target class $c$.
- Weighted Combination: Perform a weighted combination of forward activation maps, followed by a ReLU:
  $$ L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right) $$
- ReLU is applied because we are only interested in features that have a positive influence on the class of interest. Negative pixels are likely to belong to other categories.
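The three Grad-CAM steps can be sketched in NumPy once the activations and gradients are available. This is a minimal illustration, not the implementation used for the slides: it assumes the activations $A^k$ and the gradients $\partial y^c / \partial A^k$ have already been extracted (for example with framework hooks) as arrays of shape `(K, H, W)`. The function name `grad_cam` and the toy values are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps into a class heatmap.

    activations, gradients: arrays of shape (K, H, W) for one image.
    """
    # Global average pooling over the spatial axes gives alpha_k^c.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum over the K feature maps, result has shape (H, W).
    cam = np.tensordot(alphas, activations, axes=1)
    # ReLU keeps only locations with a positive influence on class c.
    return np.maximum(cam, 0.0)


# Toy example with K = 2 feature maps of size 2 x 2.
A = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[1.0, 1.0], [1.0, 1.0]]])
G = np.array([[[1.0, 1.0], [1.0, 1.0]],
              [[-2.0, -2.0], [-2.0, -6.0]]])
heatmap = grad_cam(A, G)
print(heatmap)
```

Here the weights are $\alpha = (1, -3)$, so only the bottom-right location survives the ReLU. In practice the heatmap is then upsampled to the input resolution and overlaid on the X-ray.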
In this false positive example, we can see bone structures misleading the model.
- Segmentation: Use SLIC to generate superpixels.
- Perturbation: Generate 2000 samples by randomly hiding superpixels.
- Prediction: Get model predictions for all 2000 perturbed images.
- Sample Weighting: Weight samples by their similarity to the original image.
- Fitting: Fit an interpretable model (e.g., linear regression). The fit gives more weight to samples that are highly similar to the original.
- Explanation: Superpixels with the highest weights are the most important.
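The steps above can be sketched with NumPy alone. This is a simplified illustration under several assumptions: superpixels are taken as given (no SLIC call), `predict` is a hypothetical stand-in for the model evaluated on a masked image, similarity is an exponential kernel on the fraction of hidden superpixels, and the interpretable model is plain weighted least squares. The `lime` package makes its own choices for each of these.

```python
import numpy as np


def lime_weights(predict, n_features, n_samples=2000, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate over on/off superpixel masks."""
    rng = np.random.default_rng(seed)
    # Perturbation: 1 keeps a superpixel, 0 hides it.
    masks = rng.integers(0, 2, size=(n_samples, n_features))
    masks[0] = 1  # include the unperturbed image
    # Prediction: query the black-box model on every perturbed sample.
    preds = np.array([predict(m) for m in masks], dtype=float)
    # Sample weighting: fewer hidden superpixels means closer to the original.
    distance = 1.0 - masks.mean(axis=1)
    sample_w = np.exp(-(distance ** 2) / kernel_width ** 2)
    # Fitting: weighted least squares with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    sw = np.sqrt(sample_w)[:, None]
    coef, *_ = np.linalg.lstsq(X * sw, preds * sw[:, 0], rcond=None)
    return coef[:-1]  # drop the intercept, keep one weight per superpixel


# Toy black box: superpixel 2 matters most, superpixel 0 a little.
def toy_model(mask):
    return 0.1 + 0.6 * mask[2] + 0.2 * mask[0]


weights = lime_weights(toy_model, n_features=5)
top_superpixel = int(np.argmax(weights))
print(np.round(weights, 3), top_superpixel)
```

Because the toy model is itself linear, the surrogate recovers its coefficients, and superpixel 2 receives the largest weight. For a real network the surrogate is only a local approximation.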
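LIME explains a prediction for an input $x$ by solving:
$$ \xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g) $$
Where:
- $f$: The complex black-box model (e.g., ResNet).
- $g$: The simple interpretable model (e.g., linear model $g(z) = w \cdot z$).
- $\pi_x$: The proximity measure (kernel) defining the neighborhood around $x$.
- $\Omega(g)$: Complexity penalty (e.g., number of non-zero weights).
- $\mathcal{L}$: The weighted loss function measuring how unfaithful $g$ is to $f$.

For a linear model $g(z) = w \cdot z$, the loss is a proximity-weighted squared error over the perturbed samples $z$:
$$ \mathcal{L}(f, g, \pi_x) = \sum_{z} \pi_x(z) \left( f(z) - g(z) \right)^2 $$
The optimization finds the feature weights $w$, one per superpixel.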
True Positive Example:
False Positive Example:
LIME's key features mainly focus on the lung area, and they explain true negative examples more clearly.
| Method | Output | Good for | Main risk | Seen in our results |
|---|---|---|---|---|
| Grad-CAM | Coarse heatmap | Quick localization | Can highlight confounders | FP: bone/spine attention |
| Occlusion | Mask-drop map | Bias / shortcut check | Slow + patch artifacts | Border/non-lung sensitivity |
| LIME | Superpixel weights | Lung-focused, signed clues | Depends on segmentation | TN/FP: lung focus clearer |
Definition: The explanation should concentrate on clinically relevant anatomy (primarily lung fields for pneumonia) rather than borders, markers, ribs, or background.
How to assess
- Quantitative: Compute an “inside-lung attribution ratio”
- Qualitative: A radiologist rates whether highlighted regions correspond to lung opacities/expected patterns vs artifacts (e.g., borders).
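The slides do not define the inside-lung attribution ratio. One plausible definition, assumed here, is the share of total absolute attribution that falls inside a binary lung mask. The function name `inside_lung_ratio` is hypothetical, and the lung mask would have to come from a separate segmentation step.

```python
import numpy as np


def inside_lung_ratio(attribution, lung_mask):
    """Fraction of absolute attribution mass located inside the lung mask."""
    magnitude = np.abs(attribution)
    total = magnitude.sum()
    if total == 0:
        return 0.0  # an empty explanation has no mass to locate
    return float(magnitude[lung_mask.astype(bool)].sum() / total)


# Toy 2 x 2 explanation: only the top-right pixel lies inside the lung.
attribution = np.array([[0.0, 3.0], [-1.0, 0.0]])
lung_mask = np.array([[0, 1], [0, 0]])
ratio = inside_lung_ratio(attribution, lung_mask)
print(ratio)
```

A ratio near 1 means the explanation concentrates on the lung fields; a low ratio flags attention on borders, markers, or bone. Using absolute values treats positive and negative evidence alike, which is a design choice rather than a requirement.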
Definition: If an explanation claims a region is important, then removing/occluding that region should meaningfully change the model’s output in the predicted direction.
How to assess
- Deletion test: Remove the top-$k\%$ highlighted pixels/superpixels (according to the explanation) and measure the drop in predicted probability.
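A pixel-level deletion test might look like the sketch below. It assumes removal means overwriting pixels with a constant `fill_value` (blurring or mean-filling are alternatives), and `predict` is a hypothetical callable returning the predicted probability for one image.

```python
import numpy as np


def deletion_drop(predict, image, attribution, k_percent, fill_value=0.0):
    """Probability drop after removing the top-k% most important pixels."""
    n_remove = int(round(image.size * k_percent / 100.0))
    # Flat pixel indices sorted from most to least important.
    order = np.argsort(attribution, axis=None)[::-1]
    perturbed = image.copy().ravel()
    perturbed[order[:n_remove]] = fill_value
    perturbed = perturbed.reshape(image.shape)
    return float(predict(image) - predict(perturbed))


# Toy model: the score is proportional to total brightness.
def toy_predict(img):
    return img.sum() / 10.0


image = np.array([[1.0, 2.0], [3.0, 4.0]])
faithful = image.copy()   # ranks the brightest pixel first
unfaithful = -image       # ranks the darkest pixel first
drop_faithful = deletion_drop(toy_predict, image, faithful, k_percent=25)
drop_unfaithful = deletion_drop(toy_predict, image, unfaithful, k_percent=25)
print(drop_faithful, drop_unfaithful)
```

A faithful explanation should produce a larger drop than an unfaithful or random ranking at the same $k$. Comparing against a random baseline helps separate real importance from the generic effect of masking pixels.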
Definition: Explanations should be consistent under small, clinically irrelevant changes (tiny intensity shifts, minor crops/translation) and across similar images.
How to assess
- Re-run explanations under small perturbations and compute similarity (e.g., Spearman rank correlation of pixel importance, or IoU of top-$k\%$ regions).
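The IoU variant of this stability check can be sketched as follows. The function names `top_k_mask` and `top_k_iou` are hypothetical, and the two attribution maps are assumed to be aligned to the same pixel grid (after undoing any crop or translation applied to the input).

```python
import numpy as np


def top_k_mask(attribution, k_percent):
    """Boolean mask selecting the top-k% highest-attribution pixels."""
    n_top = max(1, int(round(attribution.size * k_percent / 100.0)))
    flat = attribution.ravel()
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[::-1][:n_top]] = True
    return mask.reshape(attribution.shape)


def top_k_iou(attr_a, attr_b, k_percent):
    """Intersection over union of the top-k% regions of two explanations."""
    mask_a = top_k_mask(attr_a, k_percent)
    mask_b = top_k_mask(attr_b, k_percent)
    union = np.logical_or(mask_a, mask_b).sum()
    return float(np.logical_and(mask_a, mask_b).sum() / union)


# Explanation of the original image vs. a slightly perturbed copy (toy values).
original = np.array([[0.9, 0.8], [0.1, 0.2]])
perturbed = np.array([[0.9, 0.1], [0.8, 0.2]])
iou_same = top_k_iou(original, original, k_percent=50)
iou_shifted = top_k_iou(original, perturbed, k_percent=50)
print(iou_same, iou_shifted)
```

An IoU close to 1 across small perturbations indicates a stable explanation. In the toy case the two top-50% regions share one of three pixels, giving an IoU of 1/3.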




















