❓ Why CNN Models? Why ResNet-50 over DenseNet-121?
We chose CNN models because they have proven effective in medical image classification, especially in tasks where anatomical structures must be interpreted spatially.
Among the tested architectures, ResNet-50 provided a good balance between model complexity, training time, and interpretability, which was critical for our Grad-CAM–based explainability.
Although DenseNet-121 showed slightly higher accuracy, ResNet-50 was chosen for the final Grad-CAM visualization because:
- Its residual skip connections yield more localized, interpretable attention maps than DenseNet's dense connectivity, where features from many layers are concatenated and gradients become more diffuse.
- It converged more stably during cross-validation.
- It is widely adopted in clinical imaging research, which supports reproducibility and comparison with prior work.
❓ No augmentation. Why?
We deliberately did not apply data augmentation in this experiment because:
- The dataset is anatomically sensitive. Flipping, rotation, or shifting could disrupt spatial orientation (e.g., left vs. right cheek, superior vs. inferior).
- Our primary goal was model interpretability, not just accuracy.
- We focused on inherent spatial patterns, as learned from unaltered, real patient data.
In future work, domain-specific augmentation strategies could be explored—such as intensity variation or speckle simulation.
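As a hedged illustration of what such domain-specific augmentation might look like, the sketch below implements the two strategies named above (intensity variation and speckle simulation) in plain NumPy. It deliberately avoids geometric transforms, so left/right and superior/inferior orientation is preserved. Function names, parameter ranges, and the placeholder image are our own illustrative choices, not part of the original pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def intensity_jitter(img, low=0.9, high=1.1):
    """Scale pixel intensities by a random gain; geometry is untouched."""
    gain = rng.uniform(low, high)
    return np.clip(img * gain, 0.0, 1.0)

def speckle(img, sigma=0.1):
    """Multiplicative speckle noise (ultrasound-style simulation)."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img * (1.0 + noise), 0.0, 1.0)

# Placeholder grayscale image in [0, 1]; a real pipeline would load patient data.
img = rng.random((64, 64))
aug = speckle(intensity_jitter(img))
assert aug.shape == img.shape  # spatial layout is preserved
```

Because both transforms act pixel-wise, anatomical orientation is never disturbed, unlike flips or rotations.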
❓ Why only Grad-CAM?
We used Grad-CAM because:
- It is a widely accepted, gradient-based method for visualizing CNN attention that applies to a broad range of CNN architectures without architectural modification.
- It allows direct mapping of class-discriminative regions, which we can compare with anatomical landmarks like the Buccinator muscle or Zygomaticus region.
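The core Grad-CAM computation is compact: channel weights come from global-average-pooling the gradients of the class score with respect to a convolutional layer's activations, and the heatmap is the ReLU of the weighted sum of feature maps. Below is a minimal NumPy sketch of that computation; the array shapes and normalization step are illustrative assumptions, and a real implementation would obtain the activations and gradients via framework hooks on the chosen ResNet-50 layer.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from one conv layer's activations and gradients.

    feature_maps: (C, H, W) activations at the target layer.
    gradients:    (C, H, W) d(class score)/d(activations).
    """
    # Channel importance weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))                              # (C,)
    # Weighted sum over channels, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)
    # Normalize to [0, 1] for overlay on the input image.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: uniform activations and gradients give a uniform heatmap.
fm = np.ones((2, 4, 4))
gr = np.ones((2, 4, 4))
cam = grad_cam(fm, gr)  # → all ones after normalization
```

The resulting (H, W) map is upsampled to the input resolution and overlaid on the image, which is how the highlighted regions are compared against landmarks such as the Buccinator muscle or the Zygomaticus region.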