1. Introduction

Many studies modify the conv blocks, yet the use of a GAP + FC layer to produce the classification logits has remained unchanged
- Some studies managed to preserve localization ability, but the classification logits themselves benefit little from localization
localized class-specific responses
- Inspecting a CNN's decision making through visual explanation
- Spatial attention mechanisms that focus on regions semantically related to the label
- Auxiliary self-supervised losses or tasks based on spatial transformations, which strengthen generalization ability

2. Related Work

$\mathbf{x}$, $\mathbf{y}$ : an input image, its one-hot encoded ground truth label
$\{\Theta_l (\cdot)\}^L_{l=1}$ : successive $L$ convolution blocks
$\mathbf{X}^l \in \mathbb{R}^{C_l \times H_l \times W_l}$ : intermediate feature maps
$\hat{\mathbf{y}} \in [0,1]^K$ : the final normalized output logits
conventional GAP-FC based output layer $O_\text{GAP-FC} (\cdot)$

$$ \hat{\mathbf{y}} = O_\text{GAP-FC} (\mathbf{X}^L) = \text{softmax}((\bar{\mathbf{x}}^L_\text{GAP})^T \mathbf{W}^{FC}) \tag{1} $$
- $\bar{\mathbf{x}}^L_\text{GAP} \in \mathbb{R}^{C_L \times 1}$ : the spatially aggregated feature vector by GAP
- $\mathbf{W}^{FC} \in \mathbb{R}^{C_L \times K}$ : the weight matrix of the output FC layer
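
As a rough illustration of Eq. (1), the GAP-FC output layer can be sketched in NumPy. The function names (`softmax`, `gap_fc_output`) and the toy shapes are assumptions made for this sketch, not taken from the source.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gap_fc_output(X_L, W_fc):
    """Eq. (1): X_L has shape (C_L, H_L, W_L), W_fc has shape (C_L, K)."""
    x_gap = X_L.mean(axis=(1, 2))   # global average pooling -> (C_L,)
    return softmax(x_gap @ W_fc)    # normalized class scores -> (K,)

# Toy example with hypothetical sizes: C_L = 8, H_L = W_L = 4, K = 5.
rng = np.random.default_rng(0)
X_L = rng.standard_normal((8, 4, 4))
W_fc = rng.standard_normal((8, 5))
y_hat = gap_fc_output(X_L, W_fc)
```

Because GAP averages over all positions before the FC layer, every spatial location contributes equally to `y_hat`.
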
Spatially Attentive Output Layer (SAOL, $O_\text{SAOL} (\cdot)$)
- Generates a Spatial Attention Map ($\mathbf{A} \in [0,1]^{H_o \times W_o}$) and Spatial Logits ($\mathbf{Y} \in [0,1]^{K \times H_o \times W_o}$) separately ($H_o = H_L$, $W_o = W_L$)
  - The values of both are normalized with softmax
    - $\sum_{i,j} \mathbf{A}_{ij} = 1$
    - $\sum_k (\mathbf{Y}_k)_{ij} = 1,\ \forall i,j$
  $$ \hat{\mathbf{y}}_k = O_{\text{SAOL},k}(\mathbf{X}^L) = \sum_{i,j} \mathbf{A}_{ij}(\mathbf{Y}_k)_{ij},\ \forall k \tag{2} $$
  - $\hat{\mathbf{y}}_k$ : the output logit of the $k^{th}$ class
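
The aggregation in Eq. (2) is an attention-weighted sum of the spatial logits over all positions. A minimal NumPy sketch, assuming `A` and `Y` are already softmax-normalized as described above (`saol_output` and the toy sizes are hypothetical):

```python
import numpy as np

def saol_output(A, Y):
    """Eq. (2): A is (H_o, W_o) and sums to 1 over positions;
    Y is (K, H_o, W_o) and sums to 1 over classes at each position."""
    return np.einsum("ij,kij->k", A, Y)   # weighted sum over (i, j) -> (K,)

# Toy example with hypothetical sizes: K = 3, H_o = W_o = 2.
rng = np.random.default_rng(0)
K, H, W = 3, 2, 2
a_raw = rng.standard_normal((H, W))
A = np.exp(a_raw) / np.exp(a_raw).sum()                       # softmax over positions
y_raw = rng.standard_normal((K, H, W))
Y = np.exp(y_raw) / np.exp(y_raw).sum(axis=0, keepdims=True)  # softmax over classes
y_hat = saol_output(A, Y)
```

Since `A` sums to 1 and each position of `Y` sums to 1 over classes, `y_hat` is a convex combination of per-position class distributions and therefore also sums to 1.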

Spatial Attention Map $\mathbf{A}$
- The last conv feature map $\mathbf{X}^L$ is fed through two conv layers followed by softmax
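
The attention branch can be sketched as follows. The notes only say "two conv layers + softmax", so the 1x1 kernels, the ReLU between the two convs, and the hidden width are all assumptions of this sketch; `conv1x1` and `spatial_attention_map` are hypothetical helper names.

```python
import numpy as np

def conv1x1(X, W):
    """1x1 convolution: X (C_in, H, W), W (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", W, X)

def spatial_attention_map(X_L, W1, W2):
    """Two convs followed by a softmax over all spatial positions."""
    h = np.maximum(conv1x1(X_L, W1), 0.0)   # first conv + ReLU (ReLU is assumed)
    s = conv1x1(h, W2)[0]                   # second conv to one channel -> (H, W)
    s = s - s.max()                         # stabilize the exponent
    e = np.exp(s)
    return e / e.sum()                      # entries sum to 1 over (i, j)

# Toy example with hypothetical sizes: C_L = 8, hidden = 4, H_L = W_L = 4.
rng = np.random.default_rng(0)
X_L = rng.standard_normal((8, 4, 4))
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((1, 4))
A = spatial_attention_map(X_L, W1, W2)
```
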
Spatial Logits $\mathbf{Y}$
- Combines spatial logits of multiple scales
- resize → conv → concatenate → conv + softmax
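
The resize → conv → concatenate → conv + softmax pipeline can be sketched as below. Nearest-neighbour resizing, 1x1 kernels, and all shapes are assumptions chosen to keep the example short; the notes do not specify them. `resize_nearest`, `conv1x1`, and `spatial_logits` are hypothetical names.

```python
import numpy as np

def resize_nearest(X, H_o, W_o):
    """Nearest-neighbour resize of X (C, H, W) to (C, H_o, W_o)."""
    _, H, W = X.shape
    rows = np.arange(H_o) * H // H_o
    cols = np.arange(W_o) * W // W_o
    return X[:, rows[:, None], cols[None, :]]

def conv1x1(X, W):
    """1x1 convolution: X (C_in, H, W), W (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", W, X)

def spatial_logits(feats, per_scale_W, W_out, H_o, W_o):
    """resize -> conv (per scale) -> concatenate -> conv -> softmax over classes."""
    branches = [
        conv1x1(resize_nearest(X, H_o, W_o), W)
        for X, W in zip(feats, per_scale_W)
    ]
    Z = conv1x1(np.concatenate(branches, axis=0), W_out)   # (K, H_o, W_o)
    Z = Z - Z.max(axis=0, keepdims=True)                   # stabilize the exponent
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)                # sums to 1 over k

# Toy example: two feature maps at different resolutions, K = 5, H_o = W_o = 4.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8, 8)), rng.standard_normal((6, 4, 4))]
per_scale_W = [rng.standard_normal((3, 4)), rng.standard_normal((3, 6))]
W_out = rng.standard_normal((5, 6))
Y = spatial_logits(feats, per_scale_W, W_out, 4, 4)
```
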
This yields an interpretable attention output or the target object location