$\mathbf{x}$, $\mathbf{y}$ : an input image, its one-hot encoded ground truth label
$\{\Theta_l (\cdot)\}^L_{l=1}$ : successive $L$ convolution blocks
$\mathbf{X}^l \in \mathbb{R}^{C_l \times H_l \times W_l}$ : the intermediate feature map output by block $\Theta_l$
$\hat{\mathbf{y}} \in [0,1]^K$ : the final normalized output logits
conventional GAP-FC based output layer $O_\text{GAP-FC} (\cdot)$
$$ \hat{\mathbf{y}} = O_\text{GAP-FC} (\mathbf{X}^L) = \text{softmax}((\bar{\mathbf{x}}^L_\text{GAP})^T \mathbf{W}^{FC}) \tag{1} $$
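Eq. (1) can be traced by hand on a toy input. The sketch below is a minimal pure-Python illustration, not the authors' implementation. It assumes that $\bar{\mathbf{x}}^L_\text{GAP}$ is the per-channel spatial mean of $\mathbf{X}^L$ and that $\mathbf{W}^{FC}$ has shape $C_L \times K$ with no bias term. The names `gap_fc`, `X_L`, and `W_FC` are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a flat list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def gap_fc(feature_map, w_fc):
    """GAP-FC output layer as in Eq. (1).

    feature_map: C x H x W nested lists (assumed layout of X^L).
    w_fc:        C x K nested lists (assumed shape of W^FC, no bias).
    """
    # Global average pooling: one spatial mean per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    num_classes = len(w_fc[0])
    # Fully connected layer: pooled^T W^FC gives one score per class.
    scores = [
        sum(pooled[c] * w_fc[c][k] for c in range(len(pooled)))
        for k in range(num_classes)
    ]
    return softmax(scores)


# Toy input: C = 2 channels, H = W = 2, K = 3 classes.
X_L = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
W_FC = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
y_hat = gap_fc(X_L, W_FC)
```

Because pooling happens before the classifier, every spatial position contributes equally to `y_hat` regardless of its content.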
Spatially Attentive Output Layer (SAOL, $O_\text{SAOL} (\cdot)$)
SAOL separately generates a spatial attention map ($\mathbf{A} \in [0,1]^{H_o \times W_o}$) and spatial logits ($\mathbf{Y} \in [0,1]^{K \times H_o \times W_o}$), where $H_o = H_L$ and $W_o = W_L$.
$$ \hat{\mathbf{y}}_k = O_{\text{SAOL},k}(\mathbf{X}^L) = \sum_{i,j} \mathbf{A}_{ij}(\mathbf{Y}_k)_{ij},\ \forall k \tag{2} $$
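The attention-weighted sum in Eq. (2) can be sketched in a few lines of pure Python. The normalization used here is an assumption that these notes do not state: $\mathbf{A}$ is obtained by a softmax over all spatial positions and $\mathbf{Y}$ by a softmax over classes at each position, which keeps both in $[0,1]$ and makes $\hat{\mathbf{y}}$ sum to 1. How the raw scores are produced from the feature maps is left out, and the names `saol_output`, `attn_scores`, and `class_scores` are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a flat list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def saol_output(attn_scores, class_scores):
    """Attention-weighted aggregation of spatial logits, as in Eq. (2).

    attn_scores:  H x W raw attention scores (hypothetical input).
    class_scores: K x H x W raw class scores (hypothetical input).
    """
    H, W = len(attn_scores), len(attn_scores[0])
    K = len(class_scores)

    # A: softmax over all H*W positions (assumed normalization).
    flat = softmax([attn_scores[i][j] for i in range(H) for j in range(W)])
    A = [flat[i * W:(i + 1) * W] for i in range(H)]

    # Y: softmax over the K classes at each position (assumed normalization).
    Y = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for i in range(H):
        for j in range(W):
            probs = softmax([class_scores[k][i][j] for k in range(K)])
            for k in range(K):
                Y[k][i][j] = probs[k]

    # Eq. (2): y_hat_k = sum over (i, j) of A_ij * (Y_k)_ij.
    return [
        sum(A[i][j] * Y[k][i][j] for i in range(H) for j in range(W))
        for k in range(K)
    ]


# Toy input: H = W = 2, K = 2. Attention favors the top row,
# where class 0 has the higher score.
attn_scores = [[3.0, 3.0], [0.0, 0.0]]
class_scores = [
    [[2.0, 2.0], [0.0, 0.0]],  # class 0
    [[0.0, 0.0], [2.0, 2.0]],  # class 1
]
y_hat = saol_output(attn_scores, class_scores)
```

Unlike GAP-FC, positions with low attention contribute little to the final prediction, so here class 0 receives the larger score even though each class dominates exactly half of the positions.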