Figure 1.
After resizing both augmented views to the same resolution, they are fed into a regular encoder network and a momentum encoder network, respectively.
Feature maps are computed so that a pixel-level pretext task can be applied → feature transfer!
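A minimal PyTorch sketch (not the authors' code) of this two-branch setup: both views are resized to the same resolution, one is passed through a regular encoder trained by backprop and the other through a momentum (EMA) encoder. The class name, momentum value, input size, and fully convolutional backbone interface are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    """Hypothetical two-branch wrapper: regular encoder + momentum (EMA) encoder."""

    def __init__(self, backbone: nn.Module, momentum: float = 0.99):
        super().__init__()
        self.encoder = backbone                            # regular encoder (trained by backprop)
        self.momentum_encoder = copy.deepcopy(backbone)    # momentum encoder (EMA copy, no gradients)
        for p in self.momentum_encoder.parameters():
            p.requires_grad = False
        self.m = momentum

    @torch.no_grad()
    def update_momentum_encoder(self):
        # EMA update: theta_m <- m * theta_m + (1 - m) * theta
        for p, p_m in zip(self.encoder.parameters(), self.momentum_encoder.parameters()):
            p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    def forward(self, view1: torch.Tensor, view2: torch.Tensor):
        # Resize both augmented views to the same resolution (224x224 assumed here).
        view1 = F.interpolate(view1, size=(224, 224), mode='bilinear', align_corners=False)
        view2 = F.interpolate(view2, size=(224, 224), mode='bilinear', align_corners=False)
        feat1 = self.encoder(view1)                        # (B, C, H, W) pixel-level feature map
        with torch.no_grad():
            feat2 = self.momentum_encoder(view2)           # momentum branch, no gradients
        return feat1, feat2
```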
pixel contrast pretext tasks for representation learning
Each pixel of the feature maps is warped back to the original image space.
Distances are computed between all pixel pairs across the two feature maps (presumably Euclidean?).
The distances are normalized by the diagonal length of each feature map.
Positive and negative pairs are determined with a threshold $\mathcal{T}$:
$$ A(i,j) = \begin{cases} 1, &\text{if dist}(i,j) \leq \mathcal{T}, \\ 0, &\text{if dist}(i,j) > \mathcal{T}, \end{cases} \tag{1} $$
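A minimal PyTorch sketch of Eq. (1), assuming pixel coordinates that have already been warped to original-image space; the default threshold value of 0.7 and the argument shapes are assumptions for illustration.

```python
import torch

def assignment_mask(coords1: torch.Tensor,
                    coords2: torch.Tensor,
                    diag_len: float,
                    T: float = 0.7) -> torch.Tensor:
    """A(i, j) = 1 if the normalized distance between pixel i (view 1) and pixel j (view 2) <= T.

    coords1: (N, 2) pixel coordinates of view 1 in original image space
    coords2: (M, 2) pixel coordinates of view 2 in original image space
    diag_len: diagonal length used for normalization
    """
    dist = torch.cdist(coords1, coords2, p=2)   # pairwise Euclidean distances, (N, M)
    dist = dist / diag_len                      # normalize by the diagonal length
    return (dist <= T).float()                  # 1 for positive pairs, 0 for negative pairs
```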
a contrastive loss for representation learning
$$ \mathcal{L}_\text{Pix}(i) = -\log\frac{\sum_{j \in \Omega^i_p} e^{\text{cos}(\mathbf{x}_i,\mathbf{x}_j')/\tau}}{\sum_{j \in \Omega^i_p} e^{\text{cos}(\mathbf{x}_i, \mathbf{x}_j')/\tau} + \sum_{k \in \Omega^i_n} e^{\text{cos}(\mathbf{x}_i, \mathbf{x}_k')/\tau}} \tag{2} $$
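A minimal PyTorch sketch of this pixel-level contrastive loss, assuming flattened pixel features from the two branches and the assignment mask from Eq. (1); the temperature value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def pix_contrast_loss(feat1: torch.Tensor,   # (N, C) pixel features, regular branch
                      feat2: torch.Tensor,   # (M, C) pixel features, momentum branch
                      A: torch.Tensor,       # (N, M) assignment mask from Eq. (1)
                      tau: float = 0.3) -> torch.Tensor:
    # Cosine similarity between every pixel pair across the two views: (N, M)
    sim = F.normalize(feat1, dim=1) @ F.normalize(feat2, dim=1).t()
    logits = torch.exp(sim / tau)
    pos = (logits * A).sum(dim=1)            # sum over positive pairs Omega_p^i
    neg = (logits * (1.0 - A)).sum(dim=1)    # sum over negative pairs Omega_n^i
    valid = A.sum(dim=1) > 0                 # skip pixels with no positive pair
    loss = -torch.log(pos[valid] / (pos[valid] + neg[valid]))
    return loss.mean()
```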
Both spatial sensitivity and spatial smoothness affect transfer performance.
The method incorporates two important components:
→ Without negative pairs, consistency is encouraged between the features of the two branches!
→ pixel-to-propagation consistency (PPC)
The pixel propagation module computes a smoothed transform $\mathbf{y}_i$ of a pixel feature $\mathbf{x}_i$ by propagating the other pixel features $\mathbf{x}_j$ within the same image $\Omega$:
$$ \mathbf{y}_i = \sum_{j \in \Omega} s(\mathbf{x}_i,\mathbf{x}_j) \cdot g(\mathbf{x}_j), \tag{3} $$
$s(\cdot, \cdot)$ : a similarity function
$$ s(\mathbf{x}_i, \mathbf{x}_j) = (\text{max}(\text{cos}(\mathbf{x}_i, \mathbf{x}_j),0))^\gamma, \tag{4} $$
$g(\cdot)$ : a transformation function that can be instantiated by $l$ linear layers with a batch normalization and a ReLU layer between two successive layers
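A minimal PyTorch sketch of the pixel propagation module defined by Eqs. (3)–(4); instantiating $g(\cdot)$ as a single 1×1 convolution and setting $\gamma = 2$ are choices made here for illustration, not the paper's fixed configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPropagation(nn.Module):
    """Smooths each pixel feature with a similarity-weighted sum over all pixels of the same map."""

    def __init__(self, dim: int, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma
        # g(.): a single linear layer (1x1 conv) here; more layers would interleave
        # BatchNorm and ReLU between successive linear layers.
        self.g = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) pixel features of one view
        B, C, H, W = x.shape
        feat = x.flatten(2)                        # (B, C, HW)
        feat_norm = F.normalize(feat, dim=1)
        # s(x_i, x_j) = max(cos(x_i, x_j), 0) ** gamma  -> (B, HW, HW)
        sim = torch.bmm(feat_norm.transpose(1, 2), feat_norm).clamp(min=0).pow(self.gamma)
        g_x = self.g(x).flatten(2)                 # (B, C, HW)
        # y_i = sum_j s(x_i, x_j) * g(x_j)
        y = torch.bmm(g_x, sim.transpose(1, 2))    # (B, C, HW)
        return y.view(B, C, H, W)
```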
a regular encoder with the pixel propagation module applied afterwards to produce smoothed features
a momentum encoder without the propagation module
Consistency is computed between the features of the two augmented views passed through the two encoders:
$$ \mathcal{L}_\text{PixPro} = -\text{cos}(\mathbf{y}_i, \mathbf{x}_j') - \text{cos}(\mathbf{y}_j, \mathbf{x}_i'), \tag{5} $$
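A minimal PyTorch sketch of this symmetric consistency loss, averaged over the positive pixel pairs from Eq. (1); the function signature and mask-based averaging are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pixpro_loss(y1, x2, y2, x1, A12, A21):
    # y1, y2: smoothed features (after the propagation module) from the regular branch, (N, C) / (M, C)
    # x1, x2: plain features from the momentum branch for views 1 and 2
    # A12 (N, M), A21 (M, N): positive-pair assignment masks from Eq. (1)
    cos12 = F.normalize(y1, dim=1) @ F.normalize(x2, dim=1).t()   # cos(y_i, x_j'), (N, M)
    cos21 = F.normalize(y2, dim=1) @ F.normalize(x1, dim=1).t()   # cos(y_j, x_i'), (M, N)
    # Negative cosine similarity, averaged over positive pairs in both directions
    loss = -(cos12 * A12).sum() / A12.sum().clamp(min=1)
    loss = loss - (cos21 * A21).sum() / A21.sum().clamp(min=1)
    return loss
```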