Figure 1.
After resizing both augmented views to the same resolution, they are fed into a regular encoder network and a momentum encoder network, respectively.
Feature maps are computed so that a pixel-level pretext task can be applied → feature transfer!
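A minimal PyTorch sketch (not the authors' code) of this two-branch setup: both views are resized to the same resolution, one is passed through a regular encoder trained by backprop and the other through a momentum (EMA) encoder. The class name, momentum value, input size, and fully convolutional backbone interface are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    """Hypothetical two-branch wrapper: regular encoder + momentum (EMA) encoder."""

    def __init__(self, backbone: nn.Module, momentum: float = 0.99):
        super().__init__()
        self.encoder = backbone                            # regular encoder (trained by backprop)
        self.momentum_encoder = copy.deepcopy(backbone)    # momentum encoder (EMA copy, no gradients)
        for p in self.momentum_encoder.parameters():
            p.requires_grad = False
        self.m = momentum

    @torch.no_grad()
    def update_momentum_encoder(self):
        # EMA update: theta_m <- m * theta_m + (1 - m) * theta
        for p, p_m in zip(self.encoder.parameters(), self.momentum_encoder.parameters()):
            p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    def forward(self, view1: torch.Tensor, view2: torch.Tensor):
        # Resize both augmented views to the same resolution (224x224 assumed here).
        view1 = F.interpolate(view1, size=(224, 224), mode='bilinear', align_corners=False)
        view2 = F.interpolate(view2, size=(224, 224), mode='bilinear', align_corners=False)
        feat1 = self.encoder(view1)                        # (B, C, H, W) pixel-level feature map
        with torch.no_grad():
            feat2 = self.momentum_encoder(view2)           # momentum branch, no gradients
        return feat1, feat2
```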
pixel contrast pretext tasks for representation learning
Each pixel of the feature maps is warped back to the original image space.
Distances are computed between all pixel pairs across the two feature maps (presumably Euclidean?).
The distances are normalized by the diagonal length of each feature map.
Positive and negative pairs are determined with a threshold $\mathcal{T}$:
$$ A(i,j) = \begin{cases} 1, &\text{if dist}(i,j) \leq \mathcal{T}, \\ 0, &\text{if dist}(i,j) > \mathcal{T}, \end{cases} \tag{1} $$
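A minimal PyTorch sketch of Eq. (1), assuming pixel coordinates that have already been warped to original-image space; the default threshold value of 0.7 and the argument shapes are assumptions for illustration.

```python
import torch

def assignment_mask(coords1: torch.Tensor,
                    coords2: torch.Tensor,
                    diag_len: float,
                    T: float = 0.7) -> torch.Tensor:
    """A(i, j) = 1 if the normalized distance between pixel i (view 1) and pixel j (view 2) <= T.

    coords1: (N, 2) pixel coordinates of view 1 in original image space
    coords2: (M, 2) pixel coordinates of view 2 in original image space
    diag_len: diagonal length used for normalization
    """
    dist = torch.cdist(coords1, coords2, p=2)   # pairwise Euclidean distances, (N, M)
    dist = dist / diag_len                      # normalize by the diagonal length
    return (dist <= T).float()                  # 1 for positive pairs, 0 for negative pairs
```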
a contrastive loss for representation learning
$$ \mathcal{L}_\text{Pix}(i) = -\log\frac{\sum_{j \in \Omega^i_p} e^{\text{cos}(\mathbf{x}_i,\mathbf{x}_j')/\tau}}{\sum_{j \in \Omega^i_p} e^{\text{cos}(\mathbf{x}_i, \mathbf{x}_j')/\tau} + \sum_{k \in \Omega^i_n} e^{\text{cos}(\mathbf{x}_i, \mathbf{x}_k')/\tau}} \tag{2} $$
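A minimal PyTorch sketch of this pixel-level contrastive loss, assuming flattened pixel features from the two branches and the assignment mask from Eq. (1); the temperature value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def pix_contrast_loss(feat1: torch.Tensor,   # (N, C) pixel features, regular branch
                      feat2: torch.Tensor,   # (M, C) pixel features, momentum branch
                      A: torch.Tensor,       # (N, M) assignment mask from Eq. (1)
                      tau: float = 0.3) -> torch.Tensor:
    # Cosine similarity between every pixel pair across the two views: (N, M)
    sim = F.normalize(feat1, dim=1) @ F.normalize(feat2, dim=1).t()
    logits = torch.exp(sim / tau)
    pos = (logits * A).sum(dim=1)            # sum over positive pairs Omega_p^i
    neg = (logits * (1.0 - A)).sum(dim=1)    # sum over negative pairs Omega_n^i
    valid = A.sum(dim=1) > 0                 # skip pixels with no positive pair
    loss = -torch.log(pos[valid] / (pos[valid] + neg[valid]))
    return loss.mean()
```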
Both spatial sensitivity and spatial smoothness affect transfer performance.
The method incorporates two important components:
→ Without negative pairs, consistency is encouraged between the features of the two branches!
→ pixel-to-propagation consistency (PPC)
The pixel propagation module computes a smoothed transform $\mathbf{y}_i$ of a pixel feature $\mathbf{x}_i$ by propagating the other pixel features $\mathbf{x}_j$ within the same image $\Omega$:
$$ \mathbf{y}_i = \sum_{j \in \Omega} s(\mathbf{x}_i,\mathbf{x}_j) \cdot g(\mathbf{x}_j), \tag{3} $$
$s(\cdot, \cdot)$ : a similarity function
$$ s(\mathbf{x}_i, \mathbf{x}_j) = (\text{max}(\text{cos}(\mathbf{x}_i, \mathbf{x}_j),0))^\gamma, \tag{4} $$
$g(\cdot)$ : a transformation function that can be instantiated by $l$ linear layers with a batch normalization and a ReLU layer between two successive layers
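A minimal PyTorch sketch of the pixel propagation module defined by Eqs. (3)–(4); instantiating $g(\cdot)$ as a single 1×1 convolution and setting $\gamma = 2$ are choices made here for illustration, not the paper's fixed configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPropagation(nn.Module):
    """Smooths each pixel feature with a similarity-weighted sum over all pixels of the same map."""

    def __init__(self, dim: int, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma
        # g(.): a single linear layer (1x1 conv) here; more layers would interleave
        # BatchNorm and ReLU between successive linear layers.
        self.g = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) pixel features of one view
        B, C, H, W = x.shape
        feat = x.flatten(2)                        # (B, C, HW)
        feat_norm = F.normalize(feat, dim=1)
        # s(x_i, x_j) = max(cos(x_i, x_j), 0) ** gamma  -> (B, HW, HW)
        sim = torch.bmm(feat_norm.transpose(1, 2), feat_norm).clamp(min=0).pow(self.gamma)
        g_x = self.g(x).flatten(2)                 # (B, C, HW)
        # y_i = sum_j s(x_i, x_j) * g(x_j)
        y = torch.bmm(g_x, sim.transpose(1, 2))    # (B, C, HW)
        return y.view(B, C, H, W)
```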
a regular encoder with the pixel propagation module applied afterwards to produce smoothed features
a momentum encoder without the propagation module
Consistency is computed between the features of the two augmented views passed through the two encoders:
$$ \mathcal{L}_\text{PixPro} = -\text{cos}(\mathbf{y}_i, \mathbf{x}_j') - \text{cos}(\mathbf{y}_j, \mathbf{x}_i'), \tag{5} $$
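A minimal PyTorch sketch of this symmetric consistency loss, averaged over the positive pixel pairs from Eq. (1); the function signature and mask-based averaging are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pixpro_loss(y1, x2, y2, x1, A12, A21):
    # y1, y2: smoothed features (after the propagation module) from the regular branch, (N, C) / (M, C)
    # x1, x2: plain features from the momentum branch for views 1 and 2
    # A12 (N, M), A21 (M, N): positive-pair assignment masks from Eq. (1)
    cos12 = F.normalize(y1, dim=1) @ F.normalize(x2, dim=1).t()   # cos(y_i, x_j'), (N, M)
    cos21 = F.normalize(y2, dim=1) @ F.normalize(x1, dim=1).t()   # cos(y_j, x_i'), (M, N)
    # Negative cosine similarity, averaged over positive pairs in both directions
    loss = -(cos12 * A12).sum() / A12.sum().clamp(min=1)
    loss = loss - (cos21 * A21).sum() / A21.sum().clamp(min=1)
    return loss
```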