
Spatial sensitivity
the ability to discriminate spatially close pixels, needed for accurate prediction in boundary areas where labels change.
Spatial smoothness
encourages spatially close pixels to be similar, which can aid prediction in areas that belong to the same label.
Spatial sensitivity
Sample two augmented views from the same image; both views are resized to a fixed resolution (e.g., 224 × 224).
Pass the two views through a regular encoder network (backbone and projection head) and a momentum encoder network.
Warp each feature-map pixel back to the original image space and compute the distances between all pixel pairs across the two views, normalized to compensate for the scaling introduced by augmentation; d ≤ threshold ⇒ positive pair; d > threshold ⇒ negative pair.
Apply a contrastive loss (negative log-softmax over temperature-scaled cosine similarities of the positive and negative pairs):
$$ \mathcal{L}_{\text{Pix}}(i) = -\log \frac{\sum_{j \in \Omega_{p}^{i}} e^{\cos\left(\mathbf{x}_{i}, \mathbf{x}_{j}^{\prime}\right) / \tau}}{\sum_{j \in \Omega_{p}^{i}} e^{\cos\left(\mathbf{x}_{i}, \mathbf{x}_{j}^{\prime}\right) / \tau} + \sum_{k \in \Omega_{n}^{i}} e^{\cos\left(\mathbf{x}_{i}, \mathbf{x}_{k}^{\prime}\right) / \tau}} $$
The loss is averaged over all pixels on the first view that lie in the intersection of the two views; the loss for pixels on the second view is computed and averaged in the same way.
The final loss is the average over all image pairs in a mini-batch.
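The pair assignment and per-pixel loss described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names (`split_pairs`, `pix_contrast_loss`, `view_loss`), the `bin_diag` normalizer, and the default `threshold` and `tau` values are assumptions introduced here, and features are plain lists of floats rather than tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (sequences of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_pairs(pos_i, coords_view2, bin_diag, threshold=0.7):
    """Split view-2 pixel indices into positives and negatives for one view-1 pixel.

    pos_i and coords_view2 are (x, y) positions already warped to the original
    image space. bin_diag is the length used to normalize distances (assumed
    here to be the diagonal of one feature-map bin); threshold is illustrative.
    """
    positives, negatives = [], []
    for j, (x, y) in enumerate(coords_view2):
        d = math.hypot(x - pos_i[0], y - pos_i[1]) / bin_diag
        if d <= threshold:
            positives.append(j)
        else:
            negatives.append(j)
    return positives, negatives

def pix_contrast_loss(x_i, feats_view2, positives, negatives, tau=0.3):
    """L_Pix(i): negative log of the positive share of exp(cos / tau) terms."""
    pos = sum(math.exp(cosine(x_i, feats_view2[j]) / tau) for j in positives)
    neg = sum(math.exp(cosine(x_i, feats_view2[k]) / tau) for k in negatives)
    return -math.log(pos / (pos + neg))

def view_loss(feats1, coords1, feats2, coords2, bin_diag, threshold=0.7, tau=0.3):
    """Average L_Pix over view-1 pixels that have at least one positive partner.

    Assumption: a pixel with no positive partner is treated as lying outside
    the intersection of the two views and is skipped.
    """
    losses = []
    for x_i, pos_i in zip(feats1, coords1):
        positives, negatives = split_pairs(pos_i, coords2, bin_diag, threshold)
        if positives:
            losses.append(pix_contrast_loss(x_i, feats2, positives, negatives, tau))
    return sum(losses) / len(losses) if losses else 0.0
```

Calling `view_loss` once with the views in each order and averaging the two results would give the symmetric per-image loss; averaging that over all image pairs gives the mini-batch loss.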
Spatial sensitivity & Spatial smoothness
For each pixel feature ${\bf x}_i$, the pixel propagation module computes its smoothed transform ${\bf y}_i$ by propagating features from all pixels ${\bf x}_j$ within the same image $\Omega$ as