Deep High-Resolution Representation Learning for Visual Recognition

Introduction

High-resolution representations are critical for general vision tasks. It is equally important that a learned representation stays high-resolution without losing its other important properties. HRNet therefore aims for two properties in its representation: the resolution of the mask produced by a segmentation task, and the position sensitivity that matters for detection tasks.

High-resolution representation: If the segmentation mask a model produces is not clean, it cannot be used by downstream tasks, so obtaining a high-resolution representation matters a great deal. Maintaining one throughout the model pipeline is hard, however. As the network gets deeper, low-level features fade and only abstract information remains, yet it is those low-level features that carry the information determining the detail of the segmentation mask.
Position sensitivity: The key property in object detection is translation variance. That is, the region of interest must change sensitively with the position of the object.

A good segmentation model is therefore both semantically strong and spatially precise. HRNet-V2 introduces two mechanisms to satisfy both properties: parallel multi-resolution convolutions and multi-resolution fusions.

Figure 1. An example of a high-resolution network

Parallel Multi-Resolution Convolutions

Figure 2. Multi-resolution parallel convolutions

Learning features at multiple resolutions is very important for learning the context of an image, because this is how semantic information is captured.

Semantic segmentation assigns a category label to each pixel rather than a label to each instance, so it is important to learn from many instances together as context. Because instances can appear at various scales, the network learns at multiple resolutions.
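As a rough illustration of the parallel layout, the sketch below lists the feature-map shape of every branch at each stage. The 1/4-resolution starting branch, the halving of resolution and doubling of channel width per added branch, and the names `branch_shapes` and `base_channels` are assumptions based on the usual description of HRNet, not details stated in the text above.

```python
def branch_shapes(height, width, base_channels=32, num_stages=4):
    """Return, per stage, the (channels, height, width) of each parallel branch.

    Assumptions (not stated in the text above): the first branch runs at 1/4
    of the input resolution, stage s keeps s branches, and every additional
    branch halves the resolution and doubles the channel count.
    """
    stages = []
    for stage in range(1, num_stages + 1):
        branches = []
        for b in range(stage):
            scale = 4 * (2 ** b)  # 1/4, 1/8, 1/16, ... of the input
            branches.append((base_channels * 2 ** b, height // scale, width // scale))
        stages.append(branches)
    return stages


for stage_index, branches in enumerate(branch_shapes(256, 256), start=1):
    print(f"stage {stage_index}: {branches}")
```

The high-resolution branch is never dropped: every stage still contains the first entry, which is what lets the network keep fine spatial detail while the lower-resolution branches gather wider context.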

Repeated Multi-Resolution Fusions

Figure 3. Multi-resolution fusion

HRNet uses the multi-resolution fusion mechanism to achieve position sensitivity. This plays a major role in HRNet delivering competitive performance on object detection tasks as well.
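The exchange between branches can be sketched on two single-channel feature maps. This is a simplified stand-in, not the HRNet implementation: a real network would use learned layers for resampling, whereas this sketch substitutes 2x2 average pooling for downsampling and nearest-neighbor repetition for upsampling. The helper names `downsample2x`, `upsample2x`, and `fuse` are hypothetical.

```python
def downsample2x(fmap):
    """Halve resolution with 2x2 average pooling (stand-in for a learned layer).

    Assumes the height and width of fmap are even.
    """
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[y][x] + fmap[y][x + 1] + fmap[y + 1][x] + fmap[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample2x(fmap):
    """Double resolution by repeating every value twice along each axis."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2-D maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(high, low):
    """Each branch receives the other branch resampled to its own resolution."""
    fused_high = add(high, upsample2x(low))
    fused_low = add(low, downsample2x(high))
    return fused_high, fused_low


high = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
low = [[10.0, 20.0],
       [30.0, 40.0]]

fused_high, fused_low = fuse(high, low)
print(fused_high[0])  # [11.0, 12.0, 23.0, 24.0]
print(fused_low)      # [[13.5, 25.5], [41.5, 53.5]]
```

After fusion the high-resolution map carries coarse context from the low-resolution branch, and the low-resolution map carries a summary of the fine detail, while both keep their original sizes.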

Definition) To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function $f(x)$ is equivariant to a function $g$ if $f(g(x)) = g(f(x))$.
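The definition can be checked numerically on a toy case. In the sketch below, $g$ is a circular shift and $f$ is a 1-D convolution with wrap-around padding; both choices, and the names `shift` and `circular_conv`, are illustrative assumptions. A global maximum is included as a contrast, since it is invariant to the shift rather than equivariant.

```python
def shift(signal, offset):
    """g: circularly translate a 1-D signal to the right by `offset` positions."""
    n = len(signal)
    return [signal[(i - offset) % n] for i in range(n)]


def circular_conv(signal, kernel):
    """f: 1-D cross-correlation with wrap-around (circular) padding."""
    n = len(signal)
    return [sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
            for i in range(n)]


x = [0, 0, 1, 3, 1, 0, 0, 0]
kernel = [1, 2, 1]

# Equivariance: shifting then convolving equals convolving then shifting.
equivariant = circular_conv(shift(x, 2), kernel) == shift(circular_conv(x, kernel), 2)
print(equivariant)  # True

# Invariance, for contrast: the global maximum ignores the shift entirely.
invariant = max(shift(x, 2)) == max(x)
print(invariant)  # True
```

A detector needs the first behavior: when the object moves, the response should move with it instead of staying the same.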

Position sensitivity is close in meaning to equivariance: whatever transformation is applied to the pixels, the output should move by that same transformation. Applied to object detection, this means that when the pixels are translated, the position of the bounding box should shift accordingly. For example, R-FCN [1] introduced a position-sensitive RoI pooling mechanism to detect bounding boxes that are sensitive to the position of the object.

Figure 4. Visualization of R-FCN ($k \times k = 3 \times 3$) for the person category

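$$ r_c(i,j ~|~ \Theta) = \frac{1}{n} \sum_{(x,y)\in \text{bin}(i,j)} z_{i,j,c}(x+x_0, y+y_0 ~|~ \Theta) $$

Here $r_c(i,j)$ is the pooled response of the $(i,j)$-th bin for category $c$, $z_{i,j,c}$ is one of the $k^2(C+1)$ position-sensitive score maps ($C$ object categories plus background), $(x_0, y_0)$ is the top-left corner of the RoI, $n$ is the number of pixels in the bin, and $\Theta$ denotes the learnable parameters of the network.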
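A minimal pure-Python sketch of this pooling for a single category follows. It is an illustration under stated assumptions, not the R-FCN code: the function name `ps_roi_pool`, the nested-list layout `score_maps[i][j][y][x]`, the floor/ceil bin boundaries, and the toy $k = 2$ score maps are all chosen for the example. The `average` flag divides each bin's sum by its pixel count; passing `average=False` returns the raw sum.

```python
import math


def ps_roi_pool(score_maps, roi, k, average=True):
    """Position-sensitive RoI pooling for one category c.

    score_maps[i][j] is the 2-D score map z_{i,j,c}, indexed as [y][x].
    roi is (x0, y0, w, h), with (x0, y0) the top-left corner.
    Returns r, where r[i][j] is the pooled response of bin (i, j).
    """
    x0, y0, w, h = roi
    pooled = [[0.0] * k for _ in range(k)]
    for i in range(k):      # bin index along x
        for j in range(k):  # bin index along y
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            # Bin (i, j) reads only from its own score map z_{i,j,c}.
            total = sum(score_maps[i][j][y + y0][x + x0] for y in ys for x in xs)
            pooled[i][j] = total / (len(xs) * len(ys)) if average else total
    return pooled


# Toy score maps: an object covers x, y in [1, 5) of a 6x6 image, and the map
# for bin (i, j) fires only on the matching quadrant of that object.
k, size = 2, 6
score_maps = [[[[1.0 if (1 + 2 * i <= x < 3 + 2 * i and 1 + 2 * j <= y < 3 + 2 * j) else 0.0
                 for x in range(size)]
                for y in range(size)]
               for j in range(k)]
              for i in range(k)]

aligned = ps_roi_pool(score_maps, (1, 1, 4, 4), k)
shifted = ps_roi_pool(score_maps, (0, 0, 4, 4), k)
print(aligned)  # [[1.0, 1.0], [1.0, 1.0]]
print(shifted)  # [[0.25, 0.25], [0.25, 0.25]]
```

Moving the RoI by a single pixel drops every bin response from 1.0 to 0.25, which is the translation-variant behavior the pooling is designed to produce.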

In this way, when a positive response to the object is detected in most of the bins of a bounding box, the box is judged to contain the object. HRNet-V2 empirically demonstrates that it has this same translation-variant property with respect to pixel translation.
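The decision rule can be sketched as a vote over the bin responses. R-FCN votes by averaging the $k \times k$ responses. The 0.5 threshold, the helper names `vote` and `positive_bins`, and the response values below are hypothetical and serve only to illustrate the idea.

```python
def vote(pooled):
    """Average the k x k bin responses into a single score for the RoI."""
    values = [v for row in pooled for v in row]
    return sum(values) / len(values)


def positive_bins(pooled, threshold=0.5):
    """Count bins whose response exceeds a (hypothetical) threshold."""
    return sum(1 for row in pooled for v in row if v > threshold)


# Made-up 3x3 bin responses for a well-aligned box and a box shifted sideways.
aligned = [[0.9, 0.8, 0.9],
           [0.7, 0.9, 0.8],
           [0.9, 0.8, 0.7]]
shifted = [[0.9, 0.1, 0.0],
           [0.8, 0.2, 0.0],
           [0.7, 0.1, 0.0]]

print(positive_bins(aligned), positive_bins(shifted))  # 9 3
print(vote(aligned) > vote(shifted))                   # True
```

Only the box whose bins line up with the corresponding object parts collects a high score, so the score itself encodes how well the box is localized.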