1. Introduction

the sibling head: a specialized head shared by classification and localization
- used in single-stage, two-stage, and anchor-free detectors alike
- concern: the two objective functions inside the sibling head may conflict
IoU-Net (2018)
- features that produce a good classification score tend to predict a coarse bbox
  - adds an extra head that predicts IoU as the localization confidence
  - merges the localization confidence and the classification confidence into the final classification score
- raises the confidence score of tight bboxes and lowers it for poor ones
- but the misalignment at each spatial point still remains
Double-Head RCNN
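- splits the sibling head into separate classification and localization branches
- reduces the parameters shared by the two tasks
- performance improves, but the features fed into both branches are produced by RoI pooling from the same proposal, so the conflict between the two tasks remains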
a closer look at the sibling head of an anchor-based object detector
- classification and localization differ in spatial sensitivity
  - classification : some salient area / bbox regression : boundary
- this misalignment in the spatial dimension limits how much performance can improve
task-aware spatial disentanglement (TSD)
- decouple the gradient flow of classification and localization
- a progressive constraint (PC) enlarges the performance margin between TSD and the original sibling head
  - controlled by a margin hyper-parameter

2. Methods

task-aware spatial disentanglement (TSD)

2.1. TSD

the classical Faster RCNN
- $P$ : a rectangular bounding box proposal
- $\mathcal{B}$ : the ground-truth bounding box with class $y$
$$ \mathcal{L} = \mathcal{L}_{cls}(\mathcal{H}_1(F_l, P), y) + \mathcal{L}_{loc}(\mathcal{H}_2(F_l, P), \mathcal{B}) \tag{1} $$
- $\mathcal{H}_1(\cdot) = \{f(\cdot), \mathcal{C}(\cdot)\}$ : the classification head (its output feeds $\mathcal{L}_{cls}$)
- $\mathcal{H}_2(\cdot) = \{f(\cdot), \mathcal{R}(\cdot)\}$ : the localization head (its output feeds $\mathcal{L}_{loc}$)
- $f(\cdot)$ : the feature extractor
- $\mathcal{C}(\cdot)$ and $\mathcal{R}(\cdot)$ : the functions for transforming feature to predict specific category and localize object
head-decoupling improves performance, but the two tasks still overlap in the spatial dimension, which can cause problems
- solve this by disentangling the tasks of the sibling head in the spatial dimension
$$ \mathcal{L} = \mathcal{L}^D_{cls} (\mathcal{H}^D_1 (F_l, \hat{P}_c), y) + \mathcal{L}^D_{loc}(\mathcal{H}^D_2 (F_l, \hat{P}_r), \mathcal{B}) \tag{2} $$
- $\hat{P}_c = \tau_c(P, \Delta C)$ and $\hat{P}_r = \tau_r(P, \Delta R)$ : the disentangled proposals, both derived from the shared $P$
- $\tau_*$ : disentangle function
- $\Delta C$ : a pointwise deformation of $P$
- $\Delta R$ : a proposal-wise translation
- $\mathcal{H}^D_1(\cdot) = \{f_c(\cdot), \mathcal{C}(\cdot)\}$ and $\mathcal{H}^D_2(\cdot) = \{f_r(\cdot), \mathcal{R}(\cdot)\}$ in TSD
TSD takes the RoI feature of $P$ as input and generates the disentangled proposals $\hat{P}_c$ and $\hat{P}_r$, one per task
- the separate proposals let the two tasks be disentangled in the spatial dimension
- $\hat{F}_c$ (classification-specific feature map) → a three-layer fully connected network for classification
- $\hat{F}_r$ (localization-specific feature map) → a similar network for localization
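
The difference between Eq. (1) and Eq. (2) can be sketched as plain data flow. This is only an illustration: every callable passed in (`f`, `f_c`, `f_r`, `C`, `R`, `tau_c`, `tau_r`, and the two loss functions) is a hypothetical placeholder supplied by the caller, not an API of any detection library.

```python
def sibling_head_loss(f, C, R, loss_cls, loss_loc, F_l, P, y, B):
    """Eq. (1): both tasks read the same RoI feature of the same proposal P."""
    feat = f(F_l, P)  # one shared feature extractor, one shared proposal
    return loss_cls(C(feat), y) + loss_loc(R(feat), B)


def tsd_head_loss(f_c, f_r, C, R, tau_c, tau_r, loss_cls, loss_loc,
                  F_l, P, delta_C, delta_R, y, B):
    """Eq. (2): each task reads a feature pooled from its own derived proposal."""
    P_c = tau_c(P, delta_C)  # pointwise-deformed proposal for classification
    P_r = tau_r(P, delta_R)  # proposal-wise translated proposal for localization
    return loss_cls(C(f_c(F_l, P_c)), y) + loss_loc(R(f_r(F_l, P_r)), B)
```

The only structural change is that $P$ is replaced by $\hat{P}_c$ and $\hat{P}_r$ before feature extraction, so each task can look at a different region of $F_l$.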

2.2. Task-aware spatial disentanglement learning

task-aware spatial disentanglement learning
- Localization
  - $\mathcal{F}_r$ : generates a proposal-wise translation of $P$ to produce the new proposal $\hat{P}_r$
    
    $$ \Delta R = \gamma \mathcal{F}_r(F;\theta_r) \cdot (w,h) \tag{3} $$
    - $\Delta R \in \mathbb{R}^{1 \times 1 \times 2}$; $\mathcal{F}_r$ is a three-layer fully connected network with output $\{256, 256, 2\}$
    - $\gamma$ : a pre-defined scalar to modulate the magnitude of the $\Delta R$
    - $(w,h)$ : the width and height of $P$
    - The derived function $\tau_r(\cdot)$ for generating $\hat{P}_r$
    $$ \hat{P}_r = P + \Delta R \tag{4} $$
    - the proposal-wise translation
      - the coordinate of each pixel in $P$ → a new coordinate shifted by the same $\Delta R$
      - $\hat{P}_r$ is used only for the localization task; bilinear interpolation in the pooling keeps $\Delta R$ differentiable
- Classification
  - a pointwise deformation on the regular $k \times k$ grid generates the irregularly shaped derived proposal $\hat{P}_c$
  - for the (x,y)-th grid, the translation $\Delta C(x,y,*)$ is applied to its sample points to obtain the new sample points of $\hat{P}_c$
    
    $$ \Delta C = \gamma \mathcal{F}_c(F;\theta_c) \cdot (w,h) \tag{5} $$
    - $\Delta C \in \mathbb{R}^{k \times k \times 2}$
    - $\mathcal{F}_c$ : a three-layer fully connected network with output $\{256, 256, k \times k \times 2\}$
    - $\theta_c$ : the learned parameter
- the first layer of $\mathcal{F}_r$ and $\mathcal{F}_c$ is shared to reduce parameters
- to generate the feature map $\hat{F}_c$ from the irregular $\hat{P}_c$, an operation similar to deformable RoI pooling is applied
  
  $$ \hat{F}_c(x,y) = \sum_{p \in G(x,y)} \frac{\mathcal{F}_B(p_x + \Delta C(x,y,1),\ p_y + \Delta C(x,y,2))}{|G(x,y)|} \tag{6} $$
  - $G(x,y)$ : the (x,y)-th grid
  - $|G(x,y)|$ : the number of sample points in $G(x,y)$
  - $(p_x, p_y)$ : the coordinate of the sample point in grid $G(x,y)$
  - $\mathcal{F}_B$ : the bilinear interpolation to make the $\Delta C$ differentiable
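
Eqs. (3) to (6) can be traced with a small pure-Python sketch. It rests on several assumptions that are not in the notes: boxes are `(x1, y1, x2, y2)` tuples, the feature map is a single-channel nested list `feat[row][col]`, the raw network outputs $\mathcal{F}_r$ and $\mathcal{F}_c$ are passed in as plain numbers (`offset`, `offsets`), coordinates are clamped to the map, and the defaults `gamma=0.1` and `samples=2` are illustrative values only. All function names are hypothetical.

```python
import math


def translate_proposal(P, offset, gamma=0.1):
    """Eq. (3)-(4): shift the whole box P by one shared Delta R."""
    x1, y1, x2, y2 = P
    w, h = x2 - x1, y2 - y1
    dx, dy = gamma * offset[0] * w, gamma * offset[1] * h  # Delta R
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def bilinear(feat, x, y):
    """F_B: bilinear interpolation of feat[row][col] at the real point (x, y)."""
    H, W = len(feat), len(feat[0])
    x = min(max(x, 0.0), W - 1.0)  # clamp to the map (a simplification)
    y = min(max(y, 0.0), H - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bot = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bot


def deformable_pool(feat, P, offsets, k, gamma=0.1, samples=2):
    """Eq. (5)-(6): per-cell offsets Delta C, then average bilinear samples.

    offsets[gy][gx] holds the raw (dx, dy) output for the grid cell in
    column gx and row gy, i.e. a k x k x 2 array.
    """
    x1, y1, x2, y2 = P
    w, h = x2 - x1, y2 - y1
    bin_w, bin_h = w / k, h / k
    out = [[0.0] * k for _ in range(k)]
    for gy in range(k):
        for gx in range(k):
            dx = gamma * offsets[gy][gx][0] * w  # Delta C(x, y, 1)
            dy = gamma * offsets[gy][gx][1] * h  # Delta C(x, y, 2)
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # regular sample point (p_x, p_y) inside the cell
                    px = x1 + (gx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (gy + (sy + 0.5) / samples) * bin_h
                    total += bilinear(feat, px + dx, py + dy)
            out[gy][gx] = total / (samples * samples)  # divide by |G(x, y)|
    return out
```

With all offsets set to zero, `deformable_pool` reduces to ordinary average pooling over a regular grid. The localization branch moves every sample point by the same amount, while the classification branch moves each grid cell independently.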

2.3. Progressive constraint

progressive constraint (PC)
- classification branch
  
  $$ \mathcal{M}_{cls} = |\mathcal{H}_1 (y|F_l,P) - \mathcal{H}^D_1(y|F_l, \tau_c(P,\Delta C)) + m_c|_+ \tag{7} $$
  - $\mathcal{H}(y|\cdot)$ : the confidence score of the $y$-th class
  - $m_c$ : the predefined margin
  - $| \cdot |_+$ : ReLU
- localization branch
  
  $$ \mathcal{M}_{loc} = |IoU(\hat{\mathcal{B}}, \mathcal{B}) - IoU(\hat{\mathcal{B}}_D, \mathcal{B}) + m_r|_+ \tag{8} $$
  - $\hat{\mathcal{B}}$ : the predicted box by sibling head
  - $\hat{\mathcal{B}}_D$ : the box regressed by $\mathcal{H}^D_2(F_l, \tau_r(P, \Delta R))$
  - if $P$ is a negative proposal, $\mathcal{M}_{loc}$ is ignored
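
The two margin terms in Eqs. (7) and (8) are hinge losses, which a few lines of Python make concrete. This is a minimal sketch for a single proposal, assuming boxes are `(x1, y1, x2, y2)` tuples. The function names, the `positive` flag, and the default margins of 0.2 are illustrative choices, not taken from these notes.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def margin_cls(score_sibling, score_tsd, m_c=0.2):
    """Eq. (7): ReLU(sibling score - TSD score + m_c) for the true class y."""
    return max(0.0, score_sibling - score_tsd + m_c)


def margin_loc(box_sibling, box_tsd, gt_box, m_r=0.2, positive=True):
    """Eq. (8): ReLU(IoU of sibling box - IoU of TSD box + m_r)."""
    if not positive:  # negative proposals contribute nothing
        return 0.0
    return max(0.0, iou(box_sibling, gt_box) - iou(box_tsd, gt_box) + m_r)
```

Each term is zero once the TSD branch beats the sibling head by at least the margin. A non-zero value penalizes TSD for not being sufficiently better than the sibling head on that proposal.
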
whole loss function of TSD with Faster RCNN

$$ \mathcal{L} = \underbrace{\mathcal{L}_{rpn} + \mathcal{L}_{cls} + \mathcal{L}_{loc}}_{classical\ loss} + \underbrace{\mathcal{L}^D_{cls} + \mathcal{L}^D_{loc} + \mathcal{M}_{cls} + \mathcal{M}_{loc}}_{TSD\ loss} \tag{9} $$
- TSD learns task-specific feature representations for classification and localization
