TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (2021.06)

<aside> <img src="/icons/bookmark_red.svg" alt="/icons/bookmark_red.svg" width="40px" /> Abstract

Transformer: strong at extracting global features (via self-attention)
Dense prediction tasks such as 3D segmentation need both local and global features to be captured well
TransBTS model proposed
- To capture local context, the encoder first uses a 3D CNN
- To capture global features, the feature maps are reshaped into tokens and fed to the Transformer
- Achieves strong results on BraTS 2019 and 2020 </aside>

1. Introduction

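CNNs perform well at image feature extraction but have difficulty modeling long-range dependencies

Reason) convolution kernels have limited receptive fields (locality)

⇒ A serious drawback for segmentation, which relies on global semantic context
Self-attention
- Can capture long-range dependencies ⇒ overcomes this limitation of CNNs
- Very high time/space complexity (quadratic in the number of tokens)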
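
To make the cost concrete, the sketch below counts tokens and the size of one attention score matrix for a volume, comparing one token per voxel against one token per position after 8x downsampling (the resolution TransBTS feeds to its Transformer). The 128x128x128 volume size and float32 storage are illustrative assumptions, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(h, w, d, downsample=1, bytes_per_value=4):
    """Return (num_tokens, attention_matrix_bytes) for one attention head.

    One token is created per spatial position after dividing each axis by
    `downsample`. The attention matrix holds one score per token pair.
    """
    n = (h // downsample) * (w // downsample) * (d // downsample)
    return n, n * n * bytes_per_value


# Assumed volume size: 128 x 128 x 128 voxels.
voxel_tokens, voxel_bytes = attention_cost(128, 128, 128)
feat_tokens, feat_bytes = attention_cost(128, 128, 128, downsample=8)

print(voxel_tokens, voxel_bytes / 2**40)  # token count, matrix size in TiB
print(feat_tokens, feat_bytes / 2**20)    # token count, matrix size in MiB
```

Under these assumptions a single score matrix over raw voxels would take 16 TiB, while the 8x downsampled feature map needs 64 MiB, which is consistent with placing the Transformer after a downsampling encoder.
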
Research motivation
- Dense prediction tasks need both local and global information
- Splitting an image into patches before the Transformer ignores local information
- Medical volumetric data ⇒ when processed slice by slice as 2D images ⇒ dependencies between consecutive slices are hard to learn
Proposal: Transformer in 3D CNN for 3D MRI Brain Tumor Segmentation (TransBTS)
- 3D CNN: efficiently extracts local 3D context
- Transformer: models global features

<aside> 🐍 Encoder

3D CNN - downsampling with 3x3x3 convolutions
- Gradually produces a low-resolution, high-dimensional feature map ($F$)
- $X \in \mathbb{R}^{C\times H \times W \times D}$ ⇒ $F \in \mathbb{R}^{K\times {H\over 8} \times {W\over 8} \times {D\over 8}}$
- 3D context is effectively embedded into $F$.
- $F$ is passed to the Transformer, which learns correlations across the global context
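
The shape change from $X$ to $F$ can be traced with the standard convolution output-size formula. This is a sketch under assumptions not stated in these notes: three stride-2 stages with padding 1, a 4-channel 128x128x128 input, and illustrative channel widths of 32, 64, and 128. `conv_out` and `trace_encoder` are hypothetical helpers.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of one axis after a convolution (floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_encoder(shape, stage_channels):
    """Apply one stride-2 3x3x3 convolution per entry of `stage_channels`.

    `shape` is (C, H, W, D); returns the list of shapes after each stage.
    """
    _, h, w, d = shape
    shapes = []
    for channels in stage_channels:
        h, w, d = conv_out(h), conv_out(w), conv_out(d)
        shapes.append((channels, h, w, d))
    return shapes


# Assumed input (C, H, W, D) and assumed channel width per downsampling stage.
stages = trace_encoder((4, 128, 128, 128), stage_channels=[32, 64, 128])
for s in stages:
    print(s)
```

With these settings each stage halves every spatial axis, so the last shape has $K = 128$ channels on a 16x16x16 grid, i.e. $H/8 \times W/8 \times D/8$.
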
**Feature Embedding** of Transformer Encoder
1. **Linear projection: 3 x 3 x 3 convolution**
  - Increases the channel dimension ($K = 128$ ⇒ $d = 512$)
  - $F \in \mathbb{R}^{K\times {H\over 8} \times {W\over 8} \times {D\over 8}}$ ⇒ $F' \in \mathbb{R}^{d\times {H\over 8} \times {W\over 8} \times {D\over 8}}$
2. Flatten the spatial and depth dimensions
  - Reason) the Transformer takes a sequence as input
  - $F' \in \mathbb{R}^{d\times {H\over 8} \times {W\over 8} \times {D\over 8}}$ ⇒ $f \in \mathbb{R}^{d\times ({H\over 8} \times {W\over 8} \times {D\over 8})}$ ⇒ $f \in \mathbb{R}^{d\times N}$ ($N = {H\over 8} \times {W\over 8} \times {D\over 8}$)
3. Add position embedding
  - Reason) to encode location information
  - $z_0 = f + PE$ ($PE \in \mathbb{R}^{d\times N}$)
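
Steps 2 and 3 can be sketched with NumPy on a small tensor. The sizes are scaled down for illustration, `F_prime` is random stand-in data rather than the output of the 3x3x3 convolution, and the random `PE` stands in for an embedding that would normally be learned (an assumption; these notes do not say how $PE$ is obtained).

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for d = 512 and the (H/8, W/8, D/8) grid.
d, h8, w8, d8 = 8, 2, 3, 4
F_prime = rng.normal(size=(d, h8, w8, d8))   # projected feature F'

# Step 2: collapse the three spatial axes into one token axis of length N.
N = h8 * w8 * d8
f = F_prime.reshape(d, N)

# Step 3: add a position embedding of the same shape (random stand-in).
PE = rng.normal(size=(d, N))
z0 = f + PE

# Token n corresponds to grid position (i, j, k) in row-major order.
i, j, k = 1, 2, 3
n = (i * w8 + j) * d8 + k
print(z0.shape, np.allclose(f[:, n], F_prime[:, i, j, k]))
```

Because the flatten is a plain row-major reshape, the token order alone does not tell the Transformer where a token sat in the volume; the added $PE$ is what carries that information.
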
Transformer Layer

Consists of $L$ standard Transformer layers
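
As a rough illustration of what one such layer computes, here is a minimal NumPy sketch. It is not the TransBTS implementation: it uses a single attention head, a pre-norm arrangement, a ReLU feed-forward block, and random weights, all of which are simplifying assumptions, and the sizes are scaled down.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def transformer_layer(z, params):
    """One single-head layer on tokens z of shape (N, d)."""
    Wq, Wk, Wv, Wo, W1, W2 = params
    x = layer_norm(z)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N): each token attends to all tokens
    z = z + (attn @ v) @ Wo                         # residual around attention
    hidden = np.maximum(layer_norm(z) @ W1, 0.0)    # feed-forward with ReLU
    return z + hidden @ W2, attn                    # residual around feed-forward


rng = np.random.default_rng(0)
N, d, L = 24, 8, 2            # scaled-down token count, width, and depth
z = rng.normal(size=(N, d))   # tokens as rows, i.e. the transpose of the (d, N) layout
for _ in range(L):
    params = [rng.normal(scale=0.1, size=s)
              for s in [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
    z, attn = transformer_layer(z, params)
print(z.shape, attn.shape)
```

The `(N, N)` attention matrix is where the global context comes from: every token's output is a weighted mix of all $N$ positions in the volume, and the token shape is preserved from layer to layer.
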

</aside>

<aside> 🐍 Decoder

Feature mapping
1. Reshape the sequence back into a standard 4D feature map
  
  $Z_L \in \mathbb{R}^{d\times N}$ ⇒ $Z_L' \in \mathbb{R}^{d\times {H\over 8} \times {W\over 8} \times {D\over 8}}$
2. Reduce the channel dimension ($d$ ⇒ $K$) to lower the computational cost
  
  $Z_L' \in \mathbb{R}^{d\times {H\over 8} \times {W\over 8} \times {D\over 8}}$ ⇒ $Z \in \mathbb{R}^{K\times {H\over 8} \times {W\over 8} \times {D\over 8}}$
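
The two feature-mapping steps can be sketched in NumPy as follows. Sizes are scaled down, the Transformer is replaced by an identity so the reshape can be checked, and the channel reduction is written as a per-position linear map with random weights. That map is a stand-in: these notes do not specify which operation performs the reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 2                      # scaled-down stand-ins for d = 512, K = 128
h8, w8, d8 = 2, 3, 4
N = h8 * w8 * d8

F_prime = rng.normal(size=(d, h8, w8, d8))
Z_L = F_prime.reshape(d, N)      # pretend the Transformer returned its input unchanged

# Step 1: restore the spatial grid; this exactly inverts the flatten above.
Z_L_prime = Z_L.reshape(d, h8, w8, d8)

# Step 2: reduce channels d -> K with a per-position linear map
# (equivalent to a 1x1x1 convolution without bias; weights are random here).
W = rng.normal(size=(K, d))
Z = np.einsum("kc,chwd->khwd", W, Z_L_prime)

print(Z_L_prime.shape, Z.shape)
```

The reshape costs nothing and loses nothing, so all of the saving in step 2 comes from the decoder working on $K$ channels instead of $d$.
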
Feature upsampling
- Upsampling & conv: $Z \in \mathbb{R}^{K\times {H\over 8} \times {W\over 8} \times {D\over 8}}$ is gradually restored to the original resolution ($R \in \mathbb{R}^{H\times W\times D}$)
- Skip connections from the encoder help produce a finer segmentation mask
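
The resolution recovery can be traced with a small NumPy sketch. Nearest-neighbour repetition is used as a stand-in for the decoder's upsampling operator, the convolutions after each fusion are omitted, and the encoder features in `skips` are random data with hypothetical channel counts, so only the shapes are meaningful.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W, D) array along H, W, D."""
    for axis in (1, 2, 3):
        x = np.repeat(x, 2, axis=axis)
    return x


rng = np.random.default_rng(0)
H = W = D = 16                                    # small full resolution for the demo
# Hypothetical encoder features kept for the skip connections, finest first.
skips = [rng.normal(size=(4, H // s, W // s, D // s)) for s in (1, 2, 4)]
Z = rng.normal(size=(8, H // 8, W // 8, D // 8))  # decoder input at 1/8 resolution

x = Z
for skip in reversed(skips):                      # 1/4, then 1/2, then full resolution
    x = upsample2x(x)
    x = np.concatenate([x, skip], axis=0)         # fuse along the channel axis
    print(x.shape)
```

In this sketch the channel count grows at every fusion because nothing follows the concatenation; a real decoder would apply convolutions after each fusion to mix the upsampled and skip features and bring the channel count back down.
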

</aside>