1. Introduction

Cross-entropy loss is widely used in supervised learning
- It is defined as the KL-divergence between the label distribution (one-hot vectors) and the empirical distribution of the logits
Methods that improve on cross-entropy relax the definition of the loss
- Specifically, they relax the requirement that the reference distribution be axis-aligned (one-hot)
- Label smoothing: blurs the distinction between correct and incorrect labels by moving the target off-axis
- Self-distillation: multiple rounds of cross-entropy training
- Mixup: creates explicit new training examples and applies the same linear interpolation to the target distribution
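
To make "axis-aligned" versus "off-axis" targets concrete, here is a minimal NumPy sketch. The helper names (`one_hot`, `smooth_labels`, `mixup`, `cross_entropy`) and the values `eps=0.1` and `lam=0.7` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def one_hot(label, num_classes):
    """Axis-aligned reference distribution: all mass on the true class."""
    target = np.zeros(num_classes)
    target[label] = 1.0
    return target

def smooth_labels(target, eps=0.1):
    """Label smoothing: move the target off-axis by mixing in a uniform distribution."""
    return (1.0 - eps) * target + eps / target.shape[-1]

def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: interpolate two inputs and their targets with the same weight lam."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -(target * log_probs).sum()

# Hypothetical example values.
logits = np.array([2.0, 0.5, -1.0])
hard = one_hot(0, 3)                 # axis-aligned target
soft = smooth_labels(hard, eps=0.1)  # off-axis target
x_mix, y_mix = mixup(np.ones(4), hard, np.zeros(4), one_hot(2, 3), lam=0.7)
```

All three targets are valid probability distributions, so the same `cross_entropy` function applies to each; only the reference distribution changes.
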
This paper proposes a new supervised training loss that pulls samples of the same class together and pushes samples of different classes apart
It builds on contrastive objective functions, which perform well in self-supervised learning and are closely connected to metric learning
A contrastive loss consists of two "opposing forces"
- An anchor point is given
- The first force pulls the anchor closer to other points: the positives
- The second force pushes the anchor away from other points: the negatives
Self-supervised contrastive learning uses a single positive per anchor, whereas this paper considers many positives
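
The idea of many positives per anchor can be sketched as follows. These notes do not give the paper's exact formulation, so this is one plausible form under stated assumptions: L2-normalized embeddings, a softmax over inner products scaled by a hypothetical `temperature`, and an average over each anchor's positives. The function name is illustrative.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Contrastive loss where every same-label sample in the batch is a positive.

    embeddings: (N, D) float array, labels: (N,) integer array.
    Assumes at least one anchor has a positive in the batch.
    """
    # L2-normalize so that inner products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples (positives and negatives alike).
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp_sim = np.exp(sim) * not_self            # exclude the anchor itself
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Average log-probability over each anchor's positives; skip anchors with none.
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    mean_log_prob = (log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return -mean_log_prob.mean()

# Hypothetical batch: two tight clusters in 2-D.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = multi_positive_contrastive_loss(emb, np.array([0, 0, 1, 1]))
loss_mismatched = multi_positive_contrastive_loss(emb, np.array([0, 1, 0, 1]))
```

When the labels agree with the clusters, the positives already sit close to their anchors and the loss is small; mismatched labels give a much larger loss. With exactly one positive per anchor, this reduces to the single-positive form used in the self-supervised setting.
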
With ResNet-50 and Auto-Augment, the supervised contrastive loss reached a top-1 accuracy 1.6% higher than the cross-entropy loss
Main contributions
1. Contrastive learning in the fully supervised setting, using a contrastive loss with multiple positives per anchor
2. State-of-the-art top-1 accuracy and robustness compared with cross-entropy
3. Less sensitivity to hyperparameter ranges than cross-entropy
4. A gradient that encourages learning from hard positives and hard negatives, and a connection to triplet loss when a single positive and a single negative are used
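
For reference, the triplet loss mentioned in contribution 4 can be sketched as below. This is the standard margin-based form with squared Euclidean distances, not the paper's derivation; the function name and `margin=0.2` are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss with one positive and one negative per anchor."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to the positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to the negative
    # Zero once the negative is farther than the positive by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])
easy = triplet_loss(anchor, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
hard = triplet_loss(anchor, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

An easy triplet (positive near, negative far) contributes zero loss and hence no gradient, while a hard triplet (positive far, negative near) is penalized. This is the sense in which such losses emphasize hard positives and hard negatives.
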

2. Related Work

This work draws on self-supervised representation learning, metric learning, and supervised learning
Cross-entropy loss is a powerful loss function for training deep networks
- It is not clear why the target labels should be optimal
- Better target label vectors have been shown to exist (Deep representation learning with target coding)
Other drawbacks of cross-entropy loss have also been studied
- sensitivity to noisy labels
  - Generalized cross entropy loss for training deep neural networks with noisy labels
  - Training convolutional networks with noisy labels
- adversarial examples
  - Large margin deep networks for classification
  - Cross-entropy loss and low-rank features have responsibility for adversarial examples
- poor margins
  - Learning imbalanced datasets with label-distribution-aware margin loss
Alternative losses have been proposed, but changing the reference label distribution has been the more popular and effective approach
- Label Smoothing
- Mixup
- CutMix
- Knowledge Distillation
Self-supervised representation learning has recently gained attention
- language domain
  - pre-trained embedding
    - Bert: Pre-training of deep bidirectional transformers for language understanding
    - Xlnet: Generalized autoregressive pretraining for language understanding
    - Distributed representations of words and phrases and their compositionality
  - Downstream fine-tuning performs well on sentiment classification and question answering
  - It makes it possible to use large amounts of unlabeled data together with very large architectures
- image domain
  - Used to learn embeddings
    - Unsupervised visual representation learning by context prediction
    - Colorful image colorization
    - Split-brain autoencoders: Unsupervised learning by cross-channel prediction
    - Unsupervised learning of visual representations by solving jigsaw puzzles
    - These methods predict the masked part of a signal from the unmasked part
    - This is very difficult for high-dimensional signals such as images
    - One alternative is to replace the dense per-pixel predictive loss in input space with a loss in a lower-dimensional representation space
  - Self-supervised representation learning has shifted toward contrastive learning
    - noise contrastive estimation
      - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
      - Learning word embeddings efficiently with noise-contrastive estimation
    - N-pair loss
      - Improved deep metric learning with multi-class n-pair loss objective
    - During training the loss is applied at the last layer of the deep network; at test time, embeddings from an earlier layer are used for downstream transfer tasks, fine-tuning, or direct retrieval tasks
  - Contrastive learning is related to metric learning and triplet loss
    - What they share: all of them learn powerful representations
    - Triplet loss and contrastive loss differ in the number of positive and negative pairs per data point
      - triplet loss
        
        one positive and one negative pair
      - supervised metric learning
        
        positives come from the same class, negatives from other classes (hard negative mining)
        
        Facenet: A unified embedding for face recognition and clustering
      - self-supervised contrastive loss
        
        one positive pair, selected using either co-occurrence or data augmentation
        
        the key difference is that many negative pairs are used for each data point
  - The loss most similar to supervised contrastive is the soft-nearest neighbor loss
    - In common: embeddings are normalized and Euclidean distances are replaced by inner products
    - Improvements: data augmentation, a disposable contrastive head, two-stage training
    - Contrasting within the mini-batch avoids the approximations used in earlier work (backpropagating through only part of the loss, stale representations in the form of a memory bank) (Improving generalization via scalable neighborhood component analysis)
    - In contrast to work that maximizes the loss at intermediate layers to entangle classes, the classes are disentangled at the last layer (Analyzing and improving representations with the soft nearest neighbor loss)
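
Replacing Euclidean distance with an inner product is natural once embeddings are normalized: for unit vectors, the squared Euclidean distance equals 2 minus twice the inner product, so the two orderings of neighbors coincide. A small numerical check, using hypothetical random vectors:

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
a = normalize(rng.normal(size=8))
b = normalize(rng.normal(size=8))

sq_dist = np.sum((a - b) ** 2)  # squared Euclidean distance
inner = a @ b                   # inner product (cosine similarity here)

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 a.b
identity_holds = np.isclose(sq_dist, 2.0 - 2.0 * inner)
```
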

3. Method

Reviews the contrastive learning losses used in recent self-supervised representation learning
- Representation learning with contrastive predictive coding (CPC)
- Data-efficient image recognition with contrastive predictive coding (CPC2)
- Contrastive multiview coding (CMC)
- A simple framework for contrastive learning of visual representations (SimCLR)
- Then explains how the loss is modified to suit fully supervised learning while preserving the properties of the self-supervised approach
- Semi-supervision sits between self-supervision and full supervision, but it is not covered here

3.1. Representation Learning Framework

Structurally similar to CMC and SimCLR, which use self-supervised contrastive learning