Constrained Reweighting for Training Deep Neural Nets with Noisy Labels
Training ML models involves minimizing a loss function.
At each step, the loss is computed as a weighted sum of the losses of the individual instances in the current mini-batch.
With uniform weights, every instance contributes equally to the parameter update.
Noisy or mislabeled instances tend to have higher loss values than clean ones.
Assigning uniform importance weights to all instances can therefore degrade accuracy when the data contains noisy labels.
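To make the problem concrete, here is a small numeric sketch (the loss values are made up for illustration): with uniform weights, a single mislabeled instance with a large loss can dominate the mini-batch loss and hence the gradient update.

```python
import numpy as np

# Hypothetical per-instance losses for a mini-batch of 5 instances.
# The last instance is mislabeled, so its loss is much larger.
losses = np.array([0.2, 0.3, 0.1, 0.25, 4.0])

# Uniform importance weights: every instance counts equally.
uniform_weights = np.full_like(losses, 1.0 / len(losses))

# The mini-batch loss used for the parameter update.
batch_loss = np.sum(uniform_weights * losses)
print(batch_loss)  # the single noisy instance contributes most of this value
```

Here the noisy instance alone contributes 0.8 of the 0.97 batch loss, so the update is driven mostly by a wrong label.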
SOLUTION
Formulate a family of constrained optimization problems.
Assign importance weights to individual instances in the dataset to reduce the effect of those that are likely to be noisy.
A constraint, quantified by a divergence measure, controls how much the weights can deviate from uniform.
The final loss is the weighted sum of the individual instance losses, and this weighted loss is used to update the model parameters (Constrained Instance reWeighting, CIW).
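As a rough sketch of the idea: when the divergence measure is the KL divergence from the uniform distribution, the optimal weights take a softmax-over-negative-losses form, so high-loss (likely noisy) instances are down-weighted. This is an illustration, not the paper's exact implementation; `lam`, which controls how far the weights may drift from uniform, is an assumed hyperparameter name.

```python
import numpy as np

def ciw_weights(losses, lam=1.0):
    """Importance weights that down-weight high-loss (likely noisy) instances.

    With a KL-divergence constraint tying the weights to the uniform
    distribution, the solution has a softmax-over-negative-losses form.
    `lam` (assumed hyperparameter): large values keep weights near uniform,
    small values concentrate weight on low-loss instances.
    """
    scaled = -losses / lam
    scaled -= scaled.max()      # subtract max for numerical stability
    w = np.exp(scaled)
    return w / w.sum()          # normalize so the weights sum to 1

losses = np.array([0.2, 0.3, 0.1, 0.25, 4.0])
w = ciw_weights(losses, lam=0.5)
weighted_loss = np.sum(w * losses)  # weighted sum used for the update
```

Compared with the uniform average, the noisy instance's weight collapses toward zero and the weighted loss is driven by the clean instances.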
Schematic of the proposed Constrained Instance reWeighting (CIW) method. (src: paper linked above)
The CIW method re-weights the instances in each mini-batch based on their corresponding loss values.
An extension also assigns importance weights over all possible class labels, rather than only over instances.
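A rough sketch of the label-reweighting idea, under the assumption that candidate labels with lower loss receive more weight (same softmax-style tilting as above; the function name and `lam` are hypothetical): instead of a hard one-hot label, the target becomes a smoothed distribution over classes.

```python
import numpy as np

def label_weights(logits, lam=1.0):
    """Distribute importance over all possible class labels (illustrative sketch).

    For each candidate label c, the cross-entropy loss would be
    -log softmax(logits)[c]. Tilting toward low-loss labels with an assumed
    hyperparameter `lam` yields a smoothed label distribution.
    """
    m = logits.max()
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))  # log softmax
    per_class_loss = -log_probs          # loss if the label were class c
    scaled = -per_class_loss / lam
    scaled -= scaled.max()               # numerical stability
    w = np.exp(scaled)
    return w / w.sum()                   # smoothed target over classes

logits = np.array([2.0, 0.5, -1.0])
target = label_weights(logits, lam=1.0)  # a distribution over the 3 classes
```

With `lam=1.0` this reduces to the model's own softmax distribution; smaller `lam` sharpens it toward the lowest-loss class.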