Seyeon An July 20, 2021


Continual learning (also called lifelong learning), a long-standing open problem in machine learning, refers to the ability of a model to learn continually from a stream of data, accommodating new knowledge while retaining previously learned experiences.

The goal is to learn a model for a large number of tasks sequentially, without forgetting knowledge obtained from the preceding tasks, even though data from old tasks is no longer available while training on new ones.

Continual Learning and the Plasticity-Stability Dilemma

The main challenge of continual learning is the plasticity-stability dilemma: if the model focuses too much on stability, it suffers from poor forward transfer to new tasks; if it focuses too much on plasticity, it suffers from catastrophic forgetting of past tasks.

Regularization-based Continual Learning

To address this dilemma, neural network-based continual learning has been studied broadly under the following categories: regularization-based, dynamic architecture-based, and replay memory-based methods, as displayed in the image below.

Common approaches for task-incremental learning: Regularization, Dynamic Architecture, Memory Replay (from left to right)

Our focus is on regularization-based methods, since they aim to use a fixed-capacity neural network as efficiently as possible, which may also allow them to be combined with other approaches.

These methods typically identify the learned weights that are important for previous tasks and heavily penalize their deviation while learning new tasks.

A typical loss function form for a learning task $t$ looks like this:

$$ \mathcal{L}_{t}(\boldsymbol{\theta})=\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})+\sum_{i} \lambda_{i}\left(\theta_{i}-\hat{\theta}_{t-1, i}\right)^{2} $$

<aside> 💡 $\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})$ = Task-specific loss

$\lambda_{i}$ = Adaptive regularization strength reflecting weight importance

$\hat{\theta}_{t-1, i}$ = Weight learned up to task $t-1$

$\theta_{i}$ = Weight of the current model

</aside>

Intuitively, the loss for task $t$ is the sum of the task-specific loss $\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})$ and, for each parameter $i$, the adaptive importance weight $\lambda_{i}$ multiplied by the squared difference $\left(\theta_{i}-\hat{\theta}_{t-1, i}\right)^{2}$ between the current weight and the one learned up to the previous task.
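To make this concrete, here is a minimal PyTorch-style sketch of the penalty term; the function and argument names (`regularized_loss`, `importances`, etc.) are illustrative assumptions, not the implementation of any particular method.

```python
import torch

def regularized_loss(task_loss, params, old_params, importances):
    """Quadratic regularization penalty used in regularization-based
    continual learning (a sketch, not a specific method's code).

    task_loss   -- task-specific loss L_TS,t(theta) on the current task
    params      -- current model parameters theta_i (iterable of tensors)
    old_params  -- parameters theta_hat_{t-1,i} learned up to task t-1
    importances -- per-parameter importance weights lambda_i
    """
    penalty = 0.0
    for theta, theta_old, lam in zip(params, old_params, importances):
        # lambda_i * (theta_i - theta_hat_{t-1,i})^2, summed over weights
        penalty = penalty + (lam * (theta - theta_old) ** 2).sum()
    return task_loss + penalty
```

In practice, `old_params` and `importances` would be snapshotted after finishing task $t-1$ (how $\lambda_{i}$ is estimated is exactly what distinguishes different regularization-based methods), and the returned loss is backpropagated as usual while training on task $t$.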