Seyeon An July 20, 2021


Continual learning (also called lifelong learning), a long-standing open problem in machine learning, refers to the ability of a model to learn continually from a stream of data, accommodating new knowledge while retaining previously learned experiences.

The goal is to learn a model for a large number of tasks sequentially, without forgetting knowledge obtained from the preceding tasks, even though data from old tasks is no longer available while training on new ones.

Continual Learning and the Plasticity-Stability Dilemma

The main challenge of continual learning is the plasticity-stability dilemma: if the model focuses too much on stability, it suffers from poor forward transfer to new tasks; if it focuses too much on plasticity, it suffers from catastrophic forgetting of past tasks.

Regularization-based Continual Learning

To address this dilemma, neural network-based continual learning has been studied broadly under the following categories: regularization-based, dynamic architecture-based, and replay memory-based methods, as displayed in the image below.

Common approaches for task-incremental learning: Regularization, Dynamic Architecture, Memory Replay (from left to right)

Our focus is on regularization-based methods, since they aim to use a fixed-capacity neural network as efficiently as possible, which may also allow them to be combined with other approaches.

These methods typically identify the learned weights that are important for previous tasks and heavily penalize their deviation while learning new tasks.

A typical loss function form for a learning task $t$ looks like this:

$$ \mathcal{L}_{t}(\boldsymbol{\theta})=\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})+\sum_{i} \lambda_{i}\left(\theta_{i}-\hat{\theta}_{t-1, i}\right)^{2} $$

<aside> 💡 $\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})$ = Task-specific loss

$\lambda_{i}$ = Adaptive regularization strength reflecting weight importance

$\hat{\theta}_{t-1, i}$ = Weight learned up to task $t-1$

$\theta_{i}$ = Weight of the current model

</aside>

Intuitively, the loss for task $t$ is the sum of the task-specific loss $\mathcal{L}_{\mathrm{TS}, t}(\boldsymbol{\theta})$ and, for each parameter $i$, the adaptive importance weight $\lambda_{i}$ multiplied by the squared difference $\left(\theta_{i}-\hat{\theta}_{t-1, i}\right)^{2}$ between the current weight and the one learned up to the previous task.
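To make this concrete, here is a minimal PyTorch-style sketch of the penalty term; the function and argument names (`regularized_loss`, `importances`, etc.) are illustrative assumptions, not the implementation of any particular method.

```python
import torch

def regularized_loss(task_loss, params, old_params, importances):
    """Quadratic regularization penalty used in regularization-based
    continual learning (a sketch, not a specific method's code).

    task_loss   -- task-specific loss L_TS,t(theta) on the current task
    params      -- current model parameters theta_i (iterable of tensors)
    old_params  -- parameters theta_hat_{t-1,i} learned up to task t-1
    importances -- per-parameter importance weights lambda_i
    """
    penalty = 0.0
    for theta, theta_old, lam in zip(params, old_params, importances):
        # lambda_i * (theta_i - theta_hat_{t-1,i})^2, summed over weights
        penalty = penalty + (lam * (theta - theta_old) ** 2).sum()
    return task_loss + penalty
```

In practice, `old_params` and `importances` would be snapshotted after finishing task $t-1$ (how $\lambda_{i}$ is estimated is exactly what distinguishes different regularization-based methods), and the returned loss is backpropagated as usual while training on task $t$.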