https://browse.arxiv.org/pdf/2308.07707.pdf

Introduction

The challenge of machine unlearning can be thought of as a multi-objective task: forgetting the targeted data without degrading model performance on the data that remains.

Timeliness is a key constraint: fully retraining the model without the to-be-forgotten data would yield the desired result, but doing so is time and resource intensive. Lightweightness, in turn, refers to how much preparation the unlearning process requires, such as storing a list of samples and parameter updates for every training batch; such bookkeeping adds significant overhead and cannot be performed post hoc.

SSD- distinguishes between generalized and specialized information, prioritizing the protection of generalized, broadly useful information while dampening parameters that are specialized towards the to-be-forgotten samples.

We use the diagonal of the Fisher information matrix (FIM) to identify these specialized parameters.

Proposed method

The guiding intuition behind Selective Synaptic Dampening is that there likely exist parameters that are specifically important for Df (forget set) but not for Dr (retain set).


Hessian and the Fisher information matrix- The sensitivity of the model ϕ_θ with respect to each parameter θ_k can be calculated via the second-order derivative of the loss near the minimum. This sensitivity can be interpreted as the importance of each parameter. Near a minimum, the diagonal of the Fisher information matrix is equivalent to this second derivative of the loss.
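
Concretely, the per-parameter importance can be written as the FIM diagonal estimated over a dataset D (the symbol I_{D,i} is my own shorthand; the paper's notation may differ):

$$
\mathrm{I}_{D,i} \;=\; \frac{1}{|D|} \sum_{(x,y) \in D} \left( \frac{\partial \log p(y \mid x;\, \theta)}{\partial \theta_i} \right)^{2}
$$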

SSD makes two significant amendments to the pruning algorithm, leading to strong forgetting and retain-set performance while maintaining fast execution time. First, a stricter selection criterion is implemented that considers the parameter importance to the retain set. This step facilitates the identification of parameters that are highly specialized towards samples in the forget set, with α dictating how specialized they must be to be pruned (see the criterion below).
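
In symbols (a sketch using the importance notation above, with D denoting the full training data as in the dampening description below), a parameter θ_i is selected only when its forget-set importance exceeds its importance on the broader data by the factor α:

$$
\mathrm{I}_{D_f,\,i} \;>\; \alpha \, \mathrm{I}_{D,\,i}
$$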

Second, the pruning step is replaced by a dampening step that applies a penalty to the magnitude of the parameter, proportional to its relative importance for Df compared to D; λ is a hyper-parameter that controls the level of protection.
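
A sketch of the dampening step in the same notation: each selected parameter is scaled by a factor β that shrinks as its forget-set importance dominates, capped at 1,

$$
\beta_i \;=\; \min\!\left( \frac{\lambda \, \mathrm{I}_{D,\,i}}{\mathrm{I}_{D_f,\,i}},\; 1 \right), \qquad \theta_i \;\leftarrow\; \beta_i \, \theta_i
$$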


Intuitively, if λ = 1 then β < 1 for all parameters that are specialized towards Df. Therefore, β → 0 as a parameter becomes more specialized for Df. Since λ scales this update, the dampening factor is given an upper bound of 1 to prevent large λ values from causing parameters to grow. The dampening effect, combined with the selection criterion, creates a granular approach to forgetting.
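
Putting the pieces together, a minimal PyTorch sketch of the procedure described above might look as follows. The function names, the use of cross-entropy as the log-likelihood, the small epsilon guarding the division, and the default α and λ values are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F


def fim_diagonal(model, loader, device="cpu"):
    """Estimate the diagonal of the Fisher information matrix as the
    per-batch mean of squared gradients of the negative log-likelihood."""
    importances = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_batches = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = F.cross_entropy(model(x), y)  # NLL for a classifier (assumption)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                importances[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: imp / max(n_batches, 1) for n, imp in importances.items()}


def selective_synaptic_dampening(model, full_loader, forget_loader,
                                 alpha=10.0, lam=1.0, device="cpu"):
    """Select parameters disproportionately important to the forget set D_f
    and shrink them by the capped dampening factor beta."""
    imp_d = fim_diagonal(model, full_loader, device)      # importance on D
    imp_df = fim_diagonal(model, forget_loader, device)   # importance on D_f
    with torch.no_grad():
        for n, p in model.named_parameters():
            i_d, i_df = imp_d[n], imp_df[n]
            selected = i_df > alpha * i_d                            # selection criterion
            beta = torch.clamp(lam * i_d / (i_df + 1e-12), max=1.0)  # dampening factor
            p.mul_(torch.where(selected, beta, torch.ones_like(beta)))
    return model
```

Both importance passes are computed post hoc from the trained model and the data, which is what keeps the method lightweight relative to approaches that must track per-batch samples and parameter updates during training.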
