Let $\mathbf{y}\in \mathbb{R}^m$ be the actual values we want to predict and $\mathbf{\hat{y}} \in \mathbb{R}^m$ the predictions made by our model. More specifically, $\mathbf{\hat{y}}=f(\mathbf{X}, \mathbf{w})$, where $f$ is the prediction model. Several objective functions can be used to evaluate the model. In this section, we discuss the most commonly used ones, starting with MeanAbsoluteError, MeanSquaredError, and Huber in the regression setting.
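For concreteness, $f$ could for example be a simple linear model, $f(\mathbf{X}, \mathbf{w})=\mathbf{X}\mathbf{w}$ (the same linear form reappears later with a bias term and a sigmoid). A minimal sketch with placeholder values:
> import tensorflow as tf
> X = tf.constant([[1., 2.], [3., 4.], [5., 6.]])  # placeholder design matrix: m = 3 samples, 2 features
> w = tf.constant([[0.5], [-0.25]])                # placeholder weight vector
> y_hat = tf.linalg.matmul(X, w)                   # y_hat = f(X, w), shape (3, 1)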
Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m|y_i - \hat{y}_i|$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_i]=\begin{cases}\frac{1}{m}\;&\text{if } \hat{y}_i > y_i\\-\frac{1}{m}&\text{otherwise}\end{cases}$
Properties:
> y_true = [[0., 1.], [0., 0.]]  # example targets
> y_pred = [[1., 1.], [1., 0.]]  # example predictions
> mae = tf.keras.losses.MeanAbsoluteError()
> mae(y_true, y_pred, sample_weight=[1, 1])
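As a sanity check of the derivative above, the loss can be written out directly and differentiated with tf.GradientTape; this is a minimal sketch with placeholder values (chosen so that no $\hat{y}_i$ equals $y_i$), ignoring sample_weight:
> import tensorflow as tf
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets, m = 4
> y_pred = tf.Variable([2.5, 0.0, 2.5, 8.0])   # placeholder predictions
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.abs(y_true - y_pred))  # the MAE formula above
> tape.gradient(loss, y_pred)  # [-0.25, 0.25, 0.25, 0.25]: +1/m where y_hat_i > y_i, -1/m where y_hat_i < y_i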

Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m(y_i - \hat{y}_i)^2$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$
Properties:
> y_true = [[0., 1.], [0., 0.]]  # example targets
> y_pred = [[1., 1.], [1., 0.]]  # example predictions
> mse = tf.keras.losses.MeanSquaredError()
> mse(y_true, y_pred, sample_weight=[1, 1])
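The derivative $-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$ can likewise be confirmed numerically; a minimal sketch with placeholder values:
> import tensorflow as tf
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets, m = 4
> y_pred = tf.Variable([2.5, 0.0, 2.0, 8.0])   # placeholder predictions
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.square(y_true - y_pred))  # the MSE formula above
> tape.gradient(loss, y_pred)  # (2/m)(y_hat - y) = [-0.25, 0.25, 0.0, 0.5]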

Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m\begin{cases}\frac{1}{2}(y_i-\hat{y}_i)^2\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\\delta(|y_i-\hat{y}_i|-\frac{1}{2}\delta)&\text{otherwise}\end{cases}$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_i]=\frac{1}{m}\begin{cases}\hat{y}_i-y_i\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\-\delta&\text{if}\;y_i-\hat{y}_i>\delta\\\delta&\text{otherwise, i.e. if}\;\hat{y}_i-y_i>\delta\end{cases}$
Properties:
> y_true = [[0., 1.], [0., 0.]]    # example targets
> y_pred = [[0.6, 0.4], [0.4, 0.6]]  # example predictions
> huber = tf.keras.losses.Huber()  # delta defaults to 1.0
> huber(y_true, y_pred, sample_weight=[1, 1])
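To make the piecewise definition concrete, it can be evaluated directly and compared with tf.keras.losses.Huber; a minimal sketch with placeholder values, using the default $\delta=1.0$:
> import tensorflow as tf
> delta = 1.0
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets
> y_pred = tf.constant([2.5, 0.0, 2.0, 9.0])   # placeholder predictions
> err = tf.abs(y_true - y_pred)
> per_elem = tf.where(err <= delta, 0.5 * tf.square(err), delta * (err - 0.5 * delta))  # the two cases above
> tf.reduce_mean(per_elem)                            # 0.4375 for these values
> tf.keras.losses.Huber(delta=delta)(y_true, y_pred)  # should match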

Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not, and let $\hat{y}_{ij}$ denote the predicted probability of assigning sample $i$ to class $j$, with $K$ classes in total.
Equation:
$H(\mathbf{y}, \mathbf{\hat{y}})=-\frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{j=1}^Ky_{ij}\ln\hat{y}_{ij}$
Derivative:
$\frac{\partial H(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_{ij}]=-\frac{1}{m}\frac{y_{ij}}{\hat{y}_{ij}}$
Properties:
CategoricalCrossentropy: Used when y_true is a one-hot encoded vector.
> y_true = [[0, 1, 0], [0, 0, 1]]
> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
> cce = tf.keras.losses.CategoricalCrossentropy()
> cce(y_true, y_pred, sample_weight=[1, 1])
SparseCategoricalCrossentropy: Used when y_true contains integer class labels (label encoded).
> y_true = [1, 2]
> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
> scce = tf.keras.losses.SparseCategoricalCrossentropy()
> scce(y_true, y_pred, sample_weight=[1, 1])
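Both calls above should produce the same value, since the integer labels are just the indices of the ones in the one-hot vectors. The value itself can be reproduced from the formula for $H(\mathbf{y}, \mathbf{\hat{y}})$; the sketch below does this with the same example values, clipping the probabilities by a small epsilon (as Keras does internally) so that $\ln 0$ never occurs:
> import tensorflow as tf
> y_true = tf.constant([[0., 1., 0.], [0., 0., 1.]])
> y_pred = tf.constant([[0.05, 0.95, 0.0], [0.1, 0.8, 0.1]])
> eps = 1e-7  # assumed clipping constant, matching the default Keras epsilon
> manual = -tf.reduce_mean(tf.reduce_sum(y_true * tf.math.log(tf.clip_by_value(y_pred, eps, 1.0)), axis=1))
> manual                                                     # approx. 1.177
> tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)  # same value
> tf.keras.losses.SparseCategoricalCrossentropy()(tf.constant([1, 2]), y_pred)  # same value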

Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not, and let $\hat{y}_{ij}$ denote the predicted score for assigning sample $i$ to class $j$: a larger positive value indicates higher confidence, while a negative value indicates lower confidence. Unlike a probability, this score can be an unbounded continuous value.
Equation:
$\text{neg}_i=\max\limits_{j}\big((1-y_{ij})\hat{y}_{ij}\big)$
$\text{pos}_i=\sum\limits_{j=1}^K y_{ij}\hat{y}_{ij}$
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m\max(0,\,1 + \text{neg}_i - \text{pos}_i)$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_{ij}]=\frac{1}{m}\begin{cases}1-2y_{ij}\;&\text{if }(1-y_{ij})\hat{y}_{ij}=\text{neg}_i\text{ and }\text{neg}_i-\text{pos}_i>-1\\-y_{ij}\;&\text{if }(1-y_{ij})\hat{y}_{ij}\ne\text{neg}_i\text{ and }\text{neg}_i-\text{pos}_i>-1\\0&\text{otherwise}\end{cases}$
Properties:
> y_true = [[1., 0.], [0., 1.]]
> y_pred = [[0.6, 0.2], [-2, 0.1]]
> h = tf.keras.losses.CategoricalHinge()
> h(y_true, y_pred, sample_weight=[1, 1])
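The pos/neg decomposition above can also be traced by hand; the sketch below recomputes it for the same example values and compares the result with tf.keras.losses.CategoricalHinge:
> import tensorflow as tf
> y_true = tf.constant([[1., 0.], [0., 1.]])
> y_pred = tf.constant([[0.6, 0.2], [-2.0, 0.1]])
> pos = tf.reduce_sum(y_true * y_pred, axis=1)          # pos_i = sum_j y_ij * y_hat_ij
> neg = tf.reduce_max((1.0 - y_true) * y_pred, axis=1)  # neg_i = max_j (1 - y_ij) * y_hat_ij
> tf.reduce_mean(tf.maximum(0.0, 1.0 + neg - pos))      # 0.75 for these values
> tf.keras.losses.CategoricalHinge()(y_true, y_pred)    # should give the same value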

Let's first decompose the binary classification problem with a sigmoid output:
$\mathbf{\hat{y}}=\sigma(\mathbf{\hat{x}}),\;\;\;\mathbf{\hat{x}}=\mathbf{X}\mathbf{w}+b$
Now, consider the gradient of the mean squared error with respect to $\mathbf{\hat{y}}$:
$\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1} \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2} \\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\end{bmatrix}=\begin{bmatrix}-\frac{2}{m}(y_1 - \hat{y}_1) \\-\frac{2}{m}(y_2 - \hat{y}_2) \\\vdots \\-\frac{2}{m}(y_m - \hat{y}_m)\end{bmatrix}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})=\frac{2}{m}(\mathbf{\hat{y}}-\mathbf{y})$
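This expression can be checked numerically by composing the linear map, the sigmoid, and the mean squared error, then differentiating with tf.GradientTape; a minimal sketch with placeholder values for $\mathbf{X}$, $\mathbf{w}$, $b$ and $\mathbf{y}$:
> import tensorflow as tf
> X = tf.constant([[0.5, -1.2], [1.5, 0.3], [-0.7, 2.0]])  # placeholder design matrix, m = 3
> w = tf.constant([[0.4], [-0.6]])                         # placeholder weights
> b = 0.1                                                  # placeholder bias
> y = tf.constant([[1.0], [0.0], [1.0]])                   # placeholder binary targets
> x_hat = tf.linalg.matmul(X, w) + b                       # x_hat = Xw + b
> y_hat = tf.Variable(tf.sigmoid(x_hat))                   # y_hat = sigma(x_hat)
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.square(y - y_hat))          # mean squared error
> tape.gradient(loss, y_hat)                               # equals (2/m)(y_hat - y)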