Let $\mathbf{y}\in \mathbb{R}^m$ be the actual values we want to predict and $\mathbf{\hat{y}} \in \mathbb{R}^m$ the predictions made by our model. More specifically, $\mathbf{\hat{y}}=f(\mathbf{X}, \mathbf{w})$, where $f$ is the prediction model. Several objective functions can be used to evaluate the model. In this section, we discuss the most commonly used ones, starting with MeanAbsoluteError, MeanSquaredError, and Huber in the regression setting.
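For concreteness, $f$ could for example be a simple linear model, $f(\mathbf{X}, \mathbf{w})=\mathbf{X}\mathbf{w}$ (the same linear form reappears later with a bias term and a sigmoid). A minimal sketch with placeholder values:
> import tensorflow as tf
> X = tf.constant([[1., 2.], [3., 4.], [5., 6.]])  # placeholder design matrix: m = 3 samples, 2 features
> w = tf.constant([[0.5], [-0.25]])                # placeholder weight vector
> y_hat = tf.linalg.matmul(X, w)                   # y_hat = f(X, w), shape (3, 1)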
Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m|y_i - \hat{y}_i|$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_i]=\begin{cases}\frac{1}{m}\;&\text{if } \hat{y}_i > y_i\\-\frac{1}{m}&\text{otherwise}\end{cases}$
Properties:
> y_true = [[0., 1.], [0., 0.]]  # example targets
> y_pred = [[1., 1.], [1., 0.]]  # example predictions
> mae = tf.keras.losses.MeanAbsoluteError()
> mae(y_true, y_pred, sample_weight=[1, 1])
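As a sanity check of the derivative above, the loss can be written out directly and differentiated with tf.GradientTape; this is a minimal sketch with placeholder values (chosen so that no $\hat{y}_i$ equals $y_i$), ignoring sample_weight:
> import tensorflow as tf
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets, m = 4
> y_pred = tf.Variable([2.5, 0.0, 2.5, 8.0])   # placeholder predictions
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.abs(y_true - y_pred))  # the MAE formula above
> tape.gradient(loss, y_pred)  # [-0.25, 0.25, 0.25, 0.25]: +1/m where y_hat_i > y_i, -1/m where y_hat_i < y_i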

Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m(y_i - \hat{y}_i)^2$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$
Properties:
> y_true = [[0., 1.], [0., 0.]]  # example targets
> y_pred = [[1., 1.], [1., 0.]]  # example predictions
> mse = tf.keras.losses.MeanSquaredError()
> mse(y_true, y_pred, sample_weight=[1, 1])
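The derivative $-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})$ can likewise be confirmed numerically; a minimal sketch with placeholder values:
> import tensorflow as tf
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets, m = 4
> y_pred = tf.Variable([2.5, 0.0, 2.0, 8.0])   # placeholder predictions
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.square(y_true - y_pred))  # the MSE formula above
> tape.gradient(loss, y_pred)  # (2/m)(y_hat - y) = [-0.25, 0.25, 0.0, 0.5]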

Equation:
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m\begin{cases}\frac{1}{2}(y_i-\hat{y}_i)^2\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\\delta(|y_i-\hat{y}_i|-\frac{1}{2}\delta)&\text{otherwise}\end{cases}$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_i]=\frac{1}{m}\begin{cases}\hat{y}_i-y_i\;\;\;&\text{if}\;|y_i-\hat{y}_i|\leq\delta\\-\delta&\text{if}\;y_i-\hat{y}_i>\delta\\\delta&\text{otherwise, i.e. if}\;\hat{y}_i-y_i>\delta\end{cases}$
Properties:
> y_true = [[0., 1.], [0., 0.]]    # example targets
> y_pred = [[0.6, 0.4], [0.4, 0.6]]  # example predictions
> huber = tf.keras.losses.Huber()  # delta defaults to 1.0
> huber(y_true, y_pred, sample_weight=[1, 1])
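To make the piecewise definition concrete, it can be evaluated directly and compared with tf.keras.losses.Huber; a minimal sketch with placeholder values, using the default $\delta=1.0$:
> import tensorflow as tf
> delta = 1.0
> y_true = tf.constant([3.0, -0.5, 2.0, 7.0])  # placeholder targets
> y_pred = tf.constant([2.5, 0.0, 2.0, 9.0])   # placeholder predictions
> err = tf.abs(y_true - y_pred)
> per_elem = tf.where(err <= delta, 0.5 * tf.square(err), delta * (err - 0.5 * delta))  # the two cases above
> tf.reduce_mean(per_elem)                            # 0.4375 for these values
> tf.keras.losses.Huber(delta=delta)(y_true, y_pred)  # should match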

Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not, and let $\hat{y}_{ij}$ denote the predicted probability of assigning sample $i$ to class $j$, with $K$ classes in total.
Equation:
$H(\mathbf{y}, \mathbf{\hat{y}})=-\frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{j=1}^Ky_{ij}\ln\hat{y}_{ij}$
Derivative:
$\frac{\partial H(\mathbf{y}, \mathbf{\hat{y}})}{\partial\mathbf{\hat{y}}}=[d_{ij}]=-\frac{1}{m}\frac{y_{ij}}{\hat{y}_{ij}}$
Properties:
CategoricalCrossentropy: Used when y_true is a one-hot encoded vector.
> y_true = [[0, 1, 0], [0, 0, 1]]
> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
> cce = tf.keras.losses.CategoricalCrossentropy()
> cce(y_true, y_pred, sample_weight=[1, 1])
SparseCategoricalCrossentropy: Used when y_true contains integer class labels (label encoded).
> y_true = [1, 2]
> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
> scce = tf.keras.losses.SparseCategoricalCrossentropy()
> scce(y_true, y_pred, sample_weight=[1, 1])
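Both calls above should produce the same value, since the integer labels are just the indices of the ones in the one-hot vectors. The value itself can be reproduced from the formula for $H(\mathbf{y}, \mathbf{\hat{y}})$; the sketch below does this with the same example values, clipping the probabilities by a small epsilon (as Keras does internally) so that $\ln 0$ never occurs:
> import tensorflow as tf
> y_true = tf.constant([[0., 1., 0.], [0., 0., 1.]])
> y_pred = tf.constant([[0.05, 0.95, 0.0], [0.1, 0.8, 0.1]])
> eps = 1e-7  # assumed clipping constant, matching the default Keras epsilon
> manual = -tf.reduce_mean(tf.reduce_sum(y_true * tf.math.log(tf.clip_by_value(y_pred, eps, 1.0)), axis=1))
> manual                                                     # approx. 1.177
> tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)  # same value
> tf.keras.losses.SparseCategoricalCrossentropy()(tf.constant([1, 2]), y_pred)  # same value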

Let $y_{ij}=1$ denote that sample $i$ belongs to class $j$ and $y_{ij}=0$ that it does not, and let $\hat{y}_{ij}$ denote the predicted score for assigning sample $i$ to class $j$: a larger positive value indicates higher confidence, while a negative value indicates lower confidence. Unlike a probability, this score can be an unbounded continuous value.
Equation:
$\text{neg}_i=\max\limits_{j}\big((1-y_{ij})\hat{y}_{ij}\big)$
$\text{pos}_i=\sum\limits_{j=1}^K y_{ij}\hat{y}_{ij}$
$L(\mathbf{y}, \mathbf{\hat{y}})=\frac{1}{m}\sum\limits_{i=1}^m\max(0,\,1 + \text{neg}_i - \text{pos}_i)$
Derivative:
$\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial \mathbf{\hat{y}}}=[d_{ij}]=\frac{1}{m}\begin{cases}1-2y_{ij}\;&\text{if }(1-y_{ij})\hat{y}_{ij}=\text{neg}_i\text{ and }\text{neg}_i-\text{pos}_i>-1\\-y_{ij}\;&\text{if }(1-y_{ij})\hat{y}_{ij}\ne\text{neg}_i\text{ and }\text{neg}_i-\text{pos}_i>-1\\0&\text{otherwise}\end{cases}$
Properties:
> y_true = [[1., 0.], [0., 1.]]
> y_pred = [[0.6, 0.2], [-2, 0.1]]
> h = tf.keras.losses.CategoricalHinge()
> h(y_true, y_pred, sample_weight=[1, 1])
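The pos/neg decomposition above can also be traced by hand; the sketch below recomputes it for the same example values and compares the result with tf.keras.losses.CategoricalHinge:
> import tensorflow as tf
> y_true = tf.constant([[1., 0.], [0., 1.]])
> y_pred = tf.constant([[0.6, 0.2], [-2.0, 0.1]])
> pos = tf.reduce_sum(y_true * y_pred, axis=1)          # pos_i = sum_j y_ij * y_hat_ij
> neg = tf.reduce_max((1.0 - y_true) * y_pred, axis=1)  # neg_i = max_j (1 - y_ij) * y_hat_ij
> tf.reduce_mean(tf.maximum(0.0, 1.0 + neg - pos))      # 0.75 for these values
> tf.keras.losses.CategoricalHinge()(y_true, y_pred)    # should give the same value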

Let's first decompose the binary classification problem with a sigmoid output:
$\mathbf{\hat{y}}=\sigma(\mathbf{\hat{x}}),\;\;\;\mathbf{\hat{x}}=\mathbf{X}\mathbf{w}+b$
Now, consider the gradient of the mean squared error with respect to $\mathbf{\hat{y}}$:
$\nabla_{\mathbf{\hat{y}}}L(\mathbf{y}, \mathbf{\hat{y}})=\begin{bmatrix}\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_1} \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_2} \\\vdots \\\frac{\partial L(\mathbf{y}, \mathbf{\hat{y}})}{\partial\hat{y}_m}\end{bmatrix}=\begin{bmatrix}-\frac{2}{m}(y_1 - \hat{y}_1) \\-\frac{2}{m}(y_2 - \hat{y}_2) \\\vdots \\-\frac{2}{m}(y_m - \hat{y}_m)\end{bmatrix}=-\frac{2}{m}(\mathbf{y}-\mathbf{\hat{y}})=\frac{2}{m}(\mathbf{\hat{y}}-\mathbf{y})$
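This expression can be checked numerically by composing the linear map, the sigmoid, and the mean squared error, then differentiating with tf.GradientTape; a minimal sketch with placeholder values for $\mathbf{X}$, $\mathbf{w}$, $b$ and $\mathbf{y}$:
> import tensorflow as tf
> X = tf.constant([[0.5, -1.2], [1.5, 0.3], [-0.7, 2.0]])  # placeholder design matrix, m = 3
> w = tf.constant([[0.4], [-0.6]])                         # placeholder weights
> b = 0.1                                                  # placeholder bias
> y = tf.constant([[1.0], [0.0], [1.0]])                   # placeholder binary targets
> x_hat = tf.linalg.matmul(X, w) + b                       # x_hat = Xw + b
> y_hat = tf.Variable(tf.sigmoid(x_hat))                   # y_hat = sigma(x_hat)
> with tf.GradientTape() as tape:
>     loss = tf.reduce_mean(tf.square(y - y_hat))          # mean squared error
> tape.gradient(loss, y_hat)                               # equals (2/m)(y_hat - y)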