Loss function (cost function)
The loss function quantifies our unhappiness with the predictions on the training set.
convention: the lower the loss, the better the performance
- Mean Squared Error (MSE): usually used for regression problems, $\frac{1}{n} \sum_{i=1}^n (f(x_i)-y_i)^2$
- Mean Absolute Error (MAE): regression problems, $\frac{1}{n} \sum_{i=1}^n |f(x_i)-y_i|$
- Cross-entropy loss: usually used for classification problems, $-\frac{1}{n} \sum_{i=1}^n \sum_{c=1}^m y_{i,c} \log p_{i,c}$, where $y_{i,c}$ is the one-hot label and $p_{i,c}$ the predicted probability of class $c$ for example $i$
classification: SVM (hinge) loss and Softmax (cross-entropy) loss, see Linear Classification
regression: MSE and MAE, as above; a minimal sketch of these losses follows
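As a concrete illustration, here is a minimal numpy sketch of the three losses above (the array names and toy values are assumptions for this example, not from the notes):

```python
import numpy as np

def mse(preds, targets):
    # Mean Squared Error: average of squared residuals
    return np.mean((preds - targets) ** 2)

def mae(preds, targets):
    # Mean Absolute Error: average of absolute residuals
    return np.mean(np.abs(preds - targets))

def cross_entropy(probs, labels):
    # probs: (n, m) predicted class probabilities p_{i,c}
    # labels: (n,) integer class indices (equivalent to one-hot y_{i,c})
    n = probs.shape[0]
    # pick out log p_{i, y_i} for each example; clip for numerical safety
    return -np.mean(np.log(np.clip(probs[np.arange(n), labels], 1e-12, None)))

# toy usage
preds = np.array([2.5, 0.0, 2.1])
targets = np.array([3.0, -0.5, 2.0])
print(mse(preds, targets), mae(preds, targets))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))
```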
Regularization
TA explanation: bias terms account for the priors of the input data to a layer, so we do not want to impose any penalty on making them large. For example, if an input always has a lot more red, the bias term should absorb that offset; regularization is therefore applied to the weights only. (Edited from False to True based on https://piazza.com/class/j0vi72697xc49k?cid=1473)
- hyperparameter: the regularization strength $\lambda$
- there may be many different sets of weights W that classify the training examples correctly; regularization expresses a preference among them
- L1 norm: $R(W) = \sum_k\sum_l |W_{k,l}|$
- squared L2 norm (discourages large weights): $R(W) = \sum_k\sum_l W_{k,l}^2$
- Elastic net (L1+L2): $R(W)=\sum_k \sum_l \left(\beta W_{k,l}^2+|W_{k,l}|\right)$
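A minimal numpy sketch of these three regularizers and how the penalty enters the total objective (the variable names, the $\lambda$ and $\beta$ values, and the data-loss placeholder are assumptions for illustration; per the TA explanation above, only the weights W are penalized, never the biases):

```python
import numpy as np

def l1(W):
    # L1 norm: sum of absolute values of all weights
    return np.sum(np.abs(W))

def l2(W):
    # squared L2 norm: sum of squared weights, discourages large weights
    return np.sum(W ** 2)

def elastic_net(W, beta=0.5):
    # elastic net: beta-weighted squared L2 plus L1, element-wise
    return np.sum(beta * W ** 2 + np.abs(W))

# total objective: data loss + lambda * R(W); biases are excluded from R(W)
W = np.random.randn(10, 3073) * 0.01
lam = 0.1          # regularization strength (hyperparameter)
data_loss = 1.23   # placeholder for e.g. SVM or cross-entropy loss
total_loss = data_loss + lam * l2(W)
print(total_loss)
```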
advantages:
- expresses preferences over weights; e.g., the squared L2 norm prefers smaller, more diffuse weight vectors over a few large weights