Loss function (cost function)
The loss function quantifies our unhappiness with the predictions on the training set.
convention: the lower the loss, the better the performance
- Mean Squared Error (MSE): usually used for regression problems, $\frac{1}{n} \sum_{i=1}^n (f(x_i)-y_i)^2$
- Mean Absolute Error (MAE): regression problems, $\frac{1}{n} \sum_{i=1}^n |f(x_i)-y_i|$
- Cross-entropy loss: usually used for classification problems, $-\frac{1}{n} \sum_{i=1}^n \sum_{c=1}^m y_{i,c} \log p_{i,c}$, where $y_{i,c}$ is the one-hot label and $p_{i,c}$ the predicted probability of class $c$ for example $i$
classification: SVM (hinge) loss and Softmax (cross-entropy) loss, see Linear Classification
regression: MSE and MAE, as above; a minimal sketch of these losses follows
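As a concrete illustration, here is a minimal numpy sketch of the three losses above (the array names and toy values are assumptions for this example, not from the notes):

```python
import numpy as np

def mse(preds, targets):
    # Mean Squared Error: average of squared residuals
    return np.mean((preds - targets) ** 2)

def mae(preds, targets):
    # Mean Absolute Error: average of absolute residuals
    return np.mean(np.abs(preds - targets))

def cross_entropy(probs, labels):
    # probs: (n, m) predicted class probabilities p_{i,c}
    # labels: (n,) integer class indices (equivalent to one-hot y_{i,c})
    n = probs.shape[0]
    # pick out log p_{i, y_i} for each example; clip for numerical safety
    return -np.mean(np.log(np.clip(probs[np.arange(n), labels], 1e-12, None)))

# toy usage
preds = np.array([2.5, 0.0, 2.1])
targets = np.array([3.0, -0.5, 2.0])
print(mse(preds, targets), mae(preds, targets))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))
```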
Regularization
TA explanation: bias terms account for the priors of the input data to a layer, so we do not want to impose any penalty on making them large. For example, if an input always has a lot more red, the bias term should absorb that offset; regularization is therefore applied to the weights only. (Edited from False to True based on https://piazza.com/class/j0vi72697xc49k?cid=1473)
- hyperparameter: the regularization strength $\lambda$
- there may be many different sets of weights W that classify the training examples correctly; regularization expresses a preference among them
- L1 norm: $R(W) = \sum_k\sum_l |W_{k,l}|$
- squared L2 norm (discourages large weights): $R(W) = \sum_k\sum_l W_{k,l}^2$
- Elastic net (L1+L2): $R(W)=\sum_k \sum_l \left(\beta W_{k,l}^2+|W_{k,l}|\right)$
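A minimal numpy sketch of these three regularizers and how the penalty enters the total objective (the variable names, the $\lambda$ and $\beta$ values, and the data-loss placeholder are assumptions for illustration; per the TA explanation above, only the weights W are penalized, never the biases):

```python
import numpy as np

def l1(W):
    # L1 norm: sum of absolute values of all weights
    return np.sum(np.abs(W))

def l2(W):
    # squared L2 norm: sum of squared weights, discourages large weights
    return np.sum(W ** 2)

def elastic_net(W, beta=0.5):
    # elastic net: beta-weighted squared L2 plus L1, element-wise
    return np.sum(beta * W ** 2 + np.abs(W))

# total objective: data loss + lambda * R(W); biases are excluded from R(W)
W = np.random.randn(10, 3073) * 0.01
lam = 0.1          # regularization strength (hyperparameter)
data_loss = 1.23   # placeholder for e.g. SVM or cross-entropy loss
total_loss = data_loss + lam * l2(W)
print(total_loss)
```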
advantages:
- expresses preferences over weights; e.g., the squared L2 norm prefers smaller, more diffuse weight vectors over a few large weights