
$f(\bold x;W,\bold b)=W\bold x+\bold b$
Bias trick: extend the vector $\bold x$ with an extra 1, and extend the matrix $W$ with $\bold b$ as an extra column
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])  # append a column of 1s to every example
$f(\bold x;W)=W\bold x$
(this trick is less convenient to use in neural networks, where weights and biases are usually kept as separate parameters)
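A minimal numpy sketch of the bias trick (shapes and variable names here are illustrative, not from the notes): fold $\bold b$ into $W$ as an extra column and append a 1 to each example; the two formulations then give identical scores.

```python
import numpy as np

# Hypothetical sizes: C classes, D features, N examples.
C, D, N = 10, 3072, 5
W = np.random.randn(C, D)   # weights
b = np.random.randn(C)      # biases
X = np.random.randn(N, D)   # examples stored as rows

# Bias trick: append b as an extra column of W, and a 1 to every example.
W_ext = np.hstack([W, b[:, None]])        # shape (C, D+1)
X_ext = np.hstack([X, np.ones((N, 1))])   # shape (N, D+1)

scores_with_bias = X @ W.T + b            # f(x; W, b) = Wx + b
scores_trick = X_ext @ W_ext.T            # f(x; W) = Wx
assert np.allclose(scores_with_bias, scores_trick)
```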
Multiclass Support Vector Machine (SVM) classifier
hinge/max-margin loss
- hyperparameter $\Delta$
- The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of $\Delta$.
- classes whose score is lower than the correct class's score by more than $\Delta$ contribute nothing to the loss
- otherwise, the difference (shifted by the margin) is accumulated into the loss
- $\text{example }i: (x_i,y_i) \to s$
- $y_i$ are the ground truth labels
- the output vector s contains the scores for each class
Structured SVM [Weston Watkins 1999]:
- hinge $L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
- squared hinge loss: uses $\max(0,\cdot)^2$ instead, penalizing violated margins more strongly (quadratically)
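A small numpy sketch of the per-example hinge loss $L_i$ above (the function name and example numbers are made up for illustration):

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass SVM (hinge) loss for one example.

    scores: 1-D array of class scores s, y_i: index of the correct class,
    delta: margin hyperparameter. Signature is illustrative only.
    """
    margins = np.maximum(0, scores - scores[y_i] + delta)
    margins[y_i] = 0  # the correct class is excluded from the sum
    return margins.sum()

# Correct class 0 beats class 2 by more than delta (no penalty),
# but not class 1, so only class 1 contributes: max(0, 2.9 - 3.2 + 1) = 0.7
print(svm_loss_i(np.array([3.2, 2.9, -1.0]), y_i=0))
```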
One vs All (OVA): a separate binary SVM is trained for every class independently. This arguably simplest strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in In Defense of One-Vs-All Classification (pdf))
All-vs-All (AVA): a binary SVM for every pair of classes; the least common strategy
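For reference, a hedged OVA sketch using scikit-learn (not part of the original notes): `OneVsRestClassifier` fits one binary `LinearSVC` per class independently.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# One-vs-All: internally trains one binary linear SVM per class.
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict(X[:5]))
```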
observation