Model Selection
- moving beyond the classical least-squares fit of a linear model
- and relaxing the assumption that n (number of observations) >> p (number of predictors)
- we can increase prediction accuracy and interpretability by shrinking the space of predictors
- and decrease the overall variance of the model at the cost of a small increase in bias
- carefully pruning the feature space down to the most important features not only reduces computational cost by removing redundancies, but also makes the model more compact and interpretable (commonly referred to as feature selection)
Common procedures employed to achieve this
- Shrinkage: shrinking the estimated coefficients of all p predictors towards zero; this has been shown to reduce their variance
- Subset selection: selecting the subset of the p predictors thought to be most strongly related to the response, and fitting on that subset alone
- Dimensionality reduction: projecting the p-dimensional data onto M dimensions (M < p) using linear combinations of the predictors; this reduces complexity while capturing the directions of greatest variance (all three are sketched below)
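A minimal sketch of the three procedures using scikit-learn; the dataset and the choices of `alpha`, the subset size, and M = 5 are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA

# illustrative data: 20 predictors, only 5 of which matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Shrinkage: lasso pulls every coefficient towards zero (some exactly to zero)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))

# Subset selection: greedy forward search for a good subset of predictors
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5).fit(X, y)
print("selected predictors:", np.flatnonzero(sfs.get_support()))

# Dimensionality reduction: project the p-dimensional data onto M directions
Z = PCA(n_components=5).fit_transform(X)   # n x M matrix of linear combinations
pcr = LinearRegression().fit(Z, y)         # regress the response on the components
print("PCR training R^2:", pcr.score(Z, y))
```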
Choosing the Optimal Model
- using the above-mentioned techniques requires a metric with which to evaluate the resulting models and judge which technique actually performs better than the others
Common metrics used to estimate the test error
- (training set) mean squared error: $MSE = \frac{RSS}{n}$
- but this metric makes no adjustment to approximate the test error; it is based purely on the model's performance on the training data, which only improves as more and more predictors are included
- hence training MSE is not a good metric for choosing a model
- $C_p = \frac{1}{n}(RSS+2d\hat\sigma^2)$ : lower is better
- adds a penalty to the RSS that grows with the number of predictors
- where d is the number of predictors in the model
- $\hat\sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement (typically estimated from the full model containing all the predictors)
- $\text{adjusted }R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$ : higher is better
- the extra penalty targets noise variables: once the most important predictors are already captured in d, adding a noise variable barely reduces RSS but shrinks n-d-1, so adjusted $R^2$ decreases
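To make the behaviour of these metrics concrete, here is a minimal sketch over a sequence of nested OLS models on synthetic data (the data and the nesting order are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

n, p = 100, 10
X, y = make_regression(n_samples=n, n_features=p, n_informative=4,
                       noise=5.0, random_state=0)

def fit_rss(X, y):
    """Residual sum of squares of a least-squares fit."""
    model = LinearRegression().fit(X, y)
    resid = y - model.predict(X)
    return np.sum(resid ** 2)

# sigma^2 is estimated from the full model containing all p predictors
sigma2_hat = fit_rss(X, y) / (n - p - 1)
tss = np.sum((y - y.mean()) ** 2)

# evaluate nested models built from the first d predictors (d = 1..p)
for d in range(1, p + 1):
    rss = fit_rss(X[:, :d], y)
    mse = rss / n                                        # training MSE: never increases
    cp = (rss + 2 * d * sigma2_hat) / n                  # C_p: lower is better
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))   # higher is better
    print(f"d={d:2d}  MSE={mse:9.2f}  Cp={cp:9.2f}  adj R^2={adj_r2:.3f}")
```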
these methods assume that held-out test data is unavailable; in modern workflows we prefer validation-set and cross-validation regimes, which estimate the test error directly, so outside of some niche applications these metrics (other than the adjusted $R^2$ score) are seldom used (a minimal cross-validation sketch follows)
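A minimal sketch of the cross-validation alternative, assuming the same kind of illustrative synthetic data as above; each fold is held out once and plays the role of test data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# 5-fold CV: average held-out MSE is a direct estimate of the test error
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print("CV estimate of test MSE:", -scores.mean())
```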

Ridge and Lasso in LR - Regularisation
- two methods that either shrink all the coefficients towards zero (ridge) or can set some coefficients exactly to zero (lasso), thereby changing the relative importance of the features involved (contrasted in the sketch below)
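A minimal sketch contrasting the two penalties; the data and the penalty strengths (`alpha`) are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# illustrative data: 10 predictors, only 3 of which matter
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=5.0).fit(X, y)    # can zero out coefficients entirely

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
print("features dropped by lasso:", np.sum(lasso.coef_ == 0))
```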