Model Selection
- moving beyond the classical least-squares fit of a linear model
- and relaxing the assumption that n (number of observations) >> p (number of predictors)
- we can increase prediction accuracy and interpretability by shrinking the space of predictors
- and decrease the overall variance of the model at the cost of a small increase in bias
- carefully pruning the feature space down to the most important features not only reduces computational cost by removing redundancies, but also makes the model more compact and interpretable (commonly referred to as feature selection)
Common procedures employed to achieve this
- Shrinkage: shrinking the estimated coefficients of all p predictors towards zero; this has been shown to reduce their variance
- Subset selection: selecting the subset of the p predictors thought to be most strongly related to the response, and fitting on that subset alone
- Dimensionality reduction: projecting the p-dimensional data onto M dimensions (M < p) using linear combinations of the predictors; this reduces complexity while capturing the directions of greatest variance (all three are sketched below)
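A minimal sketch of the three procedures using scikit-learn; the dataset and the choices of `alpha`, the subset size, and M = 5 are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA

# illustrative data: 20 predictors, only 5 of which matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Shrinkage: lasso pulls every coefficient towards zero (some exactly to zero)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))

# Subset selection: greedy forward search for a good subset of predictors
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5).fit(X, y)
print("selected predictors:", np.flatnonzero(sfs.get_support()))

# Dimensionality reduction: project the p-dimensional data onto M directions
Z = PCA(n_components=5).fit_transform(X)   # n x M matrix of linear combinations
pcr = LinearRegression().fit(Z, y)         # regress the response on the components
print("PCR training R^2:", pcr.score(Z, y))
```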
Choosing the Optimal Model
- using the above-mentioned techniques requires a metric with which to evaluate the resulting models and judge which technique actually performs better than the others
Common metrics used to estimate the test error
- (training set) mean squared error: $MSE = \frac{RSS}{n}$
- but this metric makes no adjustment to approximate the test error; it is based purely on the model's performance on the training data, which only improves as more and more predictors are included
- hence training MSE is not a good metric for choosing a model
- $C_p = \frac{1}{n}(RSS+2d\hat\sigma^2)$ : lower is better
- adds a penalty to the RSS that grows with the number of predictors
- where d is the number of predictors in the model
- $\hat\sigma^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement (typically estimated from the full model containing all the predictors)
- $\text{adjusted }R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$ : higher is better
- the extra penalty targets noise variables: once the most important predictors are already captured in d, adding a noise variable barely reduces RSS but shrinks n-d-1, so adjusted $R^2$ decreases
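To make the behaviour of these metrics concrete, here is a minimal sketch over a sequence of nested OLS models on synthetic data (the data and the nesting order are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

n, p = 100, 10
X, y = make_regression(n_samples=n, n_features=p, n_informative=4,
                       noise=5.0, random_state=0)

def fit_rss(X, y):
    """Residual sum of squares of a least-squares fit."""
    model = LinearRegression().fit(X, y)
    resid = y - model.predict(X)
    return np.sum(resid ** 2)

# sigma^2 is estimated from the full model containing all p predictors
sigma2_hat = fit_rss(X, y) / (n - p - 1)
tss = np.sum((y - y.mean()) ** 2)

# evaluate nested models built from the first d predictors (d = 1..p)
for d in range(1, p + 1):
    rss = fit_rss(X[:, :d], y)
    mse = rss / n                                        # training MSE: never increases
    cp = (rss + 2 * d * sigma2_hat) / n                  # C_p: lower is better
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))   # higher is better
    print(f"d={d:2d}  MSE={mse:9.2f}  Cp={cp:9.2f}  adj R^2={adj_r2:.3f}")
```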
these methods assume that held-out test data is unavailable; in modern workflows we prefer validation-set and cross-validation regimes, which estimate the test error directly, so outside of some niche applications these metrics (other than the adjusted $R^2$ score) are seldom used (a minimal cross-validation sketch follows)
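A minimal sketch of the cross-validation alternative, assuming the same kind of illustrative synthetic data as above; each fold is held out once and plays the role of test data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# 5-fold CV: average held-out MSE is a direct estimate of the test error
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print("CV estimate of test MSE:", -scores.mean())
```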

Ridge and Lasso in LR - Regularisation
- two methods that either shrink all the coefficients towards zero (ridge) or can set some coefficients exactly to zero (lasso), thereby changing the relative importance of the features involved (contrasted in the sketch below)
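A minimal sketch contrasting the two penalties; the data and the penalty strengths (`alpha`) are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# illustrative data: 10 predictors, only 3 of which matter
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=5.0).fit(X, y)    # can zero out coefficients entirely

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
print("features dropped by lasso:", np.sum(lasso.coef_ == 0))
```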