First, we need to start by looking at Mean Squared Error. Let's go by its name: Mean Squared Error. The name itself tells us which components are involved:
Basically, we want a single measurement that averages the errors (how far off our predictions are from reality) into one metric (hence the 'mean'), and squares each individual error so that positive and negative mistakes don't cancel each other out (hence the 'squared'). This gives us a single number - basically a summary statistic - that captures the typical squared distance between our predictions and reality.
Now that we understand the intuition behind why we should care about MSE, let's look at two different ways MSE is applied in context. The way we define MSE depends on what we're trying to measure. What I mean by this is that MSE can measure two different things - prediction error and estimation error - and the formulas are actually different for each.
Let’s look at how MSE is used in each context mentioned:
The first way to look at MSE is when we're assessing a predictor. It measures prediction error - comparing predicted values to actual values. For regression, MSE is defined as:
$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
In the prediction context, MSE measures how far off our predictions are. Basically we're looking at the actual prediction errors themselves - the difference between what we predict and what the data actually say. For example, say we predict temperatures for several cities. Each prediction has some error - the difference between the predicted and the actual temperature. We square these errors and take the average.
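To make this concrete, here's a minimal sketch in Python (the temperature values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical example: actual vs. predicted temperatures (°C) for five cities.
y_actual = np.array([21.0, 25.5, 18.2, 30.1, 27.4])
y_pred   = np.array([20.3, 26.0, 19.5, 29.0, 28.1])

# Prediction MSE: average the squared errors over the n data points we have.
errors = y_actual - y_pred
mse = np.mean(errors ** 2)
print(f"Prediction MSE: {mse:.3f}")  # typical squared distance between prediction and reality
```

Note that everything here is computed from the fixed data points in front of us - there is no averaging over anything random.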
The second way to look at MSE is in the context of an estimator - when estimating a parameter from a sample. Here MSE measures how far our estimate $\hat{\theta}$ is from the true parameter $\theta$:
$$ MSE(\hat{\theta}) := E_\theta[(\hat{\theta} - \theta)^2] \tag{1} $$
In the estimation context, MSE measures how far our parameter estimates are from the truth. Basically we're looking at how close an estimate computed from a sample gets to the true population parameter - the difference between what we calculate from our data and what the actual parameter is. For example, say we want to estimate the average height of all students at Berkeley. We measure a sample of 100 students and calculate their mean height. This sample mean is our estimate, but it probably differs from the true population mean. We square that difference and take its expected value over all possible samples.
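Here's a small simulation sketch of this idea, assuming a hypothetical population with a true mean height of 170 cm and standard deviation of 8 cm; the expectation in equation (1) is approximated by averaging over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: true mean height 170 cm, standard deviation 8 cm.
theta_true = 170.0
sigma = 8.0
n = 100               # sample size (100 students, as in the example)
num_samples = 10_000  # repeated samples used to approximate the expectation

# Each row is one sample of 100 students; each row's mean is one theta_hat.
theta_hats = rng.normal(theta_true, sigma, size=(num_samples, n)).mean(axis=1)

# Estimation MSE: expected squared distance between theta_hat and the true theta,
# approximated by averaging over the simulated samples.
mse_estimator = np.mean((theta_hats - theta_true) ** 2)
print(f"Simulated MSE of the sample mean: {mse_estimator:.4f}")
print(f"Theory (sigma^2 / n):             {sigma**2 / n:.4f}")
```

For the sample mean, the simulated value should land close to $\sigma^2/n$, since the sample mean is unbiased and its variance is $\sigma^2/n$.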
Given these two contexts, we've seen that MSE gives us a single number measuring how far off we are - whether we're predicting actual values or estimating parameters. This is why MSE is useful: it's a metric that tells us how good our model is, for instance whether it's underfitting or overfitting the data.
There's an important reason why we had to distinguish between these contexts. What I mean by this is that when we look at the mathematical equations of MSE from each context, only in one context does decomposing MSE in terms of bias and variance make sense. Which one might that be? Let’s go back and look at the actual mathematical formulas:
In prediction, we defined MSE as:
$$ MSE := \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
In estimation, we defined MSE as:
$$ MSE(\hat{\theta}) := E_\theta[(\hat{\theta} - \theta)^2] $$
When we look at MSE in the prediction context, notice that it's just a single error number computed from the actual data we have. Said another way, the prediction MSE is calculated on fixed data points - the $y_i$'s and $\hat{y}_i$'s don't change. They're the specific values and predictions from our dataset.
On the other hand, in the estimation context MSE is defined in terms of an expected value - meaning we're averaging over all possible samples. Every time we draw a sample, we get a different $\hat{\theta}$, which makes $\hat{\theta}$ a random variable. And once we see $\hat{\theta}$ as a random variable, it has statistical properties: an expected value $E_\theta[\hat{\theta}]$ and a variance $Var(\hat{\theta})$. We can ask: where is it centered? How much does it vary? These are exactly the questions that lead to bias and variance. That's why the MSE decomposition only makes sense in the estimation context - we need the randomness of $\hat{\theta}$ for bias and variance to exist.
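We can even check this numerically. The standard decomposition is $MSE(\hat{\theta}) = Bias(\hat{\theta})^2 + Var(\hat{\theta})$, and the sketch below verifies it by simulation, using a deliberately biased estimator (a shrunken sample mean, chosen only so the bias term is nonzero):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical setup as before: true mean 170 cm, sd 8 cm, samples of size 100.
theta_true, sigma, n = 170.0, 8.0, 100
num_samples = 100_000

samples = rng.normal(theta_true, sigma, size=(num_samples, n))

# A deliberately biased estimator: shrink the sample mean toward zero by 1%.
theta_hats = 0.99 * samples.mean(axis=1)

bias = theta_hats.mean() - theta_true          # E[theta_hat] - theta
variance = theta_hats.var()                    # Var(theta_hat)
mse = np.mean((theta_hats - theta_true) ** 2)  # E[(theta_hat - theta)^2]

print(f"bias^2 + variance = {bias**2 + variance:.4f}")
print(f"MSE               = {mse:.4f}")        # the two should match
```

The decomposition works precisely because $\hat{\theta}$ varies from sample to sample; in the prediction formula there is no such randomness to split into a bias piece and a variance piece.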