Suppose we have to decompose Mean Squared Error (MSE) in terms of a Bias-Variance tradeoff. How should we get started?

Mean Squared Error (MSE)

First, we need to start by looking at Mean Squared Error itself, taking the name word by word: each of 'mean', 'squared', and 'error' tells us something about what the metric does.

Basically, we want a single measurement of how far off our predictions are from reality (the 'error'), one that squares each individual error so positive and negative mistakes don't cancel out (the 'squared'), and then averages those squared errors into one number (the 'mean'). The result is a single summary statistic that captures the typical squared distance between our predictions and reality.

MSE in Different Contexts

Now that we understand the intuition behind why we should care about MSE, let's look at two different ways MSE is applied in context. The way we define MSE depends on what we're trying to measure. What I mean by this is that MSE can measure two different things, prediction error and estimation error, and the formulas are actually different for each.

Let’s look at how MSE is used in each context mentioned:

MSE as a Measurement of Prediction Error

The first way to look at MSE is when we're assessing a predictor. It measures prediction error - comparing predicted values to actual values. For regression, MSE is defined as:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

In the prediction context: MSE measures how far off our predictions are. Basically we're looking at the actual prediction errors themselves - the difference between what our model predicts and what the actual data say. For example, say we predict temperatures for several cities. Each prediction has some error, the difference between the predicted and actual temperature. We square these errors and take the average. In the formula above, these are the predicted $\hat{y}_i$'s compared against the actual $y_i$'s.
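To make the formula concrete, here is a minimal sketch in Python. The city temperatures are made-up numbers, used only to illustrate the computation:

```python
import numpy as np

# Hypothetical example: actual vs. predicted temperatures (°C) for five cities.
# These numbers are invented purely to illustrate the formula.
y_actual = np.array([21.0, 18.5, 25.0, 30.2, 15.8])
y_pred   = np.array([20.0, 19.0, 24.0, 31.0, 17.0])

errors = y_actual - y_pred     # individual prediction errors (y_i - y_hat_i)
mse = np.mean(errors ** 2)     # square each error, then average

print(f"Prediction MSE: {mse:.3f}")
```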

MSE as a Measurement of Estimation Error

The second way to look at MSE is in the context of an estimator - when estimating a parameter from a sample. Here MSE measures how far our estimate $\hat{\theta}$ is from the true parameter $\theta$:

$$ MSE(\hat{\theta}) := E_\theta[(\hat{\theta} - \theta)^2] \tag{1} $$

In the estimation context: MSE measures how far our parameter estimates are from the truth. Basically we're looking at how close an estimate computed from a sample gets to the true population parameter - the difference between what we calculate from our data and what the actual parameter is. For example, say we want to estimate the average height of all students at Berkeley. We measure a sample of 100 students and calculate their mean height. This sample mean is our estimate, but it will generally differ from the true population mean. MSE is the expected value of that squared difference, averaged over all the samples we could have drawn.
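Because the expectation in equation (1) averages over all possible samples, we can only approximate it by simulation. Here is a minimal sketch, assuming a hypothetical Normal population of student heights (mean 170 cm, standard deviation 10 cm); the numbers are illustrative, not real data:

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean = 170.0     # hypothetical true population mean (theta)
sd = 10.0             # hypothetical population standard deviation
n = 100               # students per sample
num_samples = 10_000  # repeated samples used to approximate the expectation

# Each repetition: draw a sample of 100 students and compute its mean (theta_hat).
theta_hats = rng.normal(loc=true_mean, scale=sd, size=(num_samples, n)).mean(axis=1)

# MSE(theta_hat) = E[(theta_hat - theta)^2], approximated by averaging over repetitions.
mse_estimator = np.mean((theta_hats - true_mean) ** 2)

print(f"Approximate MSE of the sample mean: {mse_estimator:.3f}")  # near sd^2 / n = 1
```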

Given these different contexts, we see that MSE gives us a single number to measure how far off we are - whether we're predicting actual values or estimating parameters. This is why MSE is useful: it's a metric that tells us how well our model is doing, whether it's underfitting or overfitting the data.

Which Context for MSE Decomposition?

There's an important reason why we had to distinguish between these contexts: when we look at the mathematical formula of MSE from each context, only in one of them does decomposing MSE in terms of bias and variance make sense. Which one might that be? Let's go back and look at the actual formulas:

In prediction, we defined MSE as:

$$ MSE := \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$

In estimation, we defined MSE as:

$$ MSE(\hat{\theta}) := E_\theta[(\hat{\theta} - \theta)^2] $$

When we look at MSE in the prediction context, notice that MSE is just a single error measurement computed on the actual data we have. Said another way, the prediction MSE is calculated on fixed data points: the $y_i$'s and $\hat{y}_i$'s don't change, because they're the specific values and predictions from our dataset.

On the other hand, in the estimation context MSE is defined in terms of an expected value, meaning we're averaging over all possible samples. Every time we sample, we get a different $\hat{\theta}$, which makes $\hat{\theta}$ a random variable. And once we see $\hat{\theta}$ as a random variable, it has statistical properties: an expected value $E_\theta[\hat{\theta}]$ and a variance $Var(\hat{\theta})$. We can ask: where is it centered? How much does it vary? These are exactly the questions that lead to bias and variance. That's why the MSE decomposition only makes sense in the estimation context: we need the randomness of $\hat{\theta}$ for bias and variance to exist.
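A quick sketch makes this concrete. Reusing the hypothetical height population from the earlier example, each sample produces a different $\hat{\theta}$, and over many repetitions we can look at where those estimates are centered and how much they spread:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sd, n = 170.0, 10.0, 100   # same hypothetical height population as above

# A handful of independent samples: each one gives a different estimate.
for i in range(5):
    theta_hat = rng.normal(true_mean, sd, size=n).mean()
    print(f"Sample {i + 1}: theta_hat = {theta_hat:.2f}")

# Over many repetitions, theta_hat has a center (its expectation) and a spread (its variance).
many_theta_hats = rng.normal(true_mean, sd, size=(10_000, n)).mean(axis=1)
print(f"Empirical E[theta_hat]   = {many_theta_hats.mean():.2f}")
print(f"Empirical Var(theta_hat) = {many_theta_hats.var():.3f}")
```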

Bias and Variance

Now that we know bias and variance belong in the estimation context, let's take a side step and look at their definitions. What are bias and variance? Let's start with the words themselves.

Different Representations of Bias

Colloquially: Bias is about being consistently off in the same way. It means that we are being systematically prejudiced or leaning in one direction. For example, imagine shooting arrows with your sight misaligned: all your shots cluster tightly, but the whole cluster lands to the right of the bullseye. No matter how many times you shoot, you’ll keep missing in that direction. The key word is systematic: bias isn’t random error, but a consistent distortion in one direction that arises from the assumptions built into a model.

Mathematically: Bias is the difference between where our estimator is centered (its expectation across all possible samples) and the true parameter value. Bias is defined as:

$$ Bias(\hat{\theta}) := E_\theta[\hat{\theta}] - \theta \tag{2} $$

Here $\hat{\theta}$ is the estimator (like a sample mean), $E_\theta[\hat{\theta}]$ is the expected value of our estimator across all possible samples, and $\theta$ is the true parameter.
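The sample mean happens to be unbiased, so to see bias in action, here is a sketch using a classic example that is not from the discussion above: the "plug-in" variance estimator that divides by $n$, whose expectation is $\frac{n-1}{n}\sigma^2$ and therefore systematically underestimates the true variance:

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 10.0 ** 2   # hypothetical population: Normal with variance 100
n = 5                  # small samples make the bias easy to see

# The plug-in variance estimator divides by n (ddof=0), so on average it
# lands at (n-1)/n * sigma^2, below the true variance.
samples = rng.normal(0.0, 10.0, size=(100_000, n))
estimates = samples.var(axis=1, ddof=0)

empirical_bias = estimates.mean() - true_var
print(f"E[theta_hat] ~ {estimates.mean():.1f}, true theta = {true_var:.1f}")
print(f"Empirical bias ~ {empirical_bias:.1f}")   # close to -sigma^2 / n = -20
```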

Different Representations of Variance

Colloquially: Variance is about variation, spread, and inconsistency. Basically, variance shows how sensitive our model is to the specific data we happened to see. If it's too sensitive, we probably "overfit," meaning our model tracks nuances of the sample that don't generalize. For example, imagine your sight is perfectly aligned, but your hand shakes every time you release an arrow. Sometimes you hit left, sometimes right, sometimes high or low. On average, your shots center on the bullseye, but they're scattered all over the target. That spread is variance: it shows how sensitive your estimator is to the specific data you happen to see. The key word here is "sensitivity": small changes in our data lead to big changes in our estimate, because it's too dependent on the specific sample we happened to observe.

Mathematically: variance is the average squared distance of our estimator from its own expected value across all possible samples. Variance is defined as:

$$ Var(\hat{\theta}) := E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2] \tag{3} $$

Again, $\hat{\theta}$ is the estimator and $E_\theta[\hat{\theta}]$ is its expectation; notice that the true parameter $\theta$ doesn't appear here. Variance measures how much $\hat{\theta}$ deviates from its own center on average, i.e., how much it fluctuates around that expectation.
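To see variance in isolation, here is a sketch (again using the hypothetical height population) showing how the spread of the sample mean shrinks as the sample size grows, since $Var(\bar{x}) = \sigma^2 / n$:

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, sd = 170.0, 10.0   # same hypothetical height population

# Var(sample mean) = sigma^2 / n: larger samples make the estimator less
# sensitive to which particular individuals we happened to observe.
for n in (10, 100, 1000):
    theta_hats = rng.normal(true_mean, sd, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:4d}: empirical Var(theta_hat) = {theta_hats.var():.3f} "
          f"(theory: {sd**2 / n:.3f})")
```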

How They Combine

These two sources of error — bias and variance — add together to explain our total miss distance. Going back to our earlier archery analogy:

When we measure mean squared error, we’re really measuring both at the same time: the spread of our estimator around its center (variance) and how far that center itself is from the truth (bias). In other words, how far our estimator tends to drift away from the truth (bias), and how much it wobbles around its own center (variance). MSE captures the whole picture because both effects contribute to how far off we are from reality.
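As a sketch of how the two pieces combine algebraically, one standard way to see it is to expand equation (1) by adding and subtracting $E_\theta[\hat{\theta}]$ inside the square:

$$
\begin{aligned}
MSE(\hat{\theta}) &= E_\theta\big[(\hat{\theta} - \theta)^2\big] \\
&= E_\theta\Big[\big((\hat{\theta} - E_\theta[\hat{\theta}]) + (E_\theta[\hat{\theta}] - \theta)\big)^2\Big] \\
&= E_\theta\big[(\hat{\theta} - E_\theta[\hat{\theta}])^2\big] + 2\,(E_\theta[\hat{\theta}] - \theta)\,E_\theta\big[\hat{\theta} - E_\theta[\hat{\theta}]\big] + (E_\theta[\hat{\theta}] - \theta)^2 \\
&= Var(\hat{\theta}) + Bias(\hat{\theta})^2
\end{aligned}
$$

The cross term vanishes because $E_\theta[\hat{\theta} - E_\theta[\hat{\theta}]] = 0$, so by definitions (2) and (3) the total squared miss is exactly the variance plus the squared bias.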