⚠️ Estimation is the first hurdle for most students, so let me take you through the logic STEP-BY-STEP. Hopefully by the end you will know intuitively what the equations are doing and be able to derive them yourself.
We have our equation
$$ y_i = \beta_0 + \beta_1 x_i + u_i $$
[IMPORTANT] The next question is: can we find $\{\beta_0, \beta_1\}$?
How are we going to estimate the parameters, i.e. find $\{\hat{\beta_0}, \hat{\beta_1}\}$?


We can look at the graph, but maths has no eyes, and economists want a bit more granularity than eyeballing a line, so we need to try something else.
Let’s go back to our example and try to see if we can find something useful to estimate $\{\hat{\beta_0}, \hat{\beta_1}\}$.
Assume that we randomly pick 2 sets of $\{\hat{\beta_0}, \hat{\beta_1}\}$ and draw out the lines they produce.

I think we can intuitively see that the yellow line fits better than the blue one, but we need a way to formalise this intuition.
One way to evaluate the fit is to look at the distance between our observations and the line (the predicted values).

If we can add up the distances between the observations and the line, aka the errors, then we can compare two numeric values and conclude that the yellow line is a better fit.
But the errors take both positive and negative values, and we don’t want them to cancel out when we sum them up, so we square each error first.
⭐️ Putting our intuition together, we can construct a statistic called the Sum of Squared Residuals, where “residual” is just another name for the error/distance between the line and the observation.
$$ SSR = u^2_1 + u^2_2 + u^2_3 + ... $$
In the example:
$SSR_{yellow} = (−2.8477)^2 + (+2.5275)^2 + (−1.7413)^2 = 17.529$
$SSR_{blue} = (−5.3477)^2 + (+3.5275)^2 + (+2.2587)^2 = 46.142$
Yellow line’s SSR is smaller → better fit.
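If you want to check these numbers yourself, here is a minimal Python sketch that just squares and sums the residual values listed above (small differences in the last digit are rounding):

```python
# Residuals read off the example above: the distances between each
# observation and the candidate line.
residuals_yellow = [-2.8477, 2.5275, -1.7413]
residuals_blue = [-5.3477, 3.5275, 2.2587]

# SSR = sum of squared residuals
ssr_yellow = sum(u**2 for u in residuals_yellow)  # ≈ 17.53
ssr_blue = sum(u**2 for u in residuals_blue)      # ≈ 46.14

print(ssr_yellow, ssr_blue)
```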
‼️You should fully understand the example before continuing, because the text is about to get a bit dry and an intuitive understanding is critical for what follows.
‼️Another important note: pay close attention to the notation and try to distinguish between true values and estimated values.
In the example we mentioned that we can use the Sum of Squared Residuals (SSR, sometimes called RSS) to compare the fit of two $\{\hat{\beta_0}, \hat{\beta_1}\}$ estimates.
Let’s dive a bit deeper into the residuals. Recall the model equation in estimation form:
$y_i = \hat{\beta_0} + \hat{\beta_1} x_i + \hat{u}_i$
$\hat{u_i}$ is the residual
we can rewrite the equation as
$$ \hat{u}_i = y_i - \hat{\beta_0} - \hat{\beta_1} x_i $$
Important note on notation: $u_i$ (no hat) is the unobservable error in the true model, while $\hat{u}_i$ (with a hat) is the residual we compute from our estimates; they are not the same object.
Next, square the residuals and add them up across the whole sample, from $1$ to $n$:
$$ \mathrm{SSR}(\hat\beta_0,\hat\beta_1)= \sum_{i=1}^n \hat{u}^2_i = \sum_{i=1}^n \big(y_i - \hat\beta_0 - \hat\beta_1 x_i\big)^2 $$
In some texts they take the mean and call it the Mean Squared Error, but that is just adding $\frac{1}{n}$ in front. For the subsequent calculations it does not matter whether you include it or not → if your prof likes it, add it; if not, you can decide for yourself.
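To make the formula concrete, here is a short Python sketch of SSR as a function of the candidate pair. The x and y arrays are made up purely for illustration; the MSE function is included to show that the $\frac{1}{n}$ factor does not change which pair fits best:

```python
import numpy as np

def ssr(b0, b1, x, y):
    """Sum of squared residuals for the candidate estimates (b0, b1)."""
    residuals = y - b0 - b1 * x
    return np.sum(residuals**2)

def mse(b0, b1, x, y):
    """Mean squared error: SSR scaled by 1/n, so it has the same minimiser."""
    return ssr(b0, b1, x, y) / len(y)

# Hypothetical education/wage-style data, for illustration only
x = np.array([10.0, 12.0, 16.0])
y = np.array([15.0, 20.0, 30.0])

print(ssr(1.0, 1.8, x, y))   # fit of one candidate pair
print(ssr(-4.0, 2.1, x, y))  # another pair; the smaller SSR is the better fit
```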
Now that we know how to compare the fit of the estimates, we want to use maths to find the BEST $\{\hat{\beta_0}, \hat{\beta_1}\}$ pair.
We have a nice equation for $SSR$, and it is a continuous, differentiable function of the estimates (in our wage and education example). Let’s plot how SSR changes when we vary $\hat{\beta_1}$.

Can you see that there’s a minimum?
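Here is a sketch of how you could produce a plot like that yourself, again with made-up data: fix $\hat{\beta_0}$ at some value and sweep $\hat{\beta_1}$ over a grid, computing SSR at each point.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, for illustration only
x = np.array([10.0, 12.0, 16.0])
y = np.array([15.0, 20.0, 30.0])

b0 = -4.0  # hold the intercept fixed while we vary the slope
b1_grid = np.linspace(0.0, 4.0, 200)
ssr_values = [np.sum((y - b0 - b1 * x) ** 2) for b1 in b1_grid]

plt.plot(b1_grid, ssr_values)
plt.xlabel(r"$\hat{\beta}_1$")
plt.ylabel("SSR")
plt.title("SSR is a parabola in the slope: one clear minimum")
plt.show()
```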
Recall your 1st year maths! If we have a differentiable function $f(x)$, how do we find the $x$ that gives us the minimum value of $f(x)$?
First order condition! Set the derivative to zero and solve: $f'(x) = 0$.
Let’s apply this logic to SSR
$$ \begin{aligned}\mathrm{SSR}(\beta_0,\beta_1) &= \sum_{i=1}^n u_i^2 = \sum_{i=1}^n \big(y_i - \beta_0 - \beta_1 x_i\big)^2 \\[6pt]\frac{\partial \mathrm{SSR}}{\partial \beta_0}&= -2\sum_{i=1}^n \big(y_i-\beta_0-\beta_1 x_i\big) = 0 \\[6pt]\frac{\partial \mathrm{SSR}}{\partial \beta_1}&= -2\sum_{i=1}^n x_i \big(y_i-\beta_0-\beta_1 x_i\big) = 0\end{aligned} $$
⚠️ Note that here we don’t use $\hat{\cdot}$ because at this stage $\beta_0$ and $\beta_1$ are the unknown values we are solving for, not estimates yet. It is confusing, I know.
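If you would rather let the computer check the calculus, here is a SymPy sketch that builds the SSR for a tiny symbolic sample, takes both partial derivatives, and solves the first-order conditions. It is only a sanity check of the logic above, not a required step:

```python
import sympy as sp

b0, b1 = sp.symbols("beta0 beta1")
xs = sp.symbols("x1:4")  # three observations keep the symbolic solve readable
ys = sp.symbols("y1:4")

# SSR as a function of the candidate parameters
SSR = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(xs, ys))

# First-order conditions: both partial derivatives set to zero
foc = [sp.diff(SSR, b0), sp.diff(SSR, b1)]
solution = sp.solve(foc, [b0, b1], dict=True)[0]

# The simplified slope matches (sum x_i y_i - n*xbar*ybar)/(sum x_i^2 - n*xbar^2)
print(sp.simplify(solution[b1]))
```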
Let’s derive the estimates:
Expand and rearrange to get the normal equations, Eq 1 and Eq 2:
$$ \begin{aligned}\sum_{i=1}^n y_i &= n\,\beta_0 + \beta_1 \sum_{i=1}^n x_i, \\\sum_{i=1}^n x_i y_i &= \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2\end{aligned} $$
From Eq 1 (divide both sides by $n$ and rearrange):
$$ \beta_0 = \bar y - \beta_1 \bar x $$
Substitute this $\beta_0$ equation into Eq 2:
$$ \begin{aligned} \sum x_i y_i &= (\bar y-\beta_1\bar x)\sum x_i + \beta_1 \sum x_i^2 \\&= n\bar x\,\bar y - \beta_1 n\bar x^2 + \beta_1 \sum x_i^2\\ \sum x_i y_i - n\bar x\,\bar y &= \beta_1\Big(\sum x_i^2 - n\bar x^2\Big)\\ \hat \beta_1 &= \frac{\sum x_i y_i - n\bar x\,\bar y}{\sum x_i^2 - n\bar x^2} \end{aligned} $$
Now recognise the following two identities (try proving them yourself; ask ChatGPT if you get stuck):
$$ \sum (x_i-\bar x)^2 = \sum x_i^2 - n\bar x^2 \\ \sum (x_i-\bar x)(y_i-\bar y) = \sum x_i y_i - n\bar x\,\bar y $$
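If you want to see one worked out, the first identity is just expanding the square and using $\sum x_i = n\bar x$:

$$ \sum (x_i-\bar x)^2 = \sum x_i^2 - 2\bar x \sum x_i + n\bar x^2 = \sum x_i^2 - 2n\bar x^2 + n\bar x^2 = \sum x_i^2 - n\bar x^2 $$

The second identity follows from the same kind of expansion.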
We have
$$ \hat\beta_1 = \frac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2} $$
which is equivalent to
$$ \hat\beta_1 = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} $$
(the $\frac{1}{n-1}$ normalisers in the sample covariance and sample variance cancel out).
Plugging $\hat \beta_1$ into the $\beta_0$ equation:
$$ \hat \beta_0 = \bar y - \hat \beta_1 \bar x $$
That is the full chain: FOC → normal equations → $\beta_0$ equation → solve for $\hat \beta_1$ → back-substitute for $\hat \beta_0$
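To close the loop, here is a short numpy sketch of the final formulas in action, with made-up data; np.polyfit is included only as an independent cross-check, not as part of the derivation:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([10.0, 12.0, 16.0, 13.0, 18.0])
y = np.array([15.0, 20.0, 30.0, 22.0, 33.0])

# beta1_hat = Cov(x, y) / Var(x), written with the centred sums
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = ybar - beta1_hat * xbar (the back-substitution step)
b0_hat = y.mean() - b1_hat * x.mean()

print(b0_hat, b1_hat)
print(np.polyfit(x, y, 1))  # returns [slope, intercept]; should agree
```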