Using examples about Pittsburgh!

Probability

We ask: what is the probability of observing data we haven’t observed yet (“data” means a random variable $X$) falling in a range $[a, b]$ (if the random variable is continuous), or of observing data equal to $x$ (if the random variable is discrete)? To answer that, we follow up by asking “what is the distribution of the data (random variable)?”. Then we can answer: $ℙ(\text{data}|\text{distribution})$. For the continuous and discrete cases, we write $ℙ(X\in[a,b]|\text{distribution})$ and $ℙ(X=x|\text{distribution})$, respectively.

To answer these questions, you would need to give me the data of heights and weights of people in Pittsburgh, respectively. Oftentimes, however, we don’t actually know the true distribution and want to find it. So we gather enough data from the distribution and try to estimate its parameters. This is where posterior probability comes in.
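
To make this concrete, here is a minimal sketch of how these two probabilities could be computed once a distribution is assumed. The distributions and parameters below (a Normal for heights, a Poisson for a count) are made up for illustration, not real Pittsburgh statistics:

```python
# A minimal sketch of P(X in [a, b]) and P(X = x) under assumed distributions.
# All parameters are made up for illustration.
from scipy.stats import norm, poisson

# Continuous case: suppose heights (in inches) follow Normal(mean=69, std=3).
height_dist = norm(loc=69, scale=3)
a, b = 70, 74
p_range = height_dist.cdf(b) - height_dist.cdf(a)  # P(a <= X <= b) via the CDF
print(f"P(70 <= height <= 74) = {p_range:.3f}")

# Discrete case: suppose the number of 6'2"-tall people you meet in a day
# follows Poisson(rate=2).
count_dist = poisson(mu=2)
p_point = count_dist.pmf(3)  # P(X = 3) via the PMF
print(f"P(count = 3) = {p_point:.3f}")
```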


Posterior and Likelihood

We ask “what’s the probability that this distribution is the true underlying distribution?”. To answer that, we follow up by asking “what data has this distribution generated?”, which means we’ve already observed some data (evidence). Then we can answer: $ℙ(\text{distribution}|\text{data})$, which we call the “posterior” probability.

To answer these questions, you would need to give me the data of 6’2”-tall people and 160-pound people in all towns, respectively. Let’s consider the 6’2”-tall people. If you tell me how many 6’2”-tall people there are in each of 100 possible towns, I can find the frequency with which a 6’2”-tall person comes from each town, including Pittsburgh. This is essentially Bayes’ theorem! A small counting sketch follows; after it, let’s formalize the idea:
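
Here is that counting idea as a minimal sketch. The towns and counts are made up, and it implicitly assumes the 6’2”-tall person we observe is equally likely to be any individual in the pooled 6’2” population:

```python
# A minimal counting sketch with made-up numbers: the chance that a randomly
# observed 6'2"-tall person comes from a given town is that town's share of
# all 6'2"-tall people.
counts = {"Pittsburgh": 1200, "Cleveland": 900, "Columbus": 1500}  # hypothetical
total = sum(counts.values())

posterior = {town: n / total for town, n in counts.items()}
print(posterior["Pittsburgh"])  # P(town = Pittsburgh | person is 6'2") = 1200/3600
```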

We ask ourselves: the data could have come from Pittsburgh’s distribution with probability $ℙ(\text{distribution}|\text{data})$ or from any other town’s with probability $ℙ(\text{not distribution}|\text{data})$. What we want is the first one, which can be written like this:

$$
ℙ(\text{distribution}|\text{data})=\frac{ℙ(\text{distribution \& data})}{ℙ(\text{data})}
$$
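
Since the data must have come either from Pittsburgh’s distribution or from some other town’s, the denominator can be expanded with the law of total probability:

$$
ℙ(\text{data})=ℙ(\text{data}|\text{distribution})\,ℙ(\text{distribution})+ℙ(\text{data}|\text{not distribution})\,ℙ(\text{not distribution})
$$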

What do the numerator and denominator mean? In other words, when we talk about the probability of the data, or of the data and the distribution together (the joint probability), what do we actually mean? The frequentist interpretation of probability says: count the number of times the desired event happened and divide it by the total number of possible events. Doing this for $ℙ(\text{data})$ means we need to find out how many times this specific data occurs and divide it by the total number of times any data could occur. For example, if the data is “among 100 random people, 40 of them were 6’2” tall”, then the total events that could happen in this sample are: