Using examples about Pittsburgh!

Probability

We ask: what is the probability of observing data we haven’t observed yet (“data” means a random variable $X$) falling in a range $[a, b]$ (if the random variable is continuous), or of observing data equal to $x$ (if the random variable is discrete)? To answer that, we follow up by asking “what is the distribution of the data (random variable)?”. Then we can answer: $ℙ(\text{data}|\text{distribution})$. For the continuous and discrete cases, we write $ℙ(X\in[a,b]|\text{distribution})$ and $ℙ(X=x|\text{distribution})$, respectively.

To answer these questions, you would need to give me the data of heights and weights of people in Pittsburgh, respectively. Oftentimes, however, we don’t actually know the true distribution and want to find it. So we gather enough data from the distribution and try to estimate its parameters. This is where posterior probability comes in.
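
To make this concrete, here is a minimal sketch of how these two probabilities could be computed once a distribution is assumed. The distributions and parameters below (a Normal for heights, a Poisson for a count) are made up for illustration, not real Pittsburgh statistics:

```python
# A minimal sketch of P(X in [a, b]) and P(X = x) under assumed distributions.
# All parameters are made up for illustration.
from scipy.stats import norm, poisson

# Continuous case: suppose heights (in inches) follow Normal(mean=69, std=3).
height_dist = norm(loc=69, scale=3)
a, b = 70, 74
p_range = height_dist.cdf(b) - height_dist.cdf(a)  # P(a <= X <= b) via the CDF
print(f"P(70 <= height <= 74) = {p_range:.3f}")

# Discrete case: suppose the number of 6'2"-tall people you meet in a day
# follows Poisson(rate=2).
count_dist = poisson(mu=2)
p_point = count_dist.pmf(3)  # P(X = 3) via the PMF
print(f"P(count = 3) = {p_point:.3f}")
```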


Posterior and Likelihood

We ask “what’s the probability that this distribution is the true underlying distribution?”. To answer that, we follow up by asking “what data has this distribution generated?”, which means we’ve already observed some data (evidence). Then we can answer: $ℙ(\text{distribution}|\text{data})$, which we call the “posterior” probability.

To answer these questions, you would need to give me the data of 6’2”-tall people and 160-pound people in all towns, respectively. Let’s consider the 6’2”-tall people. If you tell me how many 6’2”-tall people there are in each of 100 possible towns, I can find the frequency with which a 6’2”-tall person comes from each town, including Pittsburgh. This is essentially Bayes’ theorem! A small counting sketch follows; after it, let’s formalize the idea:
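
Here is that counting idea as a minimal sketch. The towns and counts are made up, and it implicitly assumes the 6’2”-tall person we observe is equally likely to be any individual in the pooled 6’2” population:

```python
# A minimal counting sketch with made-up numbers: the chance that a randomly
# observed 6'2"-tall person comes from a given town is that town's share of
# all 6'2"-tall people.
counts = {"Pittsburgh": 1200, "Cleveland": 900, "Columbus": 1500}  # hypothetical
total = sum(counts.values())

posterior = {town: n / total for town, n in counts.items()}
print(posterior["Pittsburgh"])  # P(town = Pittsburgh | person is 6'2") = 1200/3600
```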

We ask ourselves: the data could have come from Pittsburgh’s distribution with probability $ℙ(\text{distribution}|\text{data})$ or from any other town’s with probability $ℙ(\text{not distribution}|\text{data})$. What we want is the first one, which can be written like this:

$$
ℙ(\text{distribution}|\text{data})=\frac{ℙ(\text{distribution \& data})}{ℙ(\text{data})}
$$
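
Since the data must have come either from Pittsburgh’s distribution or from some other town’s, the denominator can be expanded with the law of total probability:

$$
ℙ(\text{data})=ℙ(\text{data}|\text{distribution})\,ℙ(\text{distribution})+ℙ(\text{data}|\text{not distribution})\,ℙ(\text{not distribution})
$$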

What do the numerator and denominator mean? In other words, when we talk about the probability of the data, or of the data and the distribution together (the joint probability), what do we actually mean? The frequentist interpretation of probability says: count the number of times the desired event happened and divide it by the total number of possible events. Doing this for $ℙ(\text{data})$ means we need to find out how many times this specific data occurs and divide it by the total number of times any data could occur. For example, if the data is “among 100 random people, 40 of them were 6’2” tall”, then the total events that could happen in this sample are: