We want to take a sentence of $k$ words and assign a probability to it - this lets us rank alternative sentences against each other.

$$ P(w_1, w_2, ..., w_k) = \text{Probability of encountering sentence } w_1w_2...w_k $$

This is useful in several applications, e.g. speech recognition, machine translation, and spelling correction - in each case we pick the candidate sentence with the highest probability.

Note that for $P(w_1, w_2, ..., w_k)$, we can apply the chain rule of probability:

$$ \begin{align*} P(w_1, ...,w_k) &= P(w_1) \cdot P(w_2, ..., w_k | w_1) \\ &= P(w_1) \cdot P(w_2 | w_1) \cdot ... \cdot P(w_k | w_1, w_2, ..., w_{k-1}) \end{align*} $$
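For example, for a three-word sentence:

$$ P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} | \text{the}) \cdot P(\text{sat} | \text{the}, \text{cat}) $$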

However, using this approach directly is intractable: the final term conditions on the entire history $w_1, ..., w_{k-1}$, so we would need reliable statistics for an exponentially large number of possible histories, almost none of which ever appear in a training corpus.

Instead, we use the idea of an N-gram: assume the probability of the next word depends only on the previous $N-1$ words, i.e. $P(w_k | w_1, w_2, ..., w_{k-1}) \approx P(w_k | w_{k-(N-1)}, ..., w_{k-1})$.
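For example, a bigram model ($N = 2$) only ever looks one word back, so the whole sentence probability factorises as:

$$ P(w_1, ..., w_k) \approx P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdot ... \cdot P(w_k | w_{k-1}) $$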

The individual N-gram probabilities can be calculated from corpus counts with Hadoop (you have done this before!). Usually, don’t go above 5-grams - counts for longer N-grams become too sparse to estimate reliably.

e.g. Calculating bigrams:

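The maximum-likelihood estimate is just a ratio of counts: $P(w_i | w_{i-1}) = \text{count}(w_{i-1} w_i) / \text{count}(w_{i-1})$. A minimal single-machine sketch of this (standing in for the Hadoop job; `sentences` is assumed to be an already-tokenised corpus):

```python
from collections import Counter

def bigram_probabilities(sentences):
    """MLE bigram probabilities from a tokenised corpus.

    sentences: list of lists of words, e.g. [["the", "cat", "sat"], ...]
    Returns a dict mapping (w_prev, w) -> P(w | w_prev).
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return {
        (prev, w): count / unigram_counts[prev]
        for (prev, w), count in bigram_counts.items()
    }

probs = bigram_probabilities([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(probs[("the", "cat")])  # 0.5 - "the" is followed by "cat" half the time
```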

With this naïve model, any sentence containing an N-gram that never occurred in the training corpus has 0 probability (i.e. a single unusual word combination turns the whole sentence “impossible”). Thus, we remove 0 probabilities to “smooth” the probability distributions.
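One simple scheme (a sketch of the idea, not the only option) is Laplace, or add-one, smoothing: pretend every possible N-gram was seen once more than it actually was, so nothing ends up with zero probability. For bigrams, with vocabulary size $V$:

$$ P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1} w_i) + 1}{\text{count}(w_{i-1}) + V} $$

More sophisticated schemes (e.g. Kneser-Ney) redistribute the probability mass more carefully, but the goal is the same: no sentence should be judged impossible just because one of its N-grams is unseen.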