We want to take a sentence of $k$ words and assign a probability to it - this lets us rank alternative sentences against each other.

$$ P(w_1, w_2, ..., w_k) = \text{Probability of encountering sentence } w_1w_2...w_k $$

This is useful in several applications, e.g. speech recognition, machine translation, and spelling correction - in each case we pick the candidate sentence with the highest probability.

Note that for $P(w_1, w_2, ..., w_k)$, we can apply the chain rule of probability:

$$ \begin{align*} P(w_1, ...,w_k) &= P(w_1) \cdot P(w_2, ..., w_k | w_1) \\ &= P(w_1) \cdot P(w_2 | w_1) \cdot ... \cdot P(w_k | w_1, w_2, ..., w_{k-1}) \end{align*} $$
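For example, for a three-word sentence:

$$ P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} | \text{the}) \cdot P(\text{sat} | \text{the}, \text{cat}) $$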

However, using this approach directly is intractable: the final term conditions on the entire history $w_1, ..., w_{k-1}$, so we would need reliable statistics for an exponentially large number of possible histories, almost none of which ever appear in a training corpus.

Instead, we use the idea of an N-gram: assume the probability of the next word depends only on the previous $N-1$ words, i.e. $P(w_k | w_1, w_2, ..., w_{k-1}) \approx P(w_k | w_{k-(N-1)}, ..., w_{k-1})$.
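For example, a bigram model ($N = 2$) only ever looks one word back, so the whole sentence probability factorises as:

$$ P(w_1, ..., w_k) \approx P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdot ... \cdot P(w_k | w_{k-1}) $$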

The individual N-gram probabilities can be calculated from corpus counts with Hadoop (you have done this before!). Usually, don’t go above 5-grams - counts for longer N-grams become too sparse to estimate reliably.

e.g. Calculating bigrams:

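The maximum-likelihood estimate is just a ratio of counts: $P(w_i | w_{i-1}) = \text{count}(w_{i-1} w_i) / \text{count}(w_{i-1})$. A minimal single-machine sketch of this (standing in for the Hadoop job; `sentences` is assumed to be an already-tokenised corpus):

```python
from collections import Counter

def bigram_probabilities(sentences):
    """MLE bigram probabilities from a tokenised corpus.

    sentences: list of lists of words, e.g. [["the", "cat", "sat"], ...]
    Returns a dict mapping (w_prev, w) -> P(w | w_prev).
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return {
        (prev, w): count / unigram_counts[prev]
        for (prev, w), count in bigram_counts.items()
    }

probs = bigram_probabilities([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(probs[("the", "cat")])  # 0.5 - "the" is followed by "cat" half the time
```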

With this naïve model, any sentence containing an N-gram that never occurred in the training corpus has 0 probability (i.e. a single unusual word combination turns the whole sentence “impossible”). Thus, we remove 0 probabilities to “smooth” the probability distributions.
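One simple scheme (a sketch of the idea, not the only option) is Laplace, or add-one, smoothing: pretend every possible N-gram was seen once more than it actually was, so nothing ends up with zero probability. For bigrams, with vocabulary size $V$:

$$ P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1} w_i) + 1}{\text{count}(w_{i-1}) + V} $$

More sophisticated schemes (e.g. Kneser-Ney) redistribute the probability mass more carefully, but the goal is the same: no sentence should be judged impossible just because one of its N-grams is unseen.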