Term Frequency and Inverse Document Frequency
⇒ Comes to address a problem with Bag of Words (BOW): every word gets the same weight, so the importance of a word is not captured
⇒ Rare words should receive a higher weight when we create the vectors
EX:
D1 → He is a good boy
D2 → She is a good girl
D3 → Boy and girl are good
Term Frequency (TF) → computed for every document: (count of the word in the document) / (total words in the document), after removing stop words such as "he", "is", "a"
| | good | boy | girl |
| --- | --- | --- | --- |
| Doc1 | 1/2 | 1/2 | 0 |
| Doc2 | 1/2 | 0 | 1/2 |
| Doc3 | 1/3 | 1/3 | 1/3 |
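The TF table above can be reproduced with a small sketch (assuming stop words have already been stripped, so each document is just its kept tokens):

```python
from collections import Counter

def term_frequency(doc_tokens):
    # TF = (count of the word in the document) / (total words in the document)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {word: count / total for word, count in counts.items()}

# D1, D2, D3 after stop-word removal
docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
for i, doc in enumerate(docs, start=1):
    print(f"Doc{i}:", term_frequency(doc))
```

Running this prints 1/2 for "good" and "boy" in Doc1, matching the table row by row.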
Inverse Document Frequency (IDF) → computed for every word: log(total number of documents / number of documents containing the word)
| Words | IDF |
| --- | --- |
| good | log(3/3) = 0 |
| boy | log(3/2) |
| girl | log(3/2) |
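The IDF column follows directly from the formula; a minimal sketch over the same three documents:

```python
import math

def inverse_document_frequency(docs, word):
    # IDF = log(total documents / documents containing the word)
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / doc_freq)

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
for word in ["good", "boy", "girl"]:
    print(word, inverse_document_frequency(docs, word))
```

"good" appears in all 3 documents, so its IDF is log(3/3) = 0; "boy" and "girl" each appear in 2 of 3, giving log(3/2) ≈ 0.405.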
TF-IDF → TF × IDF for every word in every document
| | good | boy | girl |
| --- | --- | --- | --- |
| Doc1 | 1/2 * log(3/3) = 0 | 1/2 * log(3/2) | 0 * log(3/2) = 0 |
| Doc2 | 1/2 * log(3/3) = 0 | 0 * log(3/2) = 0 | 1/2 * log(3/2) |
| Doc3 | 1/3 * log(3/3) = 0 | 1/3 * log(3/2) | 1/3 * log(3/2) |
The weight of the word "good" is zero because it appears in every document, so it carries no information for distinguishing one document from another
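Putting TF and IDF together gives the full TF-IDF vectors from the table above. A self-contained sketch (note that libraries such as scikit-learn's `TfidfVectorizer` use a smoothed variant of this formula, so their numbers will differ slightly):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Vocabulary = all distinct words across the documents, in sorted order
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # IDF = log(total documents / documents containing the word)
    idf = {w: math.log(n_docs / sum(1 for d in docs if w in d)) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF-IDF = (count / doc length) * IDF, one entry per vocabulary word
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for i, vec in enumerate(vectors, start=1):
    print(f"Doc{i}:", vec)
```

The "good" column comes out as all zeros, exactly as the table shows, while "boy" and "girl" get nonzero weights only in the documents that contain them.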