When removing stopwords, take care with negation words. In sentiment analysis they change the meaning completely: "food is good" and "food is not good" mean opposite things, but if "not" is removed as a stopword, both sentences reduce to the same tokens.
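A minimal sketch of the problem and one possible fix, assuming simple whitespace tokenization. The `STOPWORDS` and `NEGATIONS` sets and the `remove_stopwords` helper are hypothetical names made up for illustration, not taken from any library:

```python
# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"is", "the", "a", "an", "and", "not", "no"}
# Negation words we want to keep even when they appear in the stopword list.
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(sentence, keep_negations=True):
    """Drop stopwords from a whitespace-tokenized, lowercased sentence."""
    drop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [w for w in sentence.lower().split() if w not in drop]

# Without protecting negations, both sentences collapse to the same tokens.
print(remove_stopwords("food is good", keep_negations=False))      # ['food', 'good']
print(remove_stopwords("food is not good", keep_negations=False))  # ['food', 'good']

# Keeping negations preserves the difference in sentiment.
print(remove_stopwords("food is not good"))                        # ['food', 'not', 'good']
```

In practice the stopword list usually comes from a library, so the same idea applies: subtract the negation words from that list before filtering.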
TF-IDF helps capture meaning that BOW misses: instead of recording just 1 for a word that is present, it gives each word a weight. Rare words get higher weights, because a rare word tends to contribute more to the meaning of the sentence.
$$ TF(t) = \frac{\text{number of repetitions of the word in the sentence}}{\text{number of words in the sentence}} $$
$$ IDF(t) = \log_e \frac{\text{number of sentences}}{\text{number of sentences containing the word}} $$
Example:
Sent1: good boy; Sent2: good girl; Sent3: boy girl good
TF | Sent1 | Sent2 | Sent3 |
---|---|---|---|
good | 1/2 | 1/2 | 1/3 |
boy | 1/2 | 0 | 1/3 |
girl | 0 | 1/2 | 1/3 |
Words | IDF |
---|---|
good | log_e(3/3)=0 |
boy | log_e(3/2) |
girl | log_e(3/2) |
TF-IDF = TF * IDF | good | boy | girl |
---|---|---|---|
Sent1 | 0 | 1/2 * log_e(3/2) | 0 |
Sent2 | 0 | 0 | 1/2 * log_e(3/2) |
Sent3 | 0 | 1/3 * log_e(3/2) | 1/3 * log_e(3/2) |
Notice that good gets a weight of 0 because it appears in every sentence, while boy and girl get more weight because they are less common.
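The worked example can be checked with a short from-scratch sketch that follows the TF and IDF formulas as written (natural log, no smoothing). The helper names `tf` and `idf` are made up for this illustration. Library implementations often use smoothed or normalized variants, so their numbers may differ from these:

```python
import math

# The three example sentences, tokenized on whitespace.
sentences = {
    "Sent1": "good boy".split(),
    "Sent2": "good girl".split(),
    "Sent3": "boy girl good".split(),
}
vocab = ["good", "boy", "girl"]

def tf(word, tokens):
    """Occurrences of the word divided by the sentence length."""
    return tokens.count(word) / len(tokens)

def idf(word, docs):
    """log_e(number of sentences / sentences containing the word).
    Assumes the word occurs in at least one sentence."""
    containing = sum(1 for tokens in docs if word in tokens)
    return math.log(len(docs) / containing)

docs = list(sentences.values())
tfidf = {
    name: {w: tf(w, tokens) * idf(w, docs) for w in vocab}
    for name, tokens in sentences.items()
}

for name, row in tfidf.items():
    print(name, {w: round(v, 3) for w, v in row.items()})
# Sent1 {'good': 0.0, 'boy': 0.203, 'girl': 0.0}
# Sent2 {'good': 0.0, 'boy': 0.0, 'girl': 0.203}
# Sent3 {'good': 0.0, 'boy': 0.135, 'girl': 0.135}
```

Here 0.203 is 1/2 * log_e(3/2) and 0.135 is 1/3 * log_e(3/2).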
⇒ Advantages → Important words are captured → Intuitive
⇒ Disadvantages → Sparsity (reduced compared to BOW, but still exists) → Out of vocabulary (OOV)
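The out-of-vocabulary limitation can be illustrated with a small sketch, assuming the vocabulary is fixed to the three words from the example sentences. The `vectorize_counts` helper is a hypothetical name used only here. A word never seen when the vocabulary was built has no column, so it is silently lost from the vector:

```python
# Vocabulary built from the example sentences; it is fixed after "training".
vocab = ["good", "boy", "girl"]

def vectorize_counts(sentence, vocab):
    """Return the count vector over vocab, plus any tokens that had no column."""
    tokens = sentence.lower().split()
    vector = [tokens.count(w) for w in vocab]
    oov = [t for t in tokens if t not in vocab]
    return vector, oov

vector, oov = vectorize_counts("good dog", vocab)
print(vector)  # [1, 0, 0]
print(oov)     # ['dog']
```

The same holds for TF-IDF weights: "dog" has no IDF value because it never appeared in the sentences used to compute the weights.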