Note:

When we remove stopwords, we must take care with negation words such as "not". In sentiment analysis they change the meaning completely: "food is good" and "food is not good" mean opposite things, but if "not" is removed as a stopword, both sentences reduce to the same words.
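A minimal sketch of this problem in Python. The stopword list and the `remove_stopwords` helper are hypothetical and kept tiny for illustration; real stopword lists are much longer, and which negation words to keep is a choice you would make for your own task.

```python
# Tiny illustrative stopword list (an assumption, not a real library list).
STOPWORDS = {"is", "the", "a", "an", "not", "no", "of", "and"}
# Negation words we choose to keep for sentiment tasks (also an assumption).
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(sentence, keep_negations=True):
    """Drop stopwords from a sentence, optionally protecting negation words."""
    stop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [w for w in sentence.lower().split() if w not in stop]

# Naive removal: both sentences collapse to the same tokens.
naive_pos = remove_stopwords("food is good", keep_negations=False)
naive_neg = remove_stopwords("food is not good", keep_negations=False)

# Protecting negations keeps the two sentences distinguishable.
safe_neg = remove_stopwords("food is not good", keep_negations=True)

print(naive_pos, naive_neg, safe_neg)
```

With naive removal both sentences become `['food', 'good']`; with negations protected, the second one stays `['food', 'not', 'good']`.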

TF-IDF (Term Frequency and Inverse Document Frequency):

It helps capture word importance that BOW does not capture. Instead of just giving a 1 for each word, it gives each word a weight, and rare words get higher weights because a word that appears in only a few sentences has more effect on the meaning of the sentence it appears in.

$$ TF(t) = \frac{\text{number of repetitions of the word in the sentence}}{\text{number of words in the sentence}} $$

$$ IDF(t) = \log_e \frac{\text{number of sentences}}{\text{number of sentences containing the word}} $$
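The two formulas can be sketched directly in Python. This is a minimal version under stated assumptions: sentences are already split into word lists, the natural logarithm is used, and the function names `tf` and `idf` and the small `corpus` below are illustrative only.

```python
import math

def tf(term, sentence_tokens):
    """Occurrences of the term in the sentence / number of words in the sentence."""
    return sentence_tokens.count(term) / len(sentence_tokens)

def idf(term, corpus_tokens):
    """Natural log of (number of sentences / number of sentences containing the term).

    Assumes the term appears in at least one sentence; otherwise this divides by zero.
    """
    containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / containing)

# Hypothetical two-sentence corpus, already tokenized.
corpus = [["red", "apple"], ["red", "red", "car", "fast"]]

tf_red = tf("red", corpus[1])    # 2 occurrences out of 4 words
idf_red = idf("red", corpus)     # appears in both sentences
idf_car = idf("car", corpus)     # appears in one of two sentences
```

Here `tf_red` is 0.5, `idf_red` is 0 because "red" occurs in every sentence, and `idf_car` is `log(2)`.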

Example:

- Sent1: good boy
- Sent2: good girl
- Sent3: boy girl good

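TF:

| Word | Sent1 | Sent2 | Sent3 |
|------|-------|-------|-------|
| good | 1/2 | 1/2 | 1/3 |
| boy | 1/2 | 0 | 1/3 |
| girl | 0 | 1/2 | 1/3 |

IDF:

| Word | IDF |
|------|-----|
| good | log_e(3/3) = 0 |
| boy | log_e(3/2) |
| girl | log_e(3/2) |

TF-IDF = TF * IDF:

| Sentence | good | boy | girl |
|----------|------|-----|------|
| Sent1 | 0 | 1/2 * log_e(3/2) | 0 |
| Sent2 | 0 | 0 | 1/2 * log_e(3/2) |
| Sent3 | 0 | 1/3 * log_e(3/2) | 1/3 * log_e(3/2) |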

Notice that good gets a weight of 0 because it appears in every sentence, while boy and girl get more weight because they are less common.
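The worked example can be checked with a few lines of plain Python. This is a sketch that follows the formulas in these notes (natural log, no smoothing or normalization), so library implementations that apply smoothing or normalization may give different numbers. The variable names are illustrative.

```python
import math

sentences = ["good boy", "good girl", "boy girl good"]
docs = [s.split() for s in sentences]
vocab = ["good", "boy", "girl"]

# IDF per word: log_e(number of sentences / sentences containing the word).
idf = {w: math.log(len(docs) / sum(w in d for d in docs)) for w in vocab}

# One TF-IDF row per sentence, one column per vocabulary word.
tfidf = [[d.count(w) / len(d) * idf[w] for w in vocab] for d in docs]

for name, row in zip(["sent1", "sent2", "sent3"], tfidf):
    print(name, [round(x, 3) for x in row])
```

Since log_e(3/2) is about 0.405, the nonzero entries come out to roughly 0.203 for Sent1 and Sent2 and 0.135 for Sent3, and the whole "good" column is 0.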

Advantages:

- Important words are captured
- Intuitive

Disadvantages:

- Sparsity: reduced compared to BOW, but still exists
- Out of Vocabulary (OOV)
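A small sketch of the out-of-vocabulary problem, assuming the vocabulary is built once from the three example sentences. The helper `to_counts` is hypothetical and uses plain counts for simplicity; a TF-IDF vector built on the same vocabulary has the same limitation, because a word never seen before has no column.

```python
corpus = ["good boy", "good girl", "boy girl good"]

# The vocabulary is fixed from the sentences we have seen.
vocab = sorted({w for s in corpus for w in s.split()})

def to_counts(sentence):
    """Count each vocabulary word in the sentence; unknown words are silently dropped."""
    tokens = sentence.split()
    return [tokens.count(w) for w in vocab]

known = to_counts("good boy")
unseen = to_counts("bad boy")   # "bad" was never seen, so it has no column

oov_words = [w for w in "bad boy".split() if w not in vocab]
print(vocab, known, unseen, oov_words)
```

The vector for "bad boy" only records "boy", so the information carried by "bad" is lost. The zeros in both vectors also show the sparsity: every vocabulary word that is missing from a sentence still takes up a slot.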