Video-3

Note:

When removing stopwords, take care with negation words. In sentiment analysis they change the meaning completely: "food is good" and "food is not good" mean opposite things, but if "not" is removed as a stopword, both sentences end up identical.
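To illustrate the point, here is a minimal sketch in plain Python. The stopword list and the `remove_stopwords` helper are hypothetical and made up for this note; they do not come from any library, and a real stopword list would be much longer.

```python
# Illustrative stopword list (an assumption, not taken from any library).
STOPWORDS = {"is", "the", "a", "an", "and", "not", "no"}
# Negation words we want to keep for sentiment tasks (also an assumption).
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(sentence, keep_negations=True):
    """Drop stopwords from a whitespace-tokenized, lowercased sentence."""
    drop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [w for w in sentence.lower().split() if w not in drop]

# Naive removal: both sentences collapse to the same tokens.
naive_pos = remove_stopwords("food is good", keep_negations=False)
naive_neg = remove_stopwords("food is not good", keep_negations=False)

# Keeping negations preserves the difference.
safe_neg = remove_stopwords("food is not good")

print(naive_pos)  # ['food', 'good']
print(naive_neg)  # ['food', 'good']
print(safe_neg)   # ['food', 'not', 'good']
```

With the negation kept, a sentiment model can still tell the two sentences apart.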

TF-IDF (Term Frequency and Inverse Document Frequency):

TF-IDF helps capture semantic information that BOW misses. Instead of giving just 1 to every word that is present, it gives each word a weight. Rare words get a higher weight, because a rare word has more effect on the meaning of the sentence.

$$ TF(t) = \frac{\text{number of repetitions of the word in the sentence}}{\text{number of words in the sentence}} $$

$$ IDF(t) = \log_e \frac{\text{number of sentences}}{\text{number of sentences containing the word}} $$
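The two formulas can be sketched directly in Python. This is only an illustration under some assumptions: sentences are tokenized by whitespace, the logarithm is the natural log, and the word appears in at least one sentence (otherwise the IDF division fails). The function names `tf` and `idf` and the tiny corpus are made up for this sketch.

```python
import math

def tf(term, sentence):
    """Occurrences of the term divided by the number of words in the sentence."""
    words = sentence.split()
    return words.count(term) / len(words)

def idf(term, sentences):
    """Natural log of (number of sentences / sentences containing the term)."""
    containing = sum(1 for s in sentences if term in s.split())
    return math.log(len(sentences) / containing)

# Hypothetical two-sentence corpus, only for illustration.
corpus = ["red apple", "red red car"]

tf_red = tf("red", "red red car")   # 2 of the 3 words are "red"
idf_red = idf("red", corpus)        # "red" is in every sentence, so log(2/2)
idf_car = idf("car", corpus)        # "car" is in one sentence, so log(2/1)

print(tf_red, idf_red, idf_car)
```

A word that occurs in every sentence always gets an IDF of 0, whatever its TF is.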

Example:

Sent1: good boy
Sent2: good girl
Sent3: boy girl good

TF	Sent1	Sent2	Sent3
good	1/2	1/2	1/3
boy	1/2	0	1/3
girl	0	1/2	1/3

Words	IDF
good	log_e(3/3)=0
boy	log_e(3/2)
girl	log_e(3/2)

TF-IDF = TF * IDF	good	boy	girl
Sent1	0	1/2 * log_e(3/2)	0
Sent2	0	0	1/2 * log_e(3/2)
Sent3	0	1/3 * log_e(3/2)	1/3 * log_e(3/2)

Notice that good gets 0 because it is very common (it appears in every sentence), while boy and girl get more weight because they are less common.
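The example can be checked numerically with a short sketch. It assumes whitespace tokenization and the natural logarithm, and the variable names are only illustrative. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalize the vectors by default, so their numbers will differ from this hand calculation.

```python
import math
from collections import Counter

sentences = {
    "Sent1": "good boy",
    "Sent2": "good girl",
    "Sent3": "boy girl good",
}
tokenized = {name: text.split() for name, text in sentences.items()}
vocab = ["good", "boy", "girl"]
n = len(tokenized)

# IDF per word: log_e(number of sentences / sentences containing the word).
idf = {
    w: math.log(n / sum(w in toks for toks in tokenized.values()))
    for w in vocab
}

# TF-IDF per sentence and word: (count / sentence length) * IDF.
tfidf = {
    name: {w: Counter(toks)[w] / len(toks) * idf[w] for w in vocab}
    for name, toks in tokenized.items()
}

for name, row in tfidf.items():
    print(name, [round(row[w], 3) for w in vocab])
# Sent1 [0.0, 0.203, 0.0]
# Sent2 [0.0, 0.0, 0.203]
# Sent3 [0.0, 0.135, 0.135]
```

The column for good is 0 in every row, which matches the table.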

⇒ Advantages → Important words are captured → Intuitive

⇒ Disadvantages → Sparsity (still exists, as in BOW) → Out of vocabulary (OOV)
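The out-of-vocabulary problem can be seen in a small sketch. It assumes the vocabulary is built only from the three example sentences; the new sentence "good dog" and the `term_counts` helper are made up for illustration.

```python
train = ["good boy", "good girl", "boy girl good"]
# Vocabulary is fixed from the training sentences only.
vocab = sorted({w for s in train for w in s.split()})

def term_counts(sentence):
    """Count each vocabulary word in the sentence; unknown words are ignored."""
    words = sentence.split()
    return [words.count(w) for w in vocab]

new_sentence = "good dog"
vector = term_counts(new_sentence)
oov = [w for w in new_sentence.split() if w not in vocab]

print(vocab)   # ['boy', 'girl', 'good']
print(vector)  # [0, 0, 1]
print(oov)     # ['dog']
```

The word dog has no position in the vector, so it is silently dropped. The vector also shows the sparsity problem on a small scale: most entries are 0, and with a real vocabulary of thousands of words almost all of them would be.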