We don’t prefer to use this technique for text because a corpus usually has thousands of unique words → the vector size becomes huge and very sparse (mostly zeros).
⇒ Advantages → very simple → Intuitive (directly shows word presence)
⇒ Disadvantages → Sparse matrix becomes very huge → needs lots of memory → Out of vocabulary (OOV) → can’t handle words not seen during training → Not fixed-length → adding a new word changes the whole vector size → semantic meaning between words is not captured
It converts text into a numerical representation based on the frequency of each unique word in the corpus. Example →
D1 → He is a good boy
D2 → She is a good girl
D3 → boy and girl are good
Remove the stop words (he, is, a, she, and, are) and convert all text to lower case; the remaining vocabulary is ⇒ [good, boy, girl]
|    | f1 (good) | f2 (boy) | f3 (girl) |
|----|-----------|----------|-----------|
| D1 | 1         | 1        | 0         |
| D2 | 1         | 0        | 1         |
| D3 | 1         | 1        | 1         |
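As a rough sketch (not part of the original notes), the same table can be reproduced with scikit-learn’s CountVectorizer; the explicit stop-word list here is an assumption for this toy example, and the column order comes out alphabetical rather than [good, boy, girl]:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "He is a good boy",       # D1
    "She is a good girl",     # D2
    "boy and girl are good",  # D3
]

# Assumed stop-word list for this toy example; a real pipeline would
# usually use a standard list (e.g. sklearn's built-in 'english').
stop_words = ["he", "she", "is", "a", "and", "are"]

vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['boy' 'girl' 'good']
print(X.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]
```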
Binary BOW: Instead of storing the count, we store 1 if the word appears (even if more than once) and 0 if it doesn’t.
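A minimal sketch of binary BOW, assuming scikit-learn is available: passing binary=True clips every count to 1, so the vector only records presence or absence. The toy documents below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good boy good boy", "girl girl good"]  # hypothetical documents with repeats

# binary=True stores 1 if the word appears at all (even multiple times), else 0
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())
# Columns are alphabetical: boy, girl, good
# [[1 0 1]
#  [0 1 1]]
```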
Frequency BOW: Counts how many times each word appears in the document.
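As a minimal sketch, frequency BOW can also be computed by hand with Python’s collections.Counter; the document and vocabulary below are made up for illustration:

```python
from collections import Counter

doc = "good boy good boy girl"   # hypothetical document with repeated words
vocab = ["good", "boy", "girl"]  # vocabulary from the earlier example

counts = Counter(doc.lower().split())       # word -> frequency in this document
vector = [counts[word] for word in vocab]   # frequency BOW vector
print(vector)  # [2, 2, 1]
```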
⇒ Advantages → Simple → Intuitive