Basic Terminologies in NLP:

One Hot Encoding :

We don’t prefer to use this technique for text because text usually has thousands of unique words → the vector size becomes huge and very sparse (mostly zeros).

Advantages → very simple → Intuitive (directly shows word presence)

Disadvantages → the sparse matrix becomes very huge → needs lots of memory → Out of vocabulary (OOV): can’t handle words not seen during training → Not fixed: adding a new word changes the whole vector size → semantic meaning between words is not captured
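The points above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a tiny hypothetical vocabulary; note how the OOV problem shows up naturally:

```python
# Minimal one-hot encoding sketch (hypothetical 3-word vocabulary).
vocab = ["good", "boy", "girl"]

def one_hot(word, vocab):
    # Each word maps to a vector of len(vocab) entries with a single 1;
    # with thousands of unique words this vector becomes huge and sparse.
    if word not in vocab:
        raise KeyError(f"out-of-vocabulary word: {word}")  # the OOV problem
    return [1 if w == word else 0 for w in vocab]

print(one_hot("boy", vocab))   # a single 1, everything else 0
```

Adding a new word to `vocab` changes the length of every vector, which is the “not fixed” disadvantage listed above.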

Bag Of Words (BOW):

It converts text into a numerical representation based on the frequency of each unique word in the corpus. Example:

D1 → He is a good boy
D2 → She is a good girl
D3 → boy and girl are good

Delete the stop words (he, is, a, she, and, are) and convert all text to lower case; the vocabulary we get ⇒ [good, boy, girl]

|     | f1: good | f2: boy | f3: girl |
|-----|----------|---------|----------|
| D1  | 1        | 1       | 0        |
| D2  | 1        | 0       | 1        |
| D3  | 1        | 1       | 1        |
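The table above can be reproduced with a short sketch, assuming the stop words have already been removed from the vocabulary:

```python
from collections import Counter

docs = {
    "D1": "He is a good boy",
    "D2": "She is a good girl",
    "D3": "boy and girl are good",
}
vocab = ["good", "boy", "girl"]  # vocabulary after stop-word removal

def bow_vector(text, vocab):
    # Lowercase, split on whitespace, then count each vocabulary word.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

for name, text in docs.items():
    print(name, bow_vector(text, vocab))
```

Each document becomes a fixed-length vector over the shared vocabulary, which is what makes the documents comparable to each other.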

Binary BOW: Instead of storing the count, we store 1 if the word appears (even if more than once) and 0 if it doesn’t.

Frequency BOW:

Counts how many times each word appears in the document.
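The difference between the two variants only shows up when a word repeats. A minimal comparison, using a made-up document where “good” appears twice:

```python
from collections import Counter

vocab = ["good", "boy", "girl"]
doc = "good boy good girl"  # hypothetical document: 'good' appears twice

counts = Counter(doc.lower().split())
frequency_bow = [counts[w] for w in vocab]               # stores raw counts
binary_bow = [1 if counts[w] > 0 else 0 for w in vocab]  # stores presence only

print(frequency_bow)  # 'good' counted twice
print(binary_bow)     # 'good' clipped to 1
```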

⇒ Advantages → Simple → Intuitive