We don’t prefer to use this technique for text because a corpus usually has thousands of unique words → the vector size becomes huge and very sparse (mostly zeros).
⇒ Advantages → very simple → Intuitive (directly shows word presence)
⇒ Disadvantages → Sparse matrix becomes very huge → needs lots of memory → Out of vocabulary (OOV) → can’t handle words not seen during training → Not fixed-length → adding a new word changes the whole vector size → semantic meaning between words is not captured
It converts text into a numerical representation based on the frequency of each unique word in the corpus. Example →
D1 → He is a good boy
D2 → She is a good girl
D3 → boy and girl are good
Remove the stop words (he, is, a, she, and, are) and convert all text to lower case; the remaining vocabulary is ⇒ [good, boy, girl]
|    | f1 (good) | f2 (boy) | f3 (girl) |
|----|-----------|----------|-----------|
| D1 | 1         | 1        | 0         |
| D2 | 1         | 0        | 1         |
| D3 | 1         | 1        | 1         |
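As a rough sketch (not part of the original notes), the same table can be reproduced with scikit-learn’s CountVectorizer; the explicit stop-word list here is an assumption for this toy example, and the column order comes out alphabetical rather than [good, boy, girl]:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "He is a good boy",       # D1
    "She is a good girl",     # D2
    "boy and girl are good",  # D3
]

# Assumed stop-word list for this toy example; a real pipeline would
# usually use a standard list (e.g. sklearn's built-in 'english').
stop_words = ["he", "she", "is", "a", "and", "are"]

vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['boy' 'girl' 'good']
print(X.toarray())
# [[1 0 1]
#  [0 1 1]
#  [1 1 1]]
```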
Binary BOW: Instead of storing the count, we store 1 if the word appears (even if more than once) and 0 if it doesn’t.
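A minimal sketch of binary BOW, assuming scikit-learn is available: passing binary=True clips every count to 1, so the vector only records presence or absence. The toy documents below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good boy good boy", "girl girl good"]  # hypothetical documents with repeats

# binary=True stores 1 if the word appears at all (even multiple times), else 0
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(docs).toarray())
# Columns are alphabetical: boy, girl, good
# [[1 0 1]
#  [0 1 1]]
```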
Frequency BOW: Counts how many times each word appears in the document.
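As a minimal sketch, frequency BOW can also be computed by hand with Python’s collections.Counter; the document and vocabulary below are made up for illustration:

```python
from collections import Counter

doc = "good boy good boy girl"   # hypothetical document with repeated words
vocab = ["good", "boy", "girl"]  # vocabulary from the earlier example

counts = Counter(doc.lower().split())       # word -> frequency in this document
vector = [counts[word] for word in vocab]   # frequency BOW vector
print(vector)  # [2, 2, 1]
```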
⇒ Advantages → Simple → Intuitive