Day 2

Some Terminology:

Corpus → All the text data (the full collection of documents)
Document → One row of the corpus (e.g. one sentence)
Vocabulary → All unique words in the corpus
Word → A single token inside a document
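
A minimal sketch of how these terms relate, using a made-up two-sentence corpus and plain whitespace splitting (both are assumptions for illustration, not part of the notes):

```python
# Toy corpus (made up for illustration): each string is one document.
corpus = [
    "he is a good boy",
    "she is a good girl",
]

# Documents are the individual rows of the corpus.
documents = list(corpus)

# Words are the tokens obtained by splitting each document on whitespace.
words = [word for doc in documents for word in doc.split()]

# The vocabulary is the set of unique words across the whole corpus.
vocabulary = sorted(set(words))

print(len(documents))  # number of documents
print(len(words))      # total number of words
print(vocabulary)      # unique words
```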

NLP Task:

Dataset
Text Preprocessing
1. Tokenization, lowercasing the words
2. Stemming, lemmatization, stop-word removal
Words to Vectors (BOW | TF-IDF | Word2Vec)
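
The preprocessing steps can be sketched roughly as below. This covers only tokenization, lowercasing, and stop-word removal; stemming and lemmatization are left out because they normally come from a library. The `STOP_WORDS` set and the `preprocess` function are hypothetical names made up for this sketch.

```python
import re

# Hypothetical stop-word list; real projects usually load one from a library.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("He is a good boy"))       # ['good', 'boy']
print(preprocess("Boy and girl are good"))  # ['boy', 'girl', 'good']
```

The output of this step (a list of cleaned tokens per document) is what gets converted to vectors afterwards.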

One Hot Encoding:

Advantages
- Simple to implement
- Intuitive
Disadvantages
- Creates a sparse matrix (every value is zero except a single one) → makes training slow and requires a lot of RAM.
- OOV → Out Of Vocabulary → the model cannot handle new data correctly: if it sees a new word it was not trained on, it encodes it as all zeros.
- Input size is not fixed (sentences with different numbers of words produce inputs of different sizes)
- Semantic meaning between words is not captured
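
A small sketch of one-hot encoding that makes the disadvantages above visible. The three-word vocabulary and the helper names `one_hot` and `encode` are assumptions chosen for illustration.

```python
# Assumed toy vocabulary; the position in the list is the "hot" index.
vocabulary = ["boy", "girl", "good"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a single 1 at the word's index (all zeros if OOV)."""
    vec = [0] * len(vocabulary)
    if word in index:
        vec[index[word]] = 1
    return vec

def encode(sentence):
    """Encode a sentence as one one-hot vector per word."""
    return [one_hot(w) for w in sentence.lower().split()]

print(one_hot("good"))  # [0, 0, 1]
print(one_hot("dog"))   # [0, 0, 0]  (OOV word becomes all zeros)
print(len(encode("good boy")), len(encode("boy girl good")))  # 2 3
```

Each vector is as long as the vocabulary, so a realistic vocabulary produces very long, mostly zero rows, and sentences of different lengths produce a different number of rows.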

BOW (Bag Of Words):

D1 → He is a good boy

D2 → She is a good girl

D3 → Boy and girl are good

First, remove stop words and lowercase the text:
- D1 → good boy
- D2 → good girl
- D3 → boy girl good
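
As a rough sketch, the cleaned documents can then be turned into count vectors. The alphabetical ordering of the vocabulary and the helper name `bow` are assumptions made here; any fixed ordering works as long as it is the same for every document.

```python
from collections import Counter

# The three documents after stop-word removal and lowercasing.
docs = ["good boy", "good girl", "boy girl good"]

# Vocabulary in alphabetical order (an arbitrary but fixed choice).
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bow(doc) for doc in docs]

print(vocabulary)  # ['boy', 'girl', 'good']
for name, vec in zip(["D1", "D2", "D3"], vectors):
    print(name, vec)
# D1 [1, 0, 1]
# D2 [0, 1, 1]
# D3 [1, 1, 1]
```

Unlike one-hot encoding per word, every document gets a single vector whose length equals the vocabulary size, regardless of how many words the document has.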