Some Terminology:
- Corpus → All the text data (the full collection of documents)
- Document → One row of the corpus (e.g., a sentence)
- Vocabulary → All unique words in the corpus
- Word → A single token in a document
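To make these terms concrete, here is a minimal sketch in plain Python. The three sentences are the toy documents used later in these notes; the whitespace split is an assumed, simplified tokenizer, not a recommended one.

```python
# Corpus: all the data (here, three short documents).
corpus = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

# Each document is lowercased and split into words (naive whitespace split).
documents = [doc.lower().split() for doc in corpus]

# Vocabulary: every unique word that appears anywhere in the corpus.
vocabulary = sorted({word for doc in documents for word in doc})

print(len(corpus))   # number of documents
print(vocabulary)    # unique words, alphabetically sorted
```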
NLP Task:
- Dataset
- Text Preprocessing
  - Tokenization, lowercasing
  - Stemming, lemmatization, stop-word removal
- Words to Vectors (BOW | TF-IDF | Word2Vec)
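The preprocessing steps above can be sketched with the standard library alone. Everything here is an illustrative assumption: `STOP_WORDS` is a tiny hand-written list, and `crude_stem` is a deliberately rough suffix stripper, not the Porter stemmer or a real lemmatizer. In practice a library such as NLTK or spaCy would usually handle these steps.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "he", "she"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Very rough stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize + lowercase, drop stop words, then stem what is left."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("The boys are playing"))  # ['boy', 'play']
```

The output of `preprocess` is the list of cleaned tokens that the vectorization step (BOW, TF-IDF, or Word2Vec) would then consume.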
One Hot Encoding:
- Advantages
  - Simple to implement
  - Intuitive
- Disadvantages
  - Creates a sparse matrix (every value is zero except a single one) → makes training slow and requires a lot of RAM.
  - OOV → Out Of Vocabulary → the model cannot handle new data correctly: if it sees a new word it was not trained on, that word is encoded as all zeros.
  - Input size is not fixed (sentences with different numbers of words produce different-sized inputs)
  - Semantic meaning between words is not captured
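A small sketch of one-hot encoding that shows the disadvantages listed above. The helper names `build_vocab` and `one_hot` are made up for this example, and encoding an unseen word as all zeros is one common convention, assumed here to match the note above.

```python
def build_vocab(documents):
    """Map each unique word to an index, in sorted order."""
    words = sorted({w for doc in documents for w in doc})
    return {w: i for i, w in enumerate(words)}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    if word in vocab:          # an OOV word is left as all zeros
        vec[vocab[word]] = 1
    return vec

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = build_vocab(docs)      # {'boy': 0, 'girl': 1, 'good': 2}

print(one_hot("good", vocab))  # [0, 0, 1]
print(one_hot("bad", vocab))   # [0, 0, 0]  (OOV: never seen in training)

# Each sentence becomes one vector per word, so the encoded size varies.
encoded = [[one_hot(w, vocab) for w in doc] for doc in docs]
print([len(e) for e in encoded])  # [2, 2, 3]
```

With a realistic vocabulary of tens of thousands of words, each vector would hold that many entries with a single 1, which is where the sparsity and memory cost come from.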
BOW (Bag Of Words):
D1 → He is a good boy
D2 → She is a good girl
D3 → Boy and girl are good
- First remove stop words and lowercase the text
- D1 → good boy
- D2 → good girl
- D3 → boy girl good
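As a rough sketch of how these cleaned documents could turn into Bag-of-Words count vectors: the stop-word set below is an assumption chosen so that the output matches the cleaned D1 to D3 above, and `bow_vector` is a hypothetical helper name.

```python
from collections import Counter

# Assumed stop-word list, chosen to reproduce the cleaned documents above.
STOP_WORDS = {"he", "she", "is", "a", "and", "are"}

corpus = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

docs = [preprocess(d) for d in corpus]
vocab = sorted({w for d in docs for w in d})   # ['boy', 'girl', 'good']

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [bow_vector(d, vocab) for d in docs]
for name, vec in zip(["D1", "D2", "D3"], vectors):
    print(name, vec)
# D1 [1, 0, 1]
# D2 [0, 1, 1]
# D3 [1, 1, 1]
```

Unlike one-hot encoding of individual words, every document here maps to a vector of the same length (the vocabulary size), regardless of how many words it contains.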