Text preprocessing 1
- Tokenization
    - splits text into smaller units (tokens), e.g. a sentence into individual words (see the first sketch after this list)
- Lemmatization
    - reduces a word to its dictionary base form (lemma), which is always a meaningful word
    - slower than stemming
    - EX: text summarization, chatbots
- Stop words
    - very common words that carry little meaning (to, of, the, a, an, …)
    - you can use a ready-made list or create your own
    - they are usually removed before modeling
- Stemming
    - chops a word down to its root form (stem) using simple rules (see the second sketch after this list)
    - the stem may not be a meaningful word (e.g. "studies" → "studi")
    - faster than lemmatization
    - EX: spam filtering, comment classification
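
A minimal sketch of tokenization and stop-word removal using NLTK (an assumed library choice; the `punkt` and `stopwords` data packages must be downloaded, and exact package names can vary slightly across NLTK versions; the sample sentence is made up):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time downloads: tokenizer model and English stop-word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The cat sat on the mat and looked at the birds."

# Tokenization: convert the sentence into a list of word tokens.
tokens = word_tokenize(text.lower())

# Stop-word removal: drop common low-meaning words; you can also
# extend this set with your own stop words.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

print(filtered)  # ['cat', 'sat', 'mat', 'looked', 'birds']
```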
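
A similar sketch contrasting stemming and lemmatization with NLTK's PorterStemmer and WordNetLemmatizer (the `wordnet` corpus is assumed to be downloadable); the stem is not always a real word, while the lemma always is:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the WordNet dictionary used by the lemmatizer.
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "going", "history"]:
    # Stemming: fast, rule-based suffix stripping; the result may
    # not be a meaningful word ("studi", "histori").
    # Lemmatization: slower dictionary lookup; the result is always
    # a valid word ("study", "history").
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
```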
Text preprocessing 2
- BOW (Bag of Words)
    - represents each document as a vector of raw word counts, ignoring word order (see the sketch after this list)
- TF-IDF
    - Term Frequency × Inverse Document Frequency
    - weights a word by how often it appears in a document and how rare it is across all documents
- Unigrams
    - single words used as features
- Bigrams
    - pairs of consecutive words used as features
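
A minimal sketch of BOW and TF-IDF with scikit-learn (an assumed library choice; the two sample documents are made up). Setting `ngram_range=(1, 2)` turns both unigrams and bigrams into features; scikit-learn computes a smoothed variant of tf(t, d) × log(N / df(t)):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# BOW: each document becomes a vector of raw token counts.
bow = CountVectorizer(ngram_range=(1, 2))
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # unigrams and bigrams, e.g. 'cat', 'cat sat'
print(bow_matrix.toarray())

# TF-IDF: the same counts, reweighted so that terms appearing in
# every document (like 'the') get the minimum idf weight, while
# terms unique to one document ('cat', 'dog') are boosted.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
print(tfidf.fit_transform(docs).toarray().round(2))
```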
Text preprocessing 3