Text preprocessing 1
- Tokenization
    - splits text into smaller units (tokens), e.g. a sentence into individual words (see the first sketch after this list)
- Lemmatization
    - reduces a word to its dictionary base form (lemma), which is always a meaningful word
    - slower than stemming
    - EX: text summarization, chatbots
- Stop words
    - very common words that carry little meaning (to, of, the, a, an, …)
    - you can use a ready-made list or create your own
    - they are usually removed before modeling
- Stemming
    - chops a word down to its root form (stem) using simple rules (see the second sketch after this list)
    - the stem may not be a meaningful word (e.g. "studies" → "studi")
    - faster than lemmatization
    - EX: spam filtering, comment classification
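
A minimal sketch of tokenization and stop-word removal using NLTK (an assumed library choice; the `punkt` and `stopwords` data packages must be downloaded, and exact package names can vary slightly across NLTK versions; the sample sentence is made up):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# One-time downloads: tokenizer model and English stop-word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The cat sat on the mat and looked at the birds."

# Tokenization: convert the sentence into a list of word tokens.
tokens = word_tokenize(text.lower())

# Stop-word removal: drop common low-meaning words; you can also
# extend this set with your own stop words.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

print(filtered)  # ['cat', 'sat', 'mat', 'looked', 'birds']
```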
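
A similar sketch contrasting stemming and lemmatization with NLTK's PorterStemmer and WordNetLemmatizer (the `wordnet` corpus is assumed to be downloadable); the stem is not always a real word, while the lemma always is:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the WordNet dictionary used by the lemmatizer.
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "going", "history"]:
    # Stemming: fast, rule-based suffix stripping; the result may
    # not be a meaningful word ("studi", "histori").
    # Lemmatization: slower dictionary lookup; the result is always
    # a valid word ("study", "history").
    print(word,
          "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
```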
Text preprocessing 2
- BOW (Bag of Words)
    - represents each document as a vector of raw word counts, ignoring word order (see the sketch after this list)
- TF-IDF
    - Term Frequency × Inverse Document Frequency
    - weights a word by how often it appears in a document and how rare it is across all documents
- Unigrams
    - single words used as features
- Bigrams
    - pairs of consecutive words used as features
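
A minimal sketch of BOW and TF-IDF with scikit-learn (an assumed library choice; the two sample documents are made up). Setting `ngram_range=(1, 2)` turns both unigrams and bigrams into features; scikit-learn computes a smoothed variant of tf(t, d) × log(N / df(t)):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# BOW: each document becomes a vector of raw token counts.
bow = CountVectorizer(ngram_range=(1, 2))
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # unigrams and bigrams, e.g. 'cat', 'cat sat'
print(bow_matrix.toarray())

# TF-IDF: the same counts, reweighted so that terms appearing in
# every document (like 'the') get the minimum idf weight, while
# terms unique to one document ('cat', 'dog') are boosted.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
print(tfidf.fit_transform(docs).toarray().round(2))
```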
Text preprocessing 3