Tokenizing + padding

📙 Notebook: Tokenizer basic examples. 📙 Notebook: Sarcasm detection.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog so much!'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
            # num_words: keep only the most common
            #   num_words words when encoding (here 100).
            # More words -> potentially more accuracy,
            #   but more time to train.
            # oov_token: replace unseen words by "<OOV>"
tokenizer.fit_on_texts(sentences) # build the word index from the texts
# indexing words
word_index = tokenizer.word_index
print(word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7, 'so': 8, 'much': 9}
# "!", ",", capital, ... are removed

👉 tf.keras.preprocessing.text.Tokenizer

# encode sentences
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# [[4, 2, 3, 5],
#  [4, 2, 3, 6],
#  [7, 2, 3, 5, 8, 9]]
# a word missing from the word index is mapped to <OOV> by texts_to_sequences()
# (without oov_token, it would simply be dropped)
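
A quick check with made-up test sentences containing words the tokenizer has never seen:

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]
print(tokenizer.texts_to_sequences(test_data))
# [[4, 1, 2, 3, 5],   # "really" -> 1 (<OOV>)
#  [3, 5, 1, 3, 1]]   # "loves", "manatee" -> 1 (<OOV>)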

👉 tf.keras.preprocessing.sequence.pad_sequences

# make encoded sentences the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences, value=-1,
                       maxlen=5, padding="post", truncating="post")
         # maxlen: max length of an encoded sentence
         # value: value to be filled in (default 0)
         # padding: "pre" (default) pads at the beginning,
         #   "post" at the end of the sentence
         # truncating: if longer than maxlen, cut at the
         #   beginning ("pre", default) or at the end ("post")
print(padded)
# [[ 4  2  3  5 -1]
#  [ 4  2  3  6 -1]
#  [ 7  2  3  5  8]]
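
For comparison, the defaults pad with 0 at the beginning, up to the length of the longest sequence:

print(pad_sequences(sequences))
# [[0 0 4 2 3 5]
#  [0 0 4 2 3 6]
#  [7 2 3 5 8 9]]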

👉 Sarcasm detection dataset.

# read the JSON file (a list of records)
import json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
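
Putting the pieces together, a minimal sketch of tokenizing and padding these headlines with the tools above (keeping the full vocabulary; none of the values below come from the dataset itself):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)   # sentences = the headlines above
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded.shape)   # (number of headlines, length of the longest headline)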

Word embeddings

👉 Embedding projector - visualization of high-dimensional data 👉 Large Movie Review Dataset

IMDB review dataset

📙 Notebook: Train IMDB review dataset. 👉 Video explaining the code.
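
A minimal sketch of how an Embedding layer is usually wired into such a review classifier; vocab_size, embedding_dim and max_length below are illustrative assumptions, not values from the notebook:

import tensorflow as tf

vocab_size = 10000    # assumed vocabulary size
embedding_dim = 16    # assumed size of each word vector
max_length = 120      # assumed padded sentence length

model = tf.keras.Sequential([
    # Embedding: map each word index to a trainable dense vector
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # average the word vectors into one fixed-size vector per sentence
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # e.g. positive / negative
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The learned embedding weights, model.layers[0].get_weights()[0] with shape (vocab_size, embedding_dim), are what the Embedding Projector visualizes.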