Tokens - The basic units of text that a large language model (LLM) processes; a token may be a whole word, part of a word, or a punctuation mark.


Q.) Why do token counts differ between model providers?

Each provider trains its own tokenizer on its own corpus, so the vocabularies differ and the same input text can be split into a different number of tokens from one model to another.
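A minimal sketch of why counts diverge: the same greedy longest-match splitter applied to two hypothetical vocabularies (both illustrative, not any real provider's) produces different token counts for the same word.

```python
def greedy_split(word, vocab):
    """Greedily take the longest vocabulary piece at each position,
    falling back to single characters as a last resort."""
    tokens, i = [], 0
    while i < len(word):
        # Scan from the longest possible piece down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab_a = {"transformers"}                 # hypothetical provider A: whole word known
vocab_b = {"transform", "er", "s"}         # hypothetical provider B: only subwords known

print(greedy_split("transformers", vocab_a))  # → ['transformers'] (1 token)
print(greedy_split("transformers", vocab_b))  # → ['transform', 'er', 's'] (3 tokens)
```

Real tokenizers build their vocabularies by training (e.g. BPE merges) on large corpora, but the effect is the same: a different vocabulary yields a different split of identical input.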

Q.) How does a tokenizer behave when it encounters an unusual word?

LLM Tokenizer Unknown Word Rule:

  1. Try matching the whole word against the vocabulary.
  2. If unknown, split it into the largest known subwords (BPE behavior).
  3. As a last resort, fall back to single characters (or bytes), so no input is ever truly out of vocabulary.
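The rule above can be sketched as a whole-word check followed by greedy longest-match splitting. The vocabulary and words below are illustrative, not any real model's:

```python
def tokenize(word, vocab):
    """Tokenize `word` using the unknown-word rule:
    1) return the whole word if it is in the vocabulary;
    2) otherwise split into the largest known subwords;
    3) fall back to single characters if nothing matches."""
    if word in vocab:          # step 1: whole-word match
        return [word]
    tokens, i = [], 0
    while i < len(word):       # step 2: largest known subword at each position
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:  # step 3: single char fallback
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab = {"token", "ization", "un", "usual"}
print(tokenize("tokenization", vocab))  # → ['token', 'ization']
print(tokenize("unusual", vocab))       # → ['un', 'usual']
```

Because of the character-level fallback, even a word the tokenizer has never seen still maps to some sequence of tokens, just a longer one.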