What does it do?

The tokenizer takes in the base input code as a string, and breaks it down into a list of tokens.

Each token represents the smallest meaningful unit of the code. Think of tokens as atoms: breaking them down any further would be impossible or useless.

This stage does not check for correct syntax or anything like that; it just generates the tokens. Actual syntactic analysis happens in the parsing stage.

The Goal

In our example there are only four types of tokens. Each has a "type" that identifies what kind of token it is, and a value holding the contents of the token:

interface ParenToken {
  type: "paren";
  value: "(" | ")";
}

interface NumberToken {
  type: "number";
  value: string;
}

interface StringToken {
  type: "string";
  value: string;
}

interface NameToken {
  type: "name";
  value: string;
}

type Token = ParenToken | NumberToken | StringToken | NameToken;
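Because every variant carries a literal type field, TypeScript treats Token as a discriminated union and narrows it automatically inside a switch. A small sketch of that in action (the describe helper is hypothetical, not part of the compiler itself):

```typescript
interface ParenToken {
  type: "paren";
  value: "(" | ")";
}

interface NumberToken {
  type: "number";
  value: string;
}

interface StringToken {
  type: "string";
  value: string;
}

interface NameToken {
  type: "name";
  value: string;
}

type Token = ParenToken | NumberToken | StringToken | NameToken;

// Switching on the "type" discriminant narrows `token` to the
// matching interface in each branch, so `token.value` is fully typed.
function describe(token: Token): string {
  switch (token.type) {
    case "paren":
      return token.value === "(" ? "open paren" : "close paren";
    case "number":
      return `number ${token.value}`;
    case "string":
      return `string "${token.value}"`;
    case "name":
      return `name ${token.value}`;
  }
}
```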

Example

(concat "Number: " (add 2 3 ) )

Will become

[
	{type: "paren", value: "("},
	{type: "name", value: "concat"},
	{type: "string", value: "Number: "},
	{type: "paren", value: "("},
	{type: "name", value: "add"},
	{type: "number", value: "2"},
	{type: "number", value: "3"},
	{type: "paren", value: ")"},
	{type: "paren", value: ")"},
]

Breakdown

Each parenthesis becomes a token of type paren with a value of ( or ).

Every function name becomes a token of type name and the value is the function name itself.

Every number and every string becomes a token of type number or string respectively; the value is the number or string itself.

Takeaway

This is probably the easiest part of the compiler, tied with the Code Generation step.

The Tokenizer function itself is fairly simple as well:

  1. Identify the type of the first character
  2. If need be, consume the rest of the token
    1. The entire name "abc" instead of just the character "a"
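The steps above can be sketched as a minimal tokenizer. This is one possible implementation under the rules described in the Breakdown section; the tokenize name and the exact character classes (digits only for numbers, letters only for names, double-quoted strings) are assumptions:

```typescript
interface ParenToken {
  type: "paren";
  value: "(" | ")";
}

interface NumberToken {
  type: "number";
  value: string;
}

interface StringToken {
  type: "string";
  value: string;
}

interface NameToken {
  type: "name";
  value: string;
}

type Token = ParenToken | NumberToken | StringToken | NameToken;

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    const char = input[i];
    if (char === "(" || char === ")") {
      // A parenthesis is a complete token on its own.
      tokens.push({ type: "paren", value: char });
      i++;
    } else if (/\s/.test(char)) {
      // Whitespace only separates tokens; skip it.
      i++;
    } else if (/[0-9]/.test(char)) {
      // First character is a digit: consume the whole number.
      let value = "";
      while (i < input.length && /[0-9]/.test(input[i])) value += input[i++];
      tokens.push({ type: "number", value });
    } else if (char === '"') {
      // Opening quote: consume everything up to the closing quote.
      let value = "";
      i++; // skip the opening quote
      while (i < input.length && input[i] !== '"') value += input[i++];
      i++; // skip the closing quote
      tokens.push({ type: "string", value });
    } else if (/[a-zA-Z]/.test(char)) {
      // First character is a letter: consume the whole name.
      let value = "";
      while (i < input.length && /[a-zA-Z]/.test(input[i])) value += input[i++];
      tokens.push({ type: "name", value });
    } else {
      throw new TypeError(`Unexpected character: ${char}`);
    }
  }
  return tokens;
}
```

Running it on the example input produces the nine tokens shown earlier: two parens, two names, one string, two numbers, and two closing parens, in source order.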