Introduction to Abstract Syntax Trees

Abstract Syntax Trees or ASTs are tree representations of code. They are a fundamental part of the way a compiler works. When a compiler transforms some code there are fundamentally the following steps:

1- Lexical Analysis aka Tokenization

During this step, code that you wrote is going to be converted into a set of tokens describing the different parts of your code. This is fundamentally the same method that basic syntax highlighting is using. Tokens don't understand how things fit together and purely focuses on components of a file.

2 - Syntax Analysis aka Parsing

This is the step where we turn our list of tokens into an Abstract Syntax Tree. It converts our tokens into a tree that represents the actual structure of the code. Where previously in the tokens we only had a pair of () we now have an idea of whether it's a function call, a function definition, a grouping or something else.

The equivalent here would be turning our list of words into a data structure representing things such as sentences, what role a certain noun plays in a sentence or whether we are in a listing.

*The Document Object Model (DOM) is the data representation of the objects that comprise the structure and content of a document on the web.

3- Code Generation

Manipulating text is always dangerous; it shows the least amount of context. If you ever tried to manipulate text using string replacements or regular expressions you probably noticed how easy it is to get it wrong.

Even manipulating tokens isn't easy. While we might know what a variable is, if we would want to rename it we would have no insight into things like the scope of the variable or any variables it might clash with.

Once we have manipulated the tree we can then print the tree to output whatever code output is expected.

Reverse Polish Grows on Trees - Computerphile