Compilers translate programs written in high-level languages into (typically) assembly code. They are likely to be used as part of a toolchain including Preprocessors, Assemblers and Linkers.
Functions of the compiler can be categorised into two main sections, Analysis (usually referred to as front-end) and Synthesis (usually referred to as back-end).
Analysis/Front-End
There are three main stages performed by the front-end:
Lexical Analysis
The Lexical Analyser (usually referred to as the lexer) generates a list of tokens from the input text. A token is an individual unit of code. For example, an identifier would be a single token, as would a literal, an operator or a comma. Each token comprises two main pieces of information:
- The token type (or token class) such as
identifier,stringorcomma. - The token value (or token attribute), if applicable, such as
printorHello, World!. The string in the source code that the token was built from is known as the lexeme.
Token value vs lexeme
The token value may not always be the same as the lexeme for that token. For example, the lexeme
"Hello, World!"will likely generate a token with a valueHello, World!, without the quotation marks. Similarly, a token with a value of10could have the lexeme10,000010,0xAetc. This minor distinction is rarely taught but it is important to understand the conceptual difference between the two.
A Symbol Table is sometimes created at this stage, with limited information.
Syntax Analysis
The Syntax Analyser (usually referred to as the parser) uses the token stream from the lexer to generate a Syntax Tree
Abstract vs Concrete Syntax Trees
A Concrete Syntax Tree (CST) represents the parsed text exactly, and could be used to recreate the source. An Abstract Syntax Tree (AST) is a simplified CST that preserves only the semantics of the code.
This stage ensures that the input follows the grammatical rules for the language. The symbol table is populated during this stage.
Semantic Analysis
The Semantic Analyser traverses the syntax tree generated by the parser to ensure semantic consistency with the language specification. Type-checking and function signatures will be checked and enforced at this stage. The syntax tree and symbol table may be updated with further information (such as types) but no new data structure is created at this stage.
Synthesis/Back-End
The number of stages performed by the back-end is not as well-defined or standardised as the front-end. However, the following four steps are usually performed:
Intermediate Representation Generation
An Intermediate Representation (IR) is generated from the syntax tree and symbol table. This is a platform-independent, low-level representation which is easy to generate, manipulate, and translate.
Target-Independent Optimisation
Any optimisations that can be made directly to the IR are applied here. Typically, these will be logical optimisations such as:
- Function in-lining
- Unreachable code elimination
- Compile-time evaluation
- Tail-recursion optimisation
Code Generation
At this stage, the IR is translated to the target language. This is usually the first completely target-dependent stage.
Target-Dependent Optimisation
It may be possible to optimise specific sections of the output due to the features of the target language. For example, some branching instructions in may be replaced with conditional execution if the target is within a range of versions of ARM Assembly. Equally, some Arithmetic Operations may be replaced with Shifts and Rotations.