Compilers translate programs written in high-level languages into (typically) assembly code. They are likely to be used as part of a toolchain including Preprocessors, Assemblers and Linkers.

Functions of the compiler can be categorised into two main sections, Analysis (usually referred to as front-end) and Synthesis (usually referred to as back-end).

Analysis/Front-End

There are three main stages performed by the front-end:

Lexical Analysis

The Lexical Analyser (usually referred to as the lexer) generates a list of tokens from the input text. A token is an individual unit of code. For example, an identifier would be a single token, as would a literal, an operator or a comma. Each token comprises two main pieces of information:

  • The token type (or token class) such as identifier, string or comma.
  • The token value (or token attribute), if applicable, such as print or Hello, World!. The string in the source code that the token was built from is known as the lexeme.

A Symbol Table is sometimes created at this stage, with limited information.

Syntax Analysis

The Syntax Analyser (usually referred to as the parser) uses the token stream from the lexer to generate a Syntax Tree

This stage ensures that the input follows the grammatical rules for the language. The symbol table is populated during this stage.

Semantic Analysis

The Semantic Analyser traverses the syntax tree generated by the parser to ensure semantic consistency with the language specification. Type-checking and function signatures will be checked and enforced at this stage. The syntax tree and symbol table may be updated with further information (such as types) but no new data structure is created at this stage.

Synthesis/Back-End

The number of stages performed by the back-end is not as well-defined or standardised as the front-end. However, the following four steps are usually performed:

Intermediate Representation Generation

An Intermediate Representation (IR) is generated from the syntax tree and symbol table. This is a platform-independent, low-level representation which is easy to generate, manipulate, and translate.

Target-Independent Optimisation

Any optimisations that can be made directly to the IR are applied here. Typically, these will be logical optimisations such as:

  • Function in-lining
  • Unreachable code elimination
  • Compile-time evaluation
  • Tail-recursion optimisation

Code Generation

At this stage, the IR is translated to the target language. This is usually the first completely target-dependent stage.

Target-Dependent Optimisation

It may be possible to optimise specific sections of the output due to the features of the target language. For example, some branching instructions in may be replaced with conditional execution if the target is within a range of versions of ARM Assembly. Equally, some Arithmetic Operations may be replaced with Shifts and Rotations.