Compilers translate programs written in high-level languages into (typically) assembly code. They are likely to be used as part of a toolchain including Preprocessors, Assemblers and Linkers.

Functions of the compiler can be categorised into two main sections, Analysis (usually referred to as front-end) and Synthesis (usually referred to as back-end).

Analysis/Front-End

There are three main stages performed by the front-end:

Lexical Analysis

The Lexical Analyser (usually referred to as the lexer) generates a list of tokens from the input text. A token is an individual unit of code. For example, an identifier would be a single token, as would a literal, an operator or a comma. Each token comprises two main pieces of information:

The token type (or token class) such as identifier, string or comma.
The token value (or token attribute), if applicable, such as print or Hello, World!. The string in the source code that the token was built from is known as the lexeme.

Token value vs lexeme

The token value may not always be the same as the lexeme for that token. For example, the lexeme "Hello, World!" will likely generate a token with a value Hello, World!, without the quotation marks. Similarly, a token with a value of 10 could have the lexeme 10, 000010, 0xA etc. This minor distinction is rarely taught but it is important to understand the conceptual difference between the two.

A Symbol Table is sometimes created at this stage, with limited information.

Syntax Analysis

The Syntax Analyser (usually referred to as the parser) uses the token stream from the lexer to generate a Syntax Tree

Abstract vs Concrete Syntax Trees

A Concrete Syntax Tree (CST) represents the parsed text exactly, and could be used to recreate the source. An Abstract Syntax Tree (AST) is a simplified CST that preserves only the semantics of the code.

This stage ensures that the input follows the grammatical rules for the language. The symbol table is populated during this stage.

Semantic Analysis

The Semantic Analyser traverses the syntax tree generated by the parser to ensure semantic consistency with the language specification. Type-checking and function signatures will be checked and enforced at this stage. The syntax tree and symbol table may be updated with further information (such as types) but no new data structure is created at this stage.

Synthesis/Back-End

The number of stages performed by the back-end is not as well-defined or standardised as the front-end. However, the following four steps are usually performed:

Intermediate Representation Generation

An Intermediate Representation (IR) is generated from the syntax tree and symbol table. This is a platform-independent, low-level representation which is easy to generate, manipulate, and translate.

Target-Independent Optimisation

Any optimisations that can be made directly to the IR are applied here. Typically, these will be logical optimisations such as:

Function in-lining
Unreachable code elimination
Compile-time evaluation
Tail-recursion optimisation

Code Generation

At this stage, the IR is translated to the target language. This is usually the first completely target-dependent stage.

Target-Dependent Optimisation

It may be possible to optimise specific sections of the output due to the features of the target language. For example, some branching instructions in may be replaced with conditional execution if the target is within a range of versions of ARM Assembly. Equally, some Arithmetic Operations may be replaced with Shifts and Rotations.

CS Notes

Explorer

Compilers

Analysis/Front-End

Lexical Analysis

Syntax Analysis

Semantic Analysis

Synthesis/Back-End

Intermediate Representation Generation

Target-Independent Optimisation

Code Generation

Target-Dependent Optimisation

Graph View

Table of Contents

Backlinks

CS Notes

Explorer

Compilers

Analysis/Front-End §

Lexical Analysis §

Syntax Analysis §

Semantic Analysis §

Synthesis/Back-End §

Intermediate Representation Generation §

Target-Independent Optimisation §

Code Generation §

Target-Dependent Optimisation §

Graph View

Table of Contents

Backlinks

Analysis/Front-End

Lexical Analysis

Syntax Analysis

Semantic Analysis

Synthesis/Back-End

Intermediate Representation Generation

Target-Independent Optimisation

Code Generation

Target-Dependent Optimisation