Natural languages are uncertain and ambiguous, so they are best modelled using probabilities.

N-Gram models

N-Gram character models

These model the probability distribution over sequences of characters.

Note

The term “n-gram” is a generalisation denoting a contiguous sequence of n items. Specifically named n-grams include ‘unigram’ (1-gram), ‘bigram’ (2-gram) and ‘trigram’ (3-gram).

An n-gram character model predicts the probability of each consecutive sequence of n characters; equivalently, it predicts the probability of a character given the n − 1 characters that precede it. Probabilities are estimated by counting how often each sequence occurs in a corpus. The probability of the letter ‘e’ following the sequence ‘th’ may be expressed as:

$P(\text{e} \mid \text{th})$
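A minimal sketch of this counting approach, using character trigrams and a maximum-likelihood estimate; the toy corpus and the helper names (train_char_trigrams, estimate) are illustrative assumptions, not from the notes:

```python
from collections import Counter, defaultdict

def train_char_trigrams(text):
    """Count, for each two-character context, which character follows it."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        context, ch = text[i:i + 2], text[i + 2]
        context_counts[context][ch] += 1
    return context_counts

def estimate(context_counts, context, ch):
    """Maximum-likelihood estimate of P(ch | context) from raw counts."""
    counts = context_counts[context]
    total = sum(counts.values())
    return counts[ch] / total if total else 0.0

corpus = "the quick brown fox jumps over the lazy dog. the end."
model = train_char_trigrams(corpus)
print(estimate(model, "th", "e"))  # P(e | th) — 1.0 in this toy corpus
```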

N-Gram word models

These are similar to n-gram character models, but use the word, rather than the character, as the smallest unit.
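The same counting idea carries over directly; a minimal sketch of a bigram word model, assuming a toy corpus and simple whitespace tokenisation:

```python
from collections import Counter, defaultdict

def train_word_bigrams(text):
    """Count which words follow each word in a whitespace-tokenised corpus."""
    words = text.lower().split()
    next_counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        next_counts[prev][cur] += 1
    return next_counts

corpus = "the cat sat on the mat and the cat slept"
model = train_word_bigrams(corpus)
total = sum(model["the"].values())
print(model["the"]["cat"] / total)  # P(cat | the) = 2/3 in this toy corpus
```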

Applications

N-gram models have found uses in:

  • Language identification (see the sketch after this list)
  • Spelling correction
  • Named entity recognition
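To illustrate the first application, a minimal sketch of language identification that scores a sample under per-language character-bigram models and picks the highest log-likelihood; the tiny training snippets, vocab_size and add-one smoothing are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def bigram_model(text):
    """Count, for each character, which character follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_likelihood(model, text, vocab_size=128):
    """Score text under a model, with add-one smoothing for unseen bigrams."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        counts = model[a]
        score += math.log((counts[b] + 1) / (sum(counts.values()) + vocab_size))
    return score

models = {
    "english": bigram_model("the cat is on the mat and the dog is here"),
    "spanish": bigram_model("el gato esta en la mesa y el perro esta aqui"),
}
sample = "the dog and the cat"
print(max(models, key=lambda lang: log_likelihood(models[lang], sample)))
```

The sample shares many bigrams (‘th’, ‘he’, ‘e ’) with the English training text and few with the Spanish one, so the English model assigns it the higher likelihood.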