Natural languages are uncertain and ambiguous, so they are best modelled probabilistically.
N-Gram models
N-Gram character models
These model the probability distribution over sequences of characters.
Note
The term “n-gram” is a general name for a contiguous sequence of n items. Specifically named n-grams include ‘unigram’ (1-gram), ‘bigram’ (2-gram) and ‘trigram’ (3-gram).
An n-gram character model predicts the probability of a character given the n−1 characters that precede it. Probabilities are estimated by counting sequences in a corpus. The probability of the letter ‘e’ following the sequence ‘th’ may be expressed as:

P(e | th)
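As a minimal sketch of the counting approach, the conditional probability P(e | th) can be estimated as count(“the”) / count(“th”) over a corpus. The toy corpus below is invented purely for illustration:

```python
from collections import Counter

corpus = "this theory says the thin thread then thickens"

# Count all character trigrams and bigrams in the corpus
trigrams = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

# Estimate P(e | th) = count("the") / count("th")
p = trigrams["the"] / bigrams["th"]
print(p)
```

Real systems use much larger corpora and apply smoothing so that unseen sequences do not receive zero probability.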
N-Gram word models
These are similar to n-gram character models, but the smallest unit is a word rather than a character.
Applications
N-gram models have found uses in:
- Language identification
- Spelling correction
- Named entity recognition
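To illustrate the first application, language identification can be done by comparing a text’s character n-gram profile against per-language profiles built from training text. The following is a toy sketch only; the two one-sentence “training” samples are invented and far too small for real use:

```python
from collections import Counter

def trigram_profile(text):
    # Character trigram counts for a text sample
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny illustrative training samples (invented, not real corpora)
profiles = {
    "english": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "german": trigram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}

def identify(text):
    # Pick the language whose trigram profile overlaps most with the text's
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: sum((sample & profiles[lang]).values()))

print(identify("the dog jumps"))
```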