Natural languages are uncertain and ambiguous, so they are best modelled using probabilities.

N-Gram models

N-Gram character models

These model the probability distribution over sequences of characters.

Note

The term “n-gram” is a generalisation denoting a contiguous sequence of n items. Specifically named n-grams include ‘unigram’ (1-gram), ‘bigram’ (2-gram) and ‘trigram’ (3-gram).

An n-gram character model predicts the probability of each consecutive sequence of n characters; equivalently, it predicts the probability of a character given the n − 1 characters that precede it. Probabilities are estimated by counting how often each sequence occurs in a corpus. The probability of the letter ‘e’ following the sequence ‘th’ may be expressed as:

$P(\text{e} \mid \text{th})$
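A minimal sketch of this counting approach, using character trigrams and a maximum-likelihood estimate; the toy corpus and the helper names (train_char_trigrams, estimate) are illustrative assumptions, not from the notes:

```python
from collections import Counter, defaultdict

def train_char_trigrams(text):
    """Count, for each two-character context, which character follows it."""
    context_counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        context, ch = text[i:i + 2], text[i + 2]
        context_counts[context][ch] += 1
    return context_counts

def estimate(context_counts, context, ch):
    """Maximum-likelihood estimate of P(ch | context) from raw counts."""
    counts = context_counts[context]
    total = sum(counts.values())
    return counts[ch] / total if total else 0.0

corpus = "the quick brown fox jumps over the lazy dog. the end."
model = train_char_trigrams(corpus)
print(estimate(model, "th", "e"))  # P(e | th) — 1.0 in this toy corpus
```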

N-Gram word models

These are similar to n-gram character models, but use the word, rather than the character, as the smallest unit.
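The same counting idea carries over directly; a minimal sketch of a bigram word model, assuming a toy corpus and simple whitespace tokenisation:

```python
from collections import Counter, defaultdict

def train_word_bigrams(text):
    """Count which words follow each word in a whitespace-tokenised corpus."""
    words = text.lower().split()
    next_counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        next_counts[prev][cur] += 1
    return next_counts

corpus = "the cat sat on the mat and the cat slept"
model = train_word_bigrams(corpus)
total = sum(model["the"].values())
print(model["the"]["cat"] / total)  # P(cat | the) = 2/3 in this toy corpus
```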

Applications

N-gram models have found uses in:

  • Language identification (see the sketch after this list)
  • Spelling correction
  • Named entity recognition
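To illustrate the first application, a minimal sketch of language identification that scores a sample under per-language character-bigram models and picks the highest log-likelihood; the tiny training snippets, vocab_size and add-one smoothing are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def bigram_model(text):
    """Count, for each character, which character follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_likelihood(model, text, vocab_size=128):
    """Score text under a model, with add-one smoothing for unseen bigrams."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        counts = model[a]
        score += math.log((counts[b] + 1) / (sum(counts.values()) + vocab_size))
    return score

models = {
    "english": bigram_model("the cat is on the mat and the dog is here"),
    "spanish": bigram_model("el gato esta en la mesa y el perro esta aqui"),
}
sample = "the dog and the cat"
print(max(models, key=lambda lang: log_likelihood(models[lang], sample)))
```

The sample shares many bigrams (‘th’, ‘he’, ‘e ’) with the English training text and few with the Spanish one, so the English model assigns it the higher likelihood.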