Pre-processing text
Common pre-processing operations include:
- Casefolding - normalise text case
- Punctuation removal
- Stop word removal
- Word normalisation:
  - Stemming
  - Lemmatisation
- Tokenisation
Note
Stop words are commonly occurring words which do not bear any specific, intrinsic meaning, such as “the”, “is” and “of”.
Normalisation attempts to merge occurrences of variants of a common stem word, such as “play”, “playing” and “played”.
Stemming does this by stripping suffixes to retrieve the root, which may or may not itself be a ‘valid’ word. Lemmatisation, however, identifies the lemma of the word, which is guaranteed to be a valid word.
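The operations listed above can be chained into a small pipeline. The following is a minimal sketch using only the Python standard library; the stop-word list, the suffix list and the `naive_stem` helper are illustrative assumptions rather than an established stemming algorithm, and lemmatisation is omitted because it needs a dictionary of lemmas.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}

# Suffixes tried in order by the naive stemmer below (hypothetical rules).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    text = text.casefold()                                  # case folding
    text = text.translate(
        str.maketrans("", "", string.punctuation)           # punctuation removal
    )
    tokens = text.split()                                   # whitespace tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop word removal
    return [naive_stem(t) for t in tokens]                  # stemming


print(preprocess("The children played, and the dog is playing!"))
# ['children', 'play', 'dog', 'play']
```

Both “played” and “playing” collapse to “play”, while “children” is left alone: mapping it to “child” would require lemmatisation rather than suffix stripping.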
Text classification
Text classification seeks to categorise text into one of a set of classes.
A Bag of Words model represents each input document as a set of pairs, where the key is the unigram token (character/word) and the value is the number of occurrences in the document. This approach does not maintain any notion of unigram order.
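As a rough illustration, a Bag of Words can be built with `collections.Counter` from the Python standard library. The `bag_of_words` helper and the two example documents are assumptions made for this sketch.

```python
from collections import Counter


def bag_of_words(tokens: list[str]) -> Counter:
    """Map each unigram to its number of occurrences; order is discarded."""
    return Counter(tokens)


doc_a = "the dog chased the cat".split()
doc_b = "the cat chased the dog".split()

print(bag_of_words(doc_a)["the"])                   # 2
print(bag_of_words(doc_a) == bag_of_words(doc_b))   # True
```

The two documents mean different things but produce identical bags, which is exactly the loss of unigram order described above.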
Information retrieval
Information retrieval (IR) is the process of finding and presenting a set of results from a corpus according to a search query.
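Given a query language representing a Boolean expression over unigrams (e.g. a conjunction of words), replace each unigram in the expression with true if the unigram appears in a given document and false otherwise, then evaluate the resulting expression to determine a match. This method does not provide any obvious means of ordering results.
To allow ordering of results, we prefer a score-based system. A common scoring function, BM25, is defined for a document $D$ and a query $Q$ consisting of unigrams $q_1, \dots, q_n$ as:
$$\operatorname{BM25}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot \frac{\operatorname{TF}(q_i, D) \cdot (k_1 + 1)}{\operatorname{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\operatorname{avgdl}}\right)}$$
This formula requires the definition of some common calculations:
- Term frequency $\operatorname{TF}(q, D)$ - the number of occurrences of unigram $q$ in document $D$.
- Document frequency $\operatorname{DF}(q)$ - the number of documents containing the unigram $q$.
- Inverse document frequency $\operatorname{IDF}(q)$, commonly defined for a corpus of $N$ documents as $\operatorname{IDF}(q) = \log \frac{N - \operatorname{DF}(q) + 0.5}{\operatorname{DF}(q) + 0.5}$.
- Length $|D|$ of a document $D$, measured in unigrams.
We also require some specific values to compute scores using this formula:
- $\operatorname{avgdl}$ - the average length of documents in the corpus.
- $k_1$ - a parameter typically given a value between 1.2 and 2.0.
- $b$ - a parameter typically given the value 0.75.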
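Boolean matching and BM25 scoring as described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: documents are already tokenised into lists of unigrams, the Boolean query is a plain conjunction, the inverse document frequency uses the common form $\log \frac{N - \operatorname{DF}(q) + 0.5}{\operatorname{DF}(q) + 0.5}$, and the defaults `k1=1.5` and `b=0.75` are illustrative choices. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def matches_all(query, doc):
    """Boolean retrieval for a conjunctive query: every unigram must appear."""
    present = set(doc)
    return all(term in present for term in query)


def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a query over a tokenised corpus."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs    # average document length
    tf = Counter(doc)                               # term frequencies in doc
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)    # document frequency
        # Common IDF form (assumed); negative when a term occurs in more
        # than half of the documents.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked", "loudly"],
    ["a", "cat", "and", "a", "dog"],
    ["birds", "sing"],
    ["fish", "swim"],
]

print(matches_all(["cat", "dog"], corpus[2]))   # True
print(matches_all(["cat", "dog"], corpus[0]))   # False

query = ["cat"]
ranked = sorted(
    range(len(corpus)),
    key=lambda i: bm25(query, corpus[i], corpus),
    reverse=True,
)
print(ranked[:2])   # [0, 2]
```

Both documents 0 and 2 contain “cat” exactly once, but document 0 is shorter than the average and so receives the higher score, which illustrates the length normalisation controlled by $b$.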
Information Extraction
Information extraction (IE) is the process of finding fine-grained information relevant to a query. Methods for IE include:
- Regular Expressions (Regex)/Finite state automata
- Probabilistic models
- Conditional random fields
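As an illustration of the first method, the sketch below uses Python's `re` module to extract prices from free text. The pattern, the `extract_prices` helper and the sample sentence are assumptions chosen for this example; a real extractor would need to handle many more formats, such as thousands separators.

```python
import re

# Hypothetical pattern: a currency symbol, digits, and optional two-digit decimals.
PRICE = re.compile(r"[$£€]\d+(?:\.\d{2})?")


def extract_prices(text: str) -> list[str]:
    """Return every substring of text that matches the price pattern."""
    return PRICE.findall(text)


text = "The laptop costs $999.99, the mouse £25 and the cable €7.50."
print(extract_prices(text))   # ['$999.99', '£25', '€7.50']
```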