Pre-processing text

Common pre-processing operations include:

- Casefolding: normalise text case
- Punctuation removal
- Stop word removal
- Word normalisation:
  - Stemming
  - Lemmatisation
- Tokenisation

Note

Stop words are commonly occurring words that carry little intrinsic meaning on their own, such as “the”, “is” and “of”.

Normalisation attempts to merge occurrences of variants of a common stem word, such as “play”, “playing” and “played”.

Stemming does this by stripping suffixes to retrieve the root, which may or may not itself be a ‘valid’ word. Lemmatisation, however, identifies the lemma of the word, which is guaranteed to be a valid word.
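The operations above can be chained into a small pipeline. The following is a minimal sketch using only the Python standard library; the stop-word list, the suffix list and the names `preprocess` and `naive_stem` are illustrative assumptions, and the suffix stripper is a deliberately crude stand-in for a real stemmer.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}

# Suffixes tried in order by the naive stemmer below (also an assumption).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Casefold, remove ASCII punctuation, tokenise, drop stop words, stem."""
    folded = text.casefold()
    # string.punctuation covers ASCII punctuation only, not curly quotes.
    no_punct = folded.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace tokenisation
    content = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in content]


print(preprocess("The children played, and the dog is playing in the garden."))
# ['children', 'play', 'dog', 'play', 'garden']
```

Here “played” and “playing” are both merged into “play”. The same stemmer turns “running” into “runn”, which illustrates why a stem is not guaranteed to be a valid word.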

Text classification

Text classification seeks to categorise text into one of a set of classes.

A Bag of Words model represents each input document as a set of pairs, where the key is the unigram token (character/word) and the value is the number of occurrences in the document. This approach does not maintain any notion of unigram order.
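As a minimal sketch, a bag of words over word unigrams can be built with `collections.Counter`. The function name `bag_of_words` and the two example documents are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map each unigram token to its number of occurrences."""
    return Counter(tokens)


doc_a = "the cat sat on the mat".split()
doc_b = "the mat sat on the cat".split()

bag_a = bag_of_words(doc_a)
print(bag_a["the"])                   # 2
print(bag_a == bag_of_words(doc_b))   # True: word order is not preserved
```

The two documents differ only in word order, so they produce identical bags.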

Information retrieval

Information retrieval (IR) is the process of finding and presenting a set of results from a corpus according to a search query.

Given a query language representing a boolean expression over unigrams (e.g. a conjunction of words), replace each unigram in the expression with true if it appears in a given document and false otherwise, then evaluate the resulting expression to determine whether the document matches. This method provides no obvious means of ordering results.

To allow results to be ordered, we prefer a score-based system. A common scoring function, BM25, is defined for a document $d_{j}$ and a query $q_{1:N}$ of $N$ unigrams as:

$$\operatorname{BM25}(d_{j}, q_{1:N}) = \sum_{i=1}^{N} \operatorname{IDF}(q_{i}) \cdot \frac{\operatorname{TF}(q_{i}, d_{j}) \cdot (k + 1)}{\operatorname{TF}(q_{i}, d_{j}) + k \cdot \left(1 - b + b \cdot \frac{|d_{j}|}{L}\right)}$$

This formula requires the definition of some common calculations:

Term frequency $\operatorname{TF}(q, d)$ - the number of occurrences of unigram $q$ in document $d$.
Document frequency $\operatorname{DF}(q)$ - the number of documents containing the unigram $q$.
Inverse document frequency, defined for a corpus of $N$ documents (here $N$ is the corpus size, not the number of query unigrams) as

$$\operatorname{IDF}(q) = \log \frac{N - \operatorname{DF}(q) + 0.5}{\operatorname{DF}(q) + 0.5}$$

Length $|d|$ of a document $d$.

We also require some specific values to compute scores using this formula:

$L$ - the average length of documents in the corpus.
$k$ - a parameter typically given the value $2$.
$b$ - a parameter typically given the value $0.75$.
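The scoring function can be sketched directly from the definitions above. This is an illustrative implementation rather than a reference one: the function name `bm25_scores`, the toy corpus and the assumption that documents are already tokenised (and the corpus non-empty) are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k=2.0, b=0.75):
    """Return one BM25 score per tokenised document in `corpus`."""
    n_docs = len(corpus)                                  # corpus size
    avg_len = sum(len(doc) for doc in corpus) / n_docs    # L
    # DF(q): number of documents containing each unigram.
    df = Counter(term for doc in corpus for term in set(doc))

    def idf(term):
        return math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))

    scores = []
    for doc in corpus:
        tf = Counter(doc)                                 # TF(q, d) for this document
        length_norm = 1 - b + b * len(doc) / avg_len
        score = sum(
            idf(term) * tf[term] * (k + 1) / (tf[term] + k * length_norm)
            for term in query
        )
        scores.append(score)
    return scores


# Hypothetical toy corpus, already pre-processed into tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "stock markets fell sharply today".split(),
    "rain is expected tomorrow".split(),
]
scores = bm25_scores(["cat"], corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best)  # 0: only the first document contains "cat"
```

Documents that contain none of the query unigrams score zero. Note that with this form of IDF, a unigram appearing in more than half of the documents receives a negative weight, so practical systems often clamp or adjust it.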

Information extraction

Information extraction (IE) is the process of finding fine-grained information relevant to a query. Methods for IE include:

Regular Expressions (Regex)/Finite state automata
Probabilistic models
Conditional random fields
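As a small illustration of the regular-expression approach, the sketch below pulls ISO-style dates out of free text with Python's `re` module. The pattern, the function name `extract_dates` and the sample sentence are assumptions chosen for this example; the pattern does not check that the month and day are valid.

```python
import re

# Assumed target format: ISO-style dates such as 2024-01-15.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text):
    """Return every (year, month, day) triple matched in `text`."""
    return [
        tuple(int(part) for part in match.groups())
        for match in DATE_PATTERN.finditer(text)
    ]


text = "The report was filed on 2023-11-02 and revised on 2024-01-15."
print(extract_dates(text))  # [(2023, 11, 2), (2024, 1, 15)]
```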
