Essential Terminology

Explore essential vocabulary used in the realm of natural language processing.

Let’s start by introducing some commonly used terms in natural language processing (NLP) that will serve as a baseline throughout the course. More advanced terms will be introduced later as they appear.

Preprocessing techniques

  • Tokenization: Splitting a sentence into a sequence of words or tokens, generally represented as a list or vector rather than a string, e.g., representing "I love natural language" as ["I", "love", "natural", "language"].

  • Normalization: Text preprocessing that puts all words into a consistent form. Examples include converting everything to the same case, expanding contractions, and removing punctuation.

  • Stemming: Removing affixes from words, e.g., converting "playing" to "play".

  • Lemmatization: Converting words to their base dictionary form (lemma); for example, the word "best" would be converted to its base form, "good". Generally, tokenization, normalization, stemming, and lemmatization are the standard preprocessing steps of an NLP pipeline; a minimal sketch combining them follows this list.
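
The following is a minimal sketch of such a pipeline using NLTK. It assumes nltk is installed and that the punkt and wordnet data packages have been downloaded; the exact tokens and stems may differ slightly across NLTK versions.

```python
# A minimal preprocessing sketch using NLTK (assumes `pip install nltk`
# and that the 'punkt' and 'wordnet' data have been downloaded, e.g.,
# via nltk.download('punkt') and nltk.download('wordnet')).
import string

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "I love playing with Natural Language!"

# Normalization: lowercase everything and strip punctuation.
normalized = text.lower().translate(str.maketrans("", "", string.punctuation))

# Tokenization: split the normalized string into a list of words.
tokens = word_tokenize(normalized)
print(tokens)  # ['i', 'love', 'playing', 'with', 'natural', 'language']

# Stemming: strip affixes with the Porter stemmer.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # 'playing' -> 'play'

# Lemmatization: map words to their base (dictionary) form.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'
```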

Linguistic analysis

  • Part-of-Speech (PoS) tagging: Marks words as nouns, pronouns, adverbs, etc., based on their context within a sentence.

  • Named entity recognition (NER): Labeling spans of text with their likely category, such as person name, location, or time, based on the context of the sentence. These labels are useful in determining the subject of a sentence.

  • N-gram: A subsequence of n words or tokens within a sentence. For example, "I love natural language" has these 2-grams: {"I love", "love natural", "natural language"}. A 2-gram is also known as a bigram, and a 3-gram is known as a trigram.

  • Q-gram: Equivalent to an N-gram, but over characters instead of tokens; both are illustrated in the sketch after this list.
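
The sketch below shows PoS tagging and NER with spaCy (assuming spaCy and its small English model en_core_web_sm are installed; the exact tags vary by model version), along with dependency-free n-gram and q-gram helpers of our own.

```python
# PoS tagging and NER with spaCy (assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm`; output varies by model version).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded in California in 1976.")

for token in doc:
    print(token.text, token.pos_)  # e.g., Apple PROPN, founded VERB, ...

for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g., Apple ORG, California GPE, 1976 DATE

# N-grams and q-grams with no dependencies; `ngrams` and `qgrams` are
# our own helper names, not library functions.
def ngrams(tokens, n):
    """Return all contiguous n-token subsequences, joined as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def qgrams(text, q):
    """Return all contiguous q-character substrings."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]

print(ngrams(["I", "love", "natural", "language"], 2))
# ['I love', 'love natural', 'natural language']
print(qgrams("love", 2))  # ['lo', 'ov', 've']
```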

Language processing concepts

  • Corpus: A collection of written texts generally used to train models.

  • Parallel corpus: A collection of texts paired with their corresponding translations or transformed versions, such as text with errors and its corresponding corrected version(s).

  • Confusion set: The set of the most probable words that would appear in a certain context, e.g., a set of nouns that could precede a verb.

  • Language model (LM): Determines the probability distribution over a sequence of words. For example, if we start a sentence with "The car is", a language model trained on previous data would consider the word "parked" likely and the word "swimming" unlikely. Hidden Markov models are one example of a language model. A toy model is sketched after this list.

  • Machine translation (MT): A machine-learning approach to translating one sequence of text into another. In the context of grammar checking, this refers to translating misspelled or misordered text into the correct text. Language translators are another example.
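
To make the language model idea concrete, here is a toy bigram model estimated from a three-sentence corpus. Real language models use far larger corpora plus smoothing, so treat this purely as a sketch.

```python
# A toy bigram language model: P(curr | prev) estimated by counting
# word pairs in a tiny hand-made corpus.
from collections import Counter, defaultdict

corpus = [
    "the car is parked outside",
    "the car is parked near the house",
    "the fish is swimming",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def prob(prev, curr):
    """P(curr | prev), estimated by relative frequency."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(prob("is", "parked"))    # 2/3 in this corpus
print(prob("is", "swimming"))  # 1/3
```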