Understanding Tokens and N-Grams

Learn about tokens and n-grams and their usage in natural language processing.

What is a token?

In natural language processing (NLP), a token refers to a sequence of characters that represents a meaningful unit of text. Essentially, tokens are a way of breaking up raw text into smaller, more manageable pieces that can be analyzed and processed by computer algorithms. This process is known as tokenization, and it forms the foundation of many NLP tasks, such as sentiment analysis, machine translation, and named entity recognition.

Tokenization can involve splitting text into individual words, as well as identifying punctuation, capitalization, and other linguistic features that may affect the meaning of a given sentence or document. By breaking text into its constituent tokens, NLP algorithms can better understand and interpret the meaning of natural language text.

The following code illustrates tokenization using the tm package:

library(tm, quietly = TRUE)
sampleText <- as.String("In natural language processing (NLP),
a token refers to a sequence of characters that represents
a meaningful unit of text.")
# produce individual word tokens -----------
Boost_tokenizer(sampleText)

The above code has two significant parts:

  • We define sampleText as a string object (created with as.String()) that holds the text we want to tokenize.

  • We pass sampleText to Boost_tokenizer(), which converts it into a collection of tokens, in this case, individual words.
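
Different tokenizers make different choices about punctuation and capitalization. As a point of comparison, here’s a minimal sketch using tokenize_words() from the tokenizers package (also used in the next example), which by default lowercases the text and strips punctuation; the arguments below simply make those defaults explicit:

library(tokenizers)
sampleText <- "In natural language processing (NLP),
a token refers to a sequence of characters that represents
a meaningful unit of text."
# lowercase the text and drop punctuation while tokenizing -----------
tokenize_words(sampleText, lowercase = TRUE, strip_punct = TRUE)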

Tokens can be more than just words

Tokens aren’t just words. Depending on our problem domain, it may be best to break documents into phrases, sentences, lines, or paragraphs. Here’s an example of tokenizing each line of a document:

library(tokenizers)
sampleText <- "Tokens aren't just words.
Depending on your problem domain, it may be best to break
documents into phrases, sentences, lines, or paragraphs.
Here's an example of tokenizing each line of a document."
# tokenize by lines -----------
tokenize_lines(sampleText)

  • We’ve used tokenize_lines() here instead of Boost_tokenizer() to produce one token per line of the document.
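
The tokenizers package also covers the other units mentioned above. For instance, here’s a minimal sketch that breaks a short passage into sentence tokens with tokenize_sentences():

library(tokenizers)
sampleText <- "Tokens aren't just words. It may be best to break documents into sentences. Here's an example of tokenizing each sentence."
# tokenize by sentences -----------
tokenize_sentences(sampleText)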

Understanding n-grams

Tokens and n-grams represent different ways of breaking text into smaller units.

A token is a sequence of characters that represents a meaningful unit of text. In most cases, tokens are individual words or punctuation marks that have been separated from one another through a process called tokenization.

An n-gram is a contiguous sequence of n items from a given text, where an item can be a word, a character, or even a sentence.

Understanding N-grams

Text                          N-gram
The                           1-gram
The quick                     2-gram
The quick brown               3-gram
The quick brown fox           4-gram
The quick brown fox jumps     5-gram
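
To reproduce the rows of the table above, we can build n-grams of increasing length from the same sentence. Here’s a minimal sketch using Boost_tokenizer() from tm together with ngrams() from the NLP package (loaded along with tm):

library(tm)
tokens <- Boost_tokenizer("The quick brown fox jumps")
# print the first n-gram of each length from 1 to 5 -----------
for (n in 1:5) {
  gram <- ngrams(tokens, n = n)[[1]]
  cat(n, "-gram: ", paste(gram, collapse = " "), "\n", sep = "")
}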

Here’s another example that creates n-grams, this time written as a pipeline:

library(tm)
# tokenize the sentence, then group the tokens into trigrams -----------
"an n-gram is a contiguous sequence of n items from a given text" |>
  Boost_tokenizer() |>
  ngrams(n = 3)

There are two things we should note about the code sample above:

  • We use |>, R’s native pipe operator. It was introduced in R version 4.1.0 and allows for easier and more readable code by chaining together a series of operations in a left-to-right sequence. The pipe operator takes the output of the expression on the left and passes it as the first argument to the function on the right. For example, x |> f(y, z) is equivalent to f(x, y, z). It’s similar to the pipe operator %>% available in the magrittr package.

  • ngrams(n = 3) groups the tokens into three-word phrases (or trigrams). Please experiment with this code by changing the value of n.
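
For instance, here’s the same pipeline with n = 2, which produces two-word phrases (bigrams):

library(tm)
# group the tokens into two-word phrases (bigrams) -----------
"an n-gram is a contiguous sequence of n items from a given text" |>
  Boost_tokenizer() |>
  ngrams(n = 2)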

Why use n-grams?

N-grams and tokens serve different purposes in natural language processing, and neither is inherently better than the other. That being said, n-grams do offer some advantages over tokens in certain contexts:

  • N-grams capture more context: N-grams, especially bigrams and trigrams, capture more context and can give more information about the relationships between words than individual tokens. This can be useful for tasks like language modeling and predicting the next word in a sentence.

  • N-grams are better for certain types of analysis: In some cases, analyzing sequences of n-grams can be more informative than analyzing individual tokens. For example, analyzing the frequencies of bigrams or trigrams can help identify common collocations or phrase structures in a text (see the sketch after this list).

  • N-grams can help with spelling and OCR errors: In cases where there are spelling or OCR errors in a text, using n-grams can help mitigate the impact of those errors. For example, if a document contains the word “thee” instead of “the,” the surrounding words in the trigram “thee quick brown” provide context suggesting that “thee” is most likely a misspelling of “the.”

  • N-grams can improve performance in certain NLP tasks: In some NLP tasks, such as text classification or sentiment analysis, using n-grams, in addition to tokens, can improve the accuracy of the model by capturing more information about the text.
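
As a small illustration of the second point above, the following sketch counts how often each bigram occurs in a sentence (the sample text here is made up for the example); the most frequent bigrams are candidates for common collocations:

library(tm)
text <- "the quick brown fox jumps over the lazy dog while the quick red fox sleeps"
# build bigrams and count how often each one occurs -----------
tokens <- Boost_tokenizer(text)
bigrams <- vapply(ngrams(tokens, n = 2), paste, character(1), collapse = " ")
sort(table(bigrams), decreasing = TRUE)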

In general, whether to use n-grams or tokens depends on the specific task and the characteristics of the text being analyzed. Both approaches have their strengths and weaknesses, and the best approach will depend on the requirements of our problem domain.