Understanding Tokens and N-Grams

Learn about tokens and n-grams and their usage in natural language processing.

What is a token?

In natural language processing (NLP), a token refers to a sequence of characters that represents a meaningful unit of text. Essentially, tokens are a way of breaking up raw text into smaller, more manageable pieces that can be analyzed and processed by computer algorithms. This process is known as tokenization, and it forms the foundation of many NLP tasks, such as sentiment analysis, machine translation, and named entity recognition.

Tokenization can involve splitting text into individual words, as well as identifying punctuation, capitalization, and other linguistic features that may affect the meaning of a given sentence or document. By breaking text into its constituent tokens, NLP algorithms can better understand and interpret the meaning of natural language text.

The following code illustrates tokenization using the tm package:

library(tm, quietly = TRUE)
sampleText <- as.String("In natural language processing (NLP),
a token refers to a sequence of characters that represents
a meaningful unit of text.")
# produce individual word tokens -----------
Boost_tokenizer(sampleText)

The above code has two significant parts:

  • We define sampleText as a string object (created with as.String()) that holds the text we want to tokenize.

  • We pass sampleText to Boost_tokenizer(), which converts it into a collection of tokens, in this case, individual words.
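
Different tokenizers make different choices about punctuation and capitalization. As a point of comparison, here’s a minimal sketch using tokenize_words() from the tokenizers package (also used in the next example), which by default lowercases the text and strips punctuation; the arguments below simply make those defaults explicit:

library(tokenizers)
sampleText <- "In natural language processing (NLP),
a token refers to a sequence of characters that represents
a meaningful unit of text."
# lowercase the text and drop punctuation while tokenizing -----------
tokenize_words(sampleText, lowercase = TRUE, strip_punct = TRUE)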

Tokens can be more than just words

Tokens aren’t just words. Depending on our problem domain, it may be best to break documents into phrases, sentences, lines, or paragraphs. Here’s an example of tokenizing each line of a document:

library(tokenizers)
sampleText <- "Tokens aren't just words.
Depending on your problem domain, it may be best to break
documents into phrases, sentences, lines, or paragraphs.
Here's an example of tokenizing each line of a document."
# tokenize by lines -----------
tokenize_lines(sampleText)

  • We’ve used tokenize_lines() here instead of Boost_tokenizer() to produce one token per line of the document.
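
The tokenizers package also covers the other units mentioned above. For instance, here’s a minimal sketch that breaks a short passage into sentence tokens with tokenize_sentences():

library(tokenizers)
sampleText <- "Tokens aren't just words. It may be best to break documents into sentences. Here's an example of tokenizing each sentence."
# tokenize by sentences -----------
tokenize_sentences(sampleText)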

Understanding n-grams

Tokens and n-grams represent different ways of breaking text into smaller units.

A token is a sequence of characters that represents a meaningful unit of text. In most cases, tokens are individual words or punctuation marks that have been separated from one another through a process called tokenization.

An n-gram is a contiguous sequence of n items from a given text, where an item can be a word, a character, or even a sentence.

Understanding N-grams

Text                          N-gram
The                           1-gram
The quick                     2-gram
The quick brown               3-gram
The quick brown fox           4-gram
The quick brown fox jumps     5-gram
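
To reproduce the rows of the table above, we can build n-grams of increasing length from the same sentence. Here’s a minimal sketch using Boost_tokenizer() from tm together with ngrams() from the NLP package (loaded along with tm):

library(tm)
tokens <- Boost_tokenizer("The quick brown fox jumps")
# print the first n-gram of each length from 1 to 5 -----------
for (n in 1:5) {
  gram <- ngrams(tokens, n = n)[[1]]
  cat(n, "-gram: ", paste(gram, collapse = " "), "\n", sep = "")
}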

Here’s another example that creates n-grams, this time written as a pipeline:

library(tm)
# tokenize the sentence, then group the tokens into trigrams -----------
"an n-gram is a contiguous sequence of n items from a given text" |>
  Boost_tokenizer() |>
  ngrams(n = 3)

There are two things we should note about the code sample above:

  • We use |>, R’s native pipe operator. It was introduced in R version 4.1.0 and allows for easier and more readable code by chaining together a series of operations in a left-to-right sequence. The pipe operator takes the output of the expression on the left and passes it as the first argument to the function on the right. For example, x |> f(y, z) is equivalent to f(x, y, z). It’s similar to the pipe operator %>% available in the magrittr package.

  • ngrams(n = 3) groups the tokens into three-word phrases (or trigrams). Please experiment with this code by changing the value of n.
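
For instance, here’s the same pipeline with n = 2, which produces two-word phrases (bigrams):

library(tm)
# group the tokens into two-word phrases (bigrams) -----------
"an n-gram is a contiguous sequence of n items from a given text" |>
  Boost_tokenizer() |>
  ngrams(n = 2)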

Why use n-grams?

N-grams and tokens serve different purposes in natural language processing, and neither is inherently better than the other. That being said, n-grams do offer some advantages over tokens in certain contexts:

  • N-grams capture more context: N-grams, especially bigrams and trigrams, capture more context and can give more information about the relationships between words than individual tokens. This can be useful for tasks like language modeling and predicting the next word in a sentence.

  • N-grams are better for certain types of analysis: In some cases, analyzing sequences of n-grams can be more informative than analyzing individual tokens. For example, analyzing the frequencies of bigrams or trigrams can help identify common collocations or phrase structures in a text (see the sketch after this list).

  • N-grams can help with spelling and OCR errors: In cases where there are spelling or OCR errors in a text, using n-grams can help mitigate the impact of those errors. For example, if a document contains the word “thee” instead of “the,” the surrounding words in the trigram “thee quick brown” provide context suggesting that “thee” is most likely a misspelling of “the.”

  • N-grams can improve performance in certain NLP tasks: In some NLP tasks, such as text classification or sentiment analysis, using n-grams, in addition to tokens, can improve the accuracy of the model by capturing more information about the text.
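
As a small illustration of the second point above, the following sketch counts how often each bigram occurs in a sentence (the sample text here is made up for the example); the most frequent bigrams are candidates for common collocations:

library(tm)
text <- "the quick brown fox jumps over the lazy dog while the quick red fox sleeps"
# build bigrams and count how often each one occurs -----------
tokens <- Boost_tokenizer(text)
bigrams <- vapply(ngrams(tokens, n = 2), paste, character(1), collapse = " ")
sort(table(bigrams), decreasing = TRUE)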

In general, whether to use n-grams or tokens depends on the specific task and the characteristics of the text being analyzed. Both approaches have their strengths and weaknesses, and the best approach will depend on the requirements of our problem domain.