
Preserve Phrases with N-grams

Explore how to preserve important phrases in text by applying n-gram techniques in R. Learn to preprocess text, generate and sort n-grams, and identify frequent phrase patterns using real examples. This lesson helps you enhance text analysis by capturing multiword expressions essential for deeper natural language processing.

N-grams are phrases

Tokenization can be adjusted to respect lines, sentences, and paragraphs as well as words. But what about phrases? For example, “Frankenstein’s monster” and “philosopher’s stone” are both phrases characteristic of Mary Shelley’s writing. Neither of them would be kept together by the tokenization strategies we’ve discussed so far, because those strategies split the text into single words. Instead, phrases require a strategy called n-grams: sequences of n adjacent words, such as two-word bigrams and three-word trigrams.
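
To make the idea concrete, here is a minimal sketch in base R that slides a window of n words across a sentence. The helper `make_ngrams()` is hypothetical and written only for illustration; it is not part of any package, and the example sentence is made up.

```r
# Hypothetical helper (not from tm or NLP): build every run of n adjacent words.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(character(0))  # too few words to form even one n-gram
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    ""
  )
}

# Made-up example sentence, split on single spaces
tokens <- strsplit("the philosopher's stone was lost", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(tokens, 2)
print(bigrams)
```

Word-level tokenization would separate "philosopher's" from "stone"; the bigrams keep the two words together as one unit.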

Most frequent phrases

For our work with Mary Shelley, it’ll be helpful to have a list of her most frequent phrases. The following code produces this list:

R
library(tm, quietly = TRUE)        # also attaches NLP, which provides ngrams()
library(readtext, quietly = TRUE)
# Read every matching file; each text becomes one element of a character vector
shelleyText <- readtext("data/mws_*.txt")
# Drop any bytes that cannot be converted from UTF-8
shelleyText <- iconv(shelleyText$text, "UTF-8", sub = '')
# This removes Project Gutenberg header and tail -----------
# *** START OF THE PROJECT GUTENBERG EBOOK ??? ***
# useful text is between these two lines
# *** END OF THE PROJECT GUTENBERG EBOOK ??? ***
fromHere <- regexpr(pattern = ' \\*{3}\n', text = shelleyText)
toHere <- regexpr(pattern = '\\*{3} END', text = shelleyText)
for (index in seq_along(shelleyText)) {
  shelleyText[index] <- substr(shelleyText[index],
    # start just after the end of this text's own START line
    start = fromHere[index] + attr(fromHere, which = "match.length")[index],
    # stop just before the END marker
    stop = toHere[index] - 1)
}
# This displays leading n-grams ------------------------
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords('english')) |>
  removeWords(c("I")) |>              # the stopword list is lowercase
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>                # split into word tokens
  ngrams(n = 3) |>                    # every run of three adjacent tokens
  vapply(paste, "", collapse=" ") |>  # join each run into one string
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)

This results in a list of the ten most frequent trigrams (three-word phrases). These might be useful in our search for forums Mary Shelley would be wise to use for promotion.
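
The counting steps at the end of the pipeline can be tried in isolation. The following is a minimal sketch in base R, assuming a short made-up token vector in place of the tokenized novels. It builds trigrams by pasting shifted copies of the vector instead of calling `ngrams()`, so it runs without any packages.

```r
# Made-up tokens standing in for the cleaned, tokenized text
tokens <- c("life", "and", "death", "appeared", "to", "me",
            "life", "and", "death")

# Build trigrams by pasting three shifted copies of the token vector
n <- length(tokens)
trigrams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])

# Count each distinct trigram, then rank from most to least frequent
counts <- sort(table(trigrams), decreasing = TRUE)
print(head(counts, n = 3))
```

Here `table()` tallies how often each distinct phrase occurs, and `sort(decreasing = TRUE)` followed by `head()` keeps only the most frequent ones, just as in the last three steps of the pipeline.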

There is a lot to unpack in this code—but there is also a lot to learn. It builds on ...