Preserve Phrases with N-grams
Explore how to preserve important phrases in text by applying n-gram techniques in R. Learn to preprocess text, generate and sort n-grams, and identify frequent phrase patterns using real examples. This lesson helps you enhance text analysis by capturing multiword expressions essential for deeper natural language processing.
N-grams are phrases
Tokenization can be adjusted to respect lines, sentences, and paragraphs as well as words. But what about phrases? For example, “Frankenstein’s monster” and “philosopher’s stone” are both phrases characteristic of Mary Shelley’s writing. Neither of them would be kept together as a single token by the tokenization strategies we’ve discussed so far. Instead, they require a strategy called n-grams, which treats each run of n consecutive words as one token: two words for a bigram, three for a trigram, and so on.
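To make the idea concrete, here is a minimal base R sketch that slides a window of n words across a sentence. The helper name `make_ngrams` and the sample sentence are made up for illustration; they are not part of any package or of the lesson's corpus.

```r
# Hypothetical helper: build word n-grams from a single string (base R only).
make_ngrams <- function(text, n = 2) {
  # Lowercase the text and split on runs of whitespace to get word tokens.
  words <- strsplit(tolower(text), "\\s+")[[1]]
  # Drop empty strings left behind by leading whitespace.
  words <- words[nzchar(words)]
  # Too few words to form even one n-gram.
  if (length(words) < n) return(character(0))
  # Slide a window of n words across the token vector.
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Made-up sample sentence, used only to show the mechanics.
bigrams <- make_ngrams("the philosopher's stone and the elixir of life", n = 2)
print(bigrams)
```

Because the window moves one word at a time, the n-grams overlap, so “philosopher's stone” survives as a single token alongside its neighbors.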
Most frequent phrases
For our work with Mary Shelley, it’ll be helpful to have a list of her most frequent phrases. The following code produces this list:
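A minimal sketch of such a pipeline, assuming the dplyr and tidytext packages are installed. The `sample_text` data frame is a made-up stand-in for the lesson's corpus, and the variable names are illustrative rather than the lesson's own:

```r
library(dplyr)
library(tidytext)

# Made-up sample lines for illustration; not the lesson's actual corpus.
sample_text <- tibble(
  line = 1:3,
  text = c(
    "the old man sat by the fire",
    "the old man rose from the fire",
    "she saw the old man again"
  )
)

trigram_counts <- sample_text %>%
  # Split the text into overlapping three-word sequences.
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  # Text with fewer than three words can yield NA; drop those rows.
  filter(!is.na(trigram)) %>%
  # Count each trigram and sort from most to least frequent.
  count(trigram, sort = TRUE)

print(trigram_counts)
```

Changing `n = 3` to `n = 2` would produce bigrams instead, and `sort = TRUE` is what places the most frequent phrases at the top of the result.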
This results in a list of the most-used trigrams (three-word phrases). These might be useful in our search for forums Mary Shelley would be wise to use for promotion.
There is a lot to unpack in this code, but there is also a lot to learn. It builds on ...