Handling punctuation and numbers

You may have noticed several newline characters (\n) in the text. In most cases, punctuation, numbers, and extra white space are unnecessary for NLP analysis: they inflate the word count without adding meaning. In this lesson, we’ll talk about removing them as well.
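To make the word-count point concrete, here is a minimal base R sketch. The sample string and the variable names are made up for illustration and are not part of the lesson's dataset.

```r
# Hypothetical sample text with a number, stray punctuation, and a newline
raw <- "Shipping took 3 days - great service !\nWould order again ..."

# Naive tokenization: split on runs of white space
tokens_raw <- strsplit(raw, "[[:space:]]+")[[1]]
length(tokens_raw)    # 12 tokens, including "3", "-", "!", and "..."

# Replace punctuation and digits with a space, trim, then split again
clean <- gsub("[[:punct:][:digit:]]+", " ", raw)
tokens_clean <- strsplit(trimws(clean), "[[:space:]]+")[[1]]
length(tokens_clean)  # 8 tokens, all of them actual words
```

The four extra tokens in the raw count carry no meaning for most analyses, which is why they are usually stripped before counting words.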

Overview of transformations in the tm package

In NLP, stopwords are removed so that the significant words stand out. However, stopwords aren’t the only source of noise when cleaning text data. Text often also includes numbers, punctuation, extra white space, and inconsistent capitalization. Removing or normalizing these elements is an important step toward accurate and effective text processing.

In tm, this kind of cleanup is done with transformations. A transformation is applied to every document in a corpus. Typical transformations remove nontext characters, citations, numbers, and punctuation; others convert documents to plaintext or convert all text to lowercase.
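As a rough sketch of how transformations are applied across a corpus, the example below builds a tiny corpus from two made-up strings and runs several transformations with tm_map. The sample documents and variable names are assumptions for illustration only.

```r
library(tm)

# Two hypothetical documents containing numbers, punctuation,
# a newline, extra spaces, and mixed case
docs <- c("Hello, World!\nThis is   document 1 of 2.",
          "NLP in 2024: cleaning text, step by step.")
corpus <- VCorpus(VectorSource(docs))

# tm_map applies a transformation to every document in the corpus.
# Base R functions such as tolower are wrapped in content_transformer.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned <- vapply(corpus,
                  function(d) paste(as.character(d), collapse = " "),
                  character(1))
cleaned
```

Note that stripWhitespace collapses runs of white space (including newlines) into a single space rather than deleting them, so a leading or trailing space can remain; trimws can be applied afterwards if that matters.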

Transformations included with tm can be listed with the getTransformations() function.
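A minimal call looks like the following; the variable name available is only illustrative.

```r
library(tm)

# Returns a character vector with the names of the predefined transformations
available <- getTransformations()
available
```

In recent tm releases this list includes removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace. Each of these can be passed directly to tm_map.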
