How to remove stop words with NLTK library in Python

Introduction

When working with text data in NLP, we usually have to preprocess our data before carrying out the main task.

One common preprocessing step we take is removing stop words.

Let’s get to it.

What exactly are stop words?

Stop words are words in any language or corpus that occur frequently. For some NLP tasks, they do not provide any additional or valuable information to the text containing them. Words like a, they, the, is, an, etc. are usually considered stop words.

Let’s take the title of this article as an example:

How to remove stop words with NLTK library in Python

Words like how, to, with, and in do not tell you what the article is about. Keywords like remove, stop words, NLTK, library, and Python, on the other hand, give a much clearer idea of what to expect from it.

Interestingly, some of these keywords are part of the tags for this article :)

Removing stop words

Here is the title before and after its stop words are removed:

Text: "How to remove stop words with NLTK library in Python"
Tokens: ['how', 'to', 'remove', 'stop', 'words', 'with', 'nltk', 'library', 'in', 'python']
Text without stop words: "remove stop words nltk library python"
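
The result above can be reproduced with a few lines of Python. This is a minimal sketch rather than the exact code behind that output. It assumes the stopwords corpus has already been fetched once with nltk.download('stopwords'), and it tokenizes by lowercasing and splitting on whitespace instead of using an NLTK tokenizer. The names remove_stop_words and FALLBACK_STOP_WORDS are made up for illustration; the fallback set is only there so the sketch still runs when NLTK or its corpus is not available.

```python
# Small stand-in set, used only if NLTK or its stopwords corpus is missing.
# (Hypothetical name; the real NLTK list is much longer.)
FALLBACK_STOP_WORDS = {"a", "an", "the", "is", "they", "how", "to", "with", "in"}

try:
    from nltk.corpus import stopwords
    # Needs a one-time nltk.download("stopwords") beforehand.
    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOP_WORDS = FALLBACK_STOP_WORDS


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words.

    Returns both the full token list and the filtered token list.
    """
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in stop_words]
    return tokens, kept


title = "How to remove stop words with NLTK library in Python"
tokens, kept = remove_stop_words(title)

print("Tokens:", tokens)
print("Text without stop words:", " ".join(kept))
```

Converting the list to a set is a small design choice: membership checks on a set are fast, which matters once you filter more than a single sentence.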

Specializing

Sometimes you may need to add or remove words from your list of stop words.

For example, imagine you’re trying to classify food magazines based on the kinds of food they focus on. You would expect the word food (or similar words) to be mentioned a lot in every magazine, so it would not help you tell one magazine from another.

Hence, food acts as a stop word for this task, and you may consider adding it to your list of stop words.

Luckily, stopwords.words('english') returns a regular Python list, which we can easily modify. Keep in mind that modifying this list in memory does not change the stop words you downloaded to your disk.
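
A customized list might be built like this. Again, this is only a sketch under a few assumptions: the corpus has been downloaded, the variable names are illustrative, and the choice to drop the word not is a hypothetical example of a word you may want to keep in your text for some tasks. The short fallback list only exists so the snippet runs without NLTK installed.

```python
try:
    from nltk.corpus import stopwords
    # Needs a one-time nltk.download("stopwords") beforehand.
    base_stop_words = stopwords.words("english")
except (ImportError, LookupError):
    # Tiny stand-in list so the sketch still runs without NLTK.
    base_stop_words = ["a", "an", "the", "is", "not"]

# Build a new list: drop "not" (hypothetical choice) and add "food".
# The original list is left untouched.
custom_stop_words = [word for word in base_stop_words if word != "not"]
custom_stop_words.append("food")

print("food" in custom_stop_words)  # True
print("not" in custom_stop_words)   # False
print("food" in base_stop_words)    # False
```

Because the changes are made on a new list in memory, the base list and the files on disk stay as they were.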
