Trusted answers to developer questions

Related Tags

nltk
python
communitycreator

How to remove stop words with NLTK library in Python

Banjoko Judah

Introduction

When working with text data in NLP, we usually have to preprocess our data before carrying out the main task.

One common preprocessing step we take is removing stop words.

Let’s get to it.

What exactly are stop words?

Stop words are words that occur very frequently in a language or corpus. For many NLP tasks, they add little or no useful information to the text that contains them. Words like a, they, the, is, and an are usually considered stop words.
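
To make the "occur frequently" part concrete, here is a minimal sketch that counts word frequencies in a short made-up sentence. The sample text is an assumption chosen for illustration; real corpora are much larger, but the pattern is usually similar.

```python
from collections import Counter

# A made-up sample sentence, used only for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"

# Count how often each word occurs.
counts = Counter(sample.split())

# The most frequent word says little about what the sentence is about.
most_common_word, frequency = counts.most_common(1)[0]
print(most_common_word, frequency)
```

The word that tops the count is a typical stop word, while the words that describe the content (cat, mat, dog, rug) each appear only once.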

Let’s take the title of this article as an example:

How to remove stop words with NLTK library in Python

Words like how, to, with, and in do not clearly state the topic of the article. However, keywords like remove, stop words, NLTK, library, and Python give a much clearer idea of what to expect from this article.

Interestingly, some of these keywords are part of the tags for this article :)

Removing stop words

While there is no universal list of stop words in NLP, many NLP libraries in Python provide their own. We can also decide to create our own list of stop words.
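
As a rough sketch of the "create our own list" option, the snippet below filters the article title with a tiny hand-written set. The set contents and the helper name remove_stop_words are assumptions made for this example, not something provided by a library.

```python
# A tiny hand-written stop word set (an assumption for this sketch,
# not a list taken from any library).
custom_stop_words = {"how", "to", "with", "in", "a", "an", "the", "is"}


def remove_stop_words(text, stop_words):
    """Return the lowercased words of `text` that are not in `stop_words`."""
    words = text.lower().split()
    return [word for word in words if word not in stop_words]


title = "How to remove stop words with NLTK library in Python"
filtered = remove_stop_words(title, custom_stop_words)
print(filtered)
```

A hand-written set like this is quick to build for a narrow task, but it will miss many common words that a library list already covers.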

Here we will be using the list of stop words provided by the NLTK library, so we don’t have to write our own.

However, before we can use the stop words from the NLTK library, we need to download them.

import nltk

nltk.download('stopwords')

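You need to download the stop words before running the code below. Otherwise, you will likely get a LookupError. The word_tokenize function used below needs its own tokenizer data as well (the punkt package, or punkt_tab on recent NLTK releases), which can be downloaded the same way.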
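
One way to avoid the LookupError without downloading on every run is to check for the resource first. The sketch below assumes NLTK is installed; ensure_nltk_resource is a hypothetical helper written for this example and is not part of NLTK.

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download `package_name` only if `resource_path` is not already available.

    Hypothetical helper for this sketch; it is not part of NLTK.
    Returns True if a download was triggered, False otherwise.
    """
    import nltk  # imported here so the helper can be defined before NLTK is used

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package_name)
        return True


# Typical use (a download happens only when the corpus is missing):
# ensure_nltk_resource("corpora/stopwords", "stopwords")
```

The resource path and the package name are passed separately because NLTK looks resources up by path (for example corpora/stopwords) but downloads them by package name.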

Next, we convert our text to lowercase and split it into a list of its words. Afterwards, we create a new list containing words that are not in the list of stop words.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Add text
text = "How to remove stop words with NLTK library in Python"
print("Text:", text)

# Convert text to lowercase and split it into a list of words
tokens = word_tokenize(text.lower())
print("Tokens:", tokens)

# Remove stop words (a set makes each membership test fast)
english_stopwords = set(stopwords.words('english'))
tokens_wo_stopwords = [word for word in tokens if word not in english_stopwords]
print("Text without stop words:", " ".join(tokens_wo_stopwords))

The output will look like this:

Text: How to remove stop words with NLTK library in Python
Tokens: ['how', 'to', 'remove', 'stop', 'words', 'with', 'nltk', 'library', 'in', 'python']
Text without stop words: remove stop words nltk library python
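
Lowercasing the whole text also lowercases names such as NLTK and Python in the result. If the original casing matters, one option is to lowercase each word only for the comparison. This is a minimal sketch under two assumptions: the stop word set is a small hand-written stand-in, and str.split is used instead of word_tokenize so the snippet runs without any downloads.

```python
# Hand-written stand-in for a stop word list (an assumption for this sketch);
# in practice this could be set(stopwords.words('english')).
stop_set = {"how", "to", "with", "in"}

text = "How to remove stop words with NLTK library in Python"

# Compare the lowercased form, but keep each word as originally written.
kept = [word for word in text.split() if word.lower() not in stop_set]
print(" ".join(kept))
```

The comparison still ignores case, so "How" is removed, but the surviving words keep their original spelling.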

Specializing

Sometimes you may need to add or remove words from your list of stop words.

For example, imagine you’re trying to classify food magazines based on what kinds of foods are the focus. Now, you would expect that the word food (or similar words) would be mentioned a lot. These would not provide valuable information.

In this setting, food behaves like a stop word, so you may consider adding it to your list of stop words.

Luckily, stopwords.words('english') returns a regular Python list which we can easily modify. Keep in mind that this does not change the stop words you downloaded to your disk.

from nltk.corpus import stopwords

# It returns a regular Python list
english_stopwords = stopwords.words('english')

# Add a list of words
english_stopwords.extend(['food', 'meal', 'eat'])

# Add a single word
english_stopwords.append('plate')

# Remove a single word (raises ValueError if the word is not in the list)
english_stopwords.remove('not')
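
If you customize stop words in several places, it can help to build a new collection instead of editing the list in place. The sketch below is one possible approach, not the only one: customise_stop_words is a hypothetical helper, and the base list is a made-up stand-in for stopwords.words('english').

```python
def customise_stop_words(base, extra=(), keep=()):
    """Return a new set: the words in `base` plus `extra`, minus those in `keep`."""
    result = set(base)
    result.update(extra)
    # difference_update silently ignores words that are not present,
    # unlike list.remove, which raises ValueError.
    result.difference_update(keep)
    return result


# Made-up stand-in for stopwords.words('english').
base = ["the", "is", "not", "a"]

custom = customise_stop_words(base, extra=["food", "meal"], keep=["not", "never"])
print(sorted(custom))
```

Because the helper copies the input into a new set, the original list stays untouched and can be reused for other tasks.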

Not just in English

One exciting thing about NLTK’s stop words corpus is that it covers many languages besides English (16 at the time of writing; the exact number depends on the version of the corpus you download).

We can get the list of available languages and use them as shown below.

from nltk.corpus import stopwords

# Print the list of available languages
print(stopwords.fileids())

# Use any of the available languages
french_stopwords = stopwords.words('french')
spanish_stopwords = stopwords.words('spanish')
italian_stopwords = stopwords.words('italian')
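
When the language comes from user input or a configuration file, it may be worth checking it against the available languages first. The sketch below assumes NLTK is installed and the stopwords corpus has been downloaded; stop_words_for is a hypothetical helper written for this example and is not part of NLTK.

```python
def stop_words_for(language):
    """Return the stop words for `language` as a set, or raise ValueError.

    Hypothetical helper for this sketch; it is not part of NLTK.
    """
    from nltk.corpus import stopwords  # requires the 'stopwords' corpus

    available = stopwords.fileids()
    if language not in available:
        raise ValueError(
            f"No stop word list for {language!r}; choose from {available}"
        )
    return set(stopwords.words(language))


# Example use:
# german_stop_words = stop_words_for("german")
```

Checking against stopwords.fileids() keeps the error message explicit about which languages the installed corpus actually provides.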

Thanks for reading!
