How to perform tokenization of text using TextBlob in Python

Natural Language Processing (NLP) is a fast-growing field that works with text data to power applications such as chatbots, sentiment analysis, semantic analysis, and more.

TextBlob is a popular, beginner-friendly Python library for NLP tasks such as sentiment scoring, filtering, and tokenization.

Before we move on, we need to install TextBlob and download the corpora it relies on. To do so, we run the following commands on the command line:

pip install -U textblob 
python -m textblob.download_corpora
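
To confirm that the install succeeded, one option is a quick check from Python. The snippet below is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and not part of TextBlob. It only checks that the package can be found, not that the corpora were downloaded.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located by the import system."""
    return importlib.util.find_spec(package_name) is not None


# Report whether TextBlob is available in the current environment.
print("textblob installed:", is_installed("textblob"))
```

If this prints False, rerunning the pip command in the same environment (for example, the same virtual environment) is usually the first thing to try.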

Tokenization

Before we proceed, it is important to understand the following terms:

Corpus: A collection of text data (in any language) that can be used for semantic analysis, classification, etc.
Token: A string obtained by splitting the input text, such as a sentence, a word, or a character.

Tokenization is the process of splitting the text (corpus) into smaller units, such as sentences, words, or characters.

Example

Suppose that the input text is “I love to eat fast food.”

After applying word tokenization to this input text, the output contains the individual words of the sentence, with the punctuation dropped: [“I”, “love”, “to”, “eat”, “fast”, “food”].

We can also divide a single word into tokens. For instance: banana can be tokenized to b-a-n-a-n-a.
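
Both examples above can be reproduced in plain Python, without any NLP library. This is only an illustration of the idea; it uses a simple regular expression and is not the algorithm that TextBlob uses internally.

```python
import re

text = "I love to eat fast food."

# Word-level tokens: runs of letters, digits, or underscores; punctuation is skipped.
word_tokens = re.findall(r"\w+", text)
print(word_tokens)  # ['I', 'love', 'to', 'eat', 'fast', 'food']

# Character-level tokens: every character of a single word.
char_tokens = list("banana")
print("-".join(char_tokens))  # b-a-n-a-n-a
```

A regular expression like this is enough for simple English sentences, but it handles contractions, abbreviations, and sentence boundaries poorly, which is where a library tokenizer helps.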

Code

Let’s look at the code for tokenizing text using TextBlob.

Explanation

In line 1, we import the TextBlob class from the textblob package.
In lines 3 to 5, we create a sample corpus of text.
In line 7, we create a TextBlob object and pass it the corpus we want to tokenize.
In line 9, we print the corpus tokenized into words.
In line 11, we print the corpus tokenized into sentences.

When we pre-process the text data, tokenization plays an important role. It divides the corpus into sentences, words, or even characters.

TextBlob is a widely used NLP library. It offers a simple API that lets us perform common NLP tasks, such as tokenization, with very little code.

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)