In Natural Language Processing, tokenization is the process of dividing a string into a list of tokens. Tokens are useful for finding meaningful patterns in text and can also be used to replace sensitive data with non-sensitive placeholders.
A token can be thought of as a word in a sentence, or a sentence in a paragraph.
WhitespaceTokenizer in Python splits a string on whitespace, i.e., spaces, tabs, and newlines. Python's built-in split method works similarly.
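As a quick illustration of that similarity, the minimal sketch below (using a made-up sample string) shows split in action; with no arguments, it splits on any run of whitespace and discards the empty pieces:
# Sample string containing spaces, a tab, and a newline (made up for illustration)
text = "Good muffins\tcost $3.88\nin New York."

# With no arguments, split() breaks on any run of whitespace
print(text.split())
# Output: ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']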
In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library. The library needs to be imported in the code.
NLTK
With Python 2.x, NLTK can be installed by running:
pip install nltk
With Python 3.x, NLTK can be installed by running:
pip3 install nltk
However, the installation is not yet complete. The code below needs to be run in a Python file:
import nltk
nltk.download()
Upon executing the code, a download interface will pop up. Under the Collections tab, select “all” and then click “Download” to finish the installation.
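If a graphical interface is not available (for example, on a headless server), NLTK data can also be fetched non-interactively by passing a package or collection name to nltk.download. The snippet below is a minimal sketch; the “punkt” tokenizer models are used here only as an example package:
import nltk

# Download a single package by name instead of using the GUI;
# "punkt" (tokenizer models) is just an example.
nltk.download('punkt')

# nltk.download('all') fetches everything, matching the "all"
# collection in the GUI.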
The code below demonstrates how the WhitespaceTokenizer works.
from nltk.tokenize import WhitespaceTokenizer

# Sample text containing spaces and newline characters
data = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."

# Split the text on whitespace and print the resulting tokens
print(WhitespaceTokenizer().tokenize(data))
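Running this prints the tokens split purely on whitespace; note that punctuation stays attached to the neighboring word:
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']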