What is WhitespaceTokenizer in Python?

In natural language processing, tokenization is the process of dividing a string into a list of tokens. Tokens are useful for finding valuable patterns and for replacing sensitive data components with non-sensitive ones.

A token can be thought of as a word in a sentence or as a sentence in a paragraph.

WhitespaceTokenizer in Python splits a string on whitespace characters, i.e., spaces, tabs, and newlines.

Python's built-in split function works similarly.
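As a quick sanity check (assuming NLTK is already installed), the sketch below compares `str.split()` called with no arguments against `WhitespaceTokenizer`; both split on any run of whitespace and produce the same tokens for this input:

```python
from nltk.tokenize import WhitespaceTokenizer

text = "hello\tworld\nfoo bar"

# str.split() with no argument splits on any run of whitespace
print(text.split())                          # ['hello', 'world', 'foo', 'bar']

# WhitespaceTokenizer does the same, using a regular expression internally
print(WhitespaceTokenizer().tokenize(text))  # ['hello', 'world', 'foo', 'bar']
```

The two differ mainly in interface: the tokenizer is an object that plugs into the rest of NLTK's tokenizer API, while `split` is a plain string method.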

In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library, which needs to be imported into the code.

Installation of NLTK

With Python 2.x, NLTK can be installed with:

pip install nltk

With Python 3.x, NLTK can be installed with:

pip3 install nltk

However, the installation is not yet complete. The following code needs to be run in a Python file:

import nltk
nltk.download()

Upon executing the code, a downloader interface will pop up. Under the Collections tab, select "all" and then click "Download" to finish the installation.
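On a headless machine, where no GUI can pop up, the same download can be done by passing a package identifier directly to `nltk.download`. The sketch below uses the `punkt` package purely as an example; note that `WhitespaceTokenizer` itself is regex-based and needs no downloaded data at all:

```python
import nltk

# Download a single package by its identifier instead of using the GUI.
# "punkt" (the sentence tokenizer's models) is shown only as an example;
# WhitespaceTokenizer works without any downloaded data.
nltk.download("punkt", quiet=True)
```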

Example

The code below shows how WhitespaceTokenizer works:

from nltk.tokenize import WhitespaceTokenizer

# Sample text containing spaces and newlines (\n)
data = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."

# Split the string on every run of whitespace
print(WhitespaceTokenizer().tokenize(data))
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please',
#  'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
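Beyond `tokenize`, `WhitespaceTokenizer` also offers `span_tokenize`, which yields `(start, end)` character offsets instead of the token strings. This is handy for the use case mentioned earlier, replacing sensitive components in place, since the offsets let you locate each token in the original string. A minimal sketch:

```python
from nltk.tokenize import WhitespaceTokenizer

data = "Good muffins cost $3.88"
tokenizer = WhitespaceTokenizer()

# span_tokenize yields (start, end) offsets into the original string
spans = list(tokenizer.span_tokenize(data))
print(spans)  # [(0, 4), (5, 12), (13, 17), (18, 23)]

# Slicing with a span recovers the corresponding token
print([data[start:end] for start, end in spans])
# ['Good', 'muffins', 'cost', '$3.88']
```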


Copyright ©2024 Educative, Inc. All rights reserved