In Natural Language Processing, tokenization is the process of dividing a string into a list of tokens. Tokens are useful for finding meaningful patterns in text and can also be used to replace sensitive data with non-sensitive placeholders.
A token can be thought of as a word in a sentence, or a sentence in a paragraph.
WhitespaceTokenizer in Python splits a string on whitespace, i.e., spaces, tabs, and newlines. Python's built-in split method works similarly.
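As a quick illustration of that similarity, the minimal sketch below (using a made-up sample string) shows split in action; with no arguments, it splits on any run of whitespace and discards the empty pieces:
# Sample string containing spaces, a tab, and a newline (made up for illustration)
text = "Good muffins\tcost $3.88\nin New York."

# With no arguments, split() breaks on any run of whitespace
print(text.split())
# Output: ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']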
In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library. The library needs to be imported in the code.
NLTK
With Python 2.x, NLTK can be installed by running:
pip install nltk
With Python 3.x, NLTK can be installed by running:
pip3 install nltk
However, the installation is not yet complete. The code below needs to be run in a Python file:
import nltk
nltk.download()
Upon executing the code, a download interface will pop up. Under the Collections tab, select “all” and then click “Download” to finish the installation.
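If a graphical interface is not available (for example, on a headless server), NLTK data can also be fetched non-interactively by passing a package or collection name to nltk.download. The snippet below is a minimal sketch; the “punkt” tokenizer models are used here only as an example package:
import nltk

# Download a single package by name instead of using the GUI;
# "punkt" (tokenizer models) is just an example.
nltk.download('punkt')

# nltk.download('all') fetches everything, matching the "all"
# collection in the GUI.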
The code below demonstrates how the WhitespaceTokenizer works.
from nltk.tokenize import WhitespaceTokenizer

# Sample text containing spaces and newline characters
data = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."

# Split the text on whitespace and print the resulting tokens
print(WhitespaceTokenizer().tokenize(data))
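Running this prints the tokens split purely on whitespace; note that punctuation stays attached to the neighboring word:
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']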