What is word_tokenize in Python?

Sadia Zubair


In natural language processing (NLP), tokenization divides a string into a list of tokens. Tokens are useful for finding meaningful patterns in text and for replacing sensitive data components with non-sensitive ones.

A token can be thought of as a word in a sentence or a sentence in a paragraph.

word_tokenize is a function provided by the NLTK library in Python that splits a given sentence into words.

Figure 1 below shows the tokenization of a sentence into words.

Figure 1: Splitting of a sentence into words.

In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library.

Installation of NLTK

With Python 2.x, NLTK can be installed with the command shown below:

pip install nltk

With Python 3.x, NLTK can be installed with the command shown below:

pip3 install nltk
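
To confirm that the package is available, the import can be checked from the command line. This is an optional sanity check; whether the interpreter is invoked as python or python3 depends on the local setup:

# optional: verify that nltk can be imported and print its version
python3 -c "import nltk; print(nltk.__version__)"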

However, the installation is not yet complete. The code shown below needs to be run in a Python file:

import nltk
nltk.download()

When this code is executed, the NLTK downloader interface will pop up. Under the Collections heading, select “all” and then click “download” to finish the installation.
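
Alternatively, if only word_tokenize is needed, the full collection does not have to be downloaded. The sketch below is a minimal alternative that fetches just the tokenizer models programmatically; the punkt resource is the one word_tokenize has traditionally relied on, and the extra punkt_tab download is a precaution for newer NLTK releases:

import nltk

# download only the tokenizer models used by word_tokenize
nltk.download('punkt')

# some newer NLTK releases look for this resource instead; the extra
# download is harmless if it is not needed
nltk.download('punkt_tab')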

Example

The code below demonstrates how the word_tokenize function operates.

Some special characters, such as commas, are also treated as tokens.

  • In line 1, the word_tokenize function is imported from the nltk.tokenize module.
  • In line 3, the sentence to be tokenized is defined; the comma it contains will appear in the output as a separate token.
from nltk.tokenize import word_tokenize

data = "Hello, Awesome User"

# tokenization of the sentence into words
tokens = word_tokenize(data)

# printing the tokens
print(tokens)
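
Running this code prints ['Hello', ',', 'Awesome', 'User'], with the comma appearing as a separate token. To illustrate the earlier point that a token can also be a sentence in a paragraph, the short sketch below (reusing the same downloaded NLTK data and a made-up sample paragraph) applies sent_tokenize before word_tokenize:

from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "Hello, Awesome User. Welcome to tokenization."

# tokenization of the paragraph into sentences
sentences = sent_tokenize(paragraph)
print(sentences)

# tokenization of each sentence into words
for sentence in sentences:
    print(word_tokenize(sentence))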
