What is word_tokenize in Python?

Sadia Zubair


In natural language processing (NLP), tokenization divides a string into a list of tokens. Tokens are useful for finding meaningful patterns in text and for replacing sensitive data components with non-sensitive ones.

A token can be thought of as a word in a sentence or a sentence in a paragraph.

word_tokenize is a function provided by the NLTK library in Python that splits a given sentence into words.

Figure 1 below shows the tokenization of a sentence into words.

Figure 1: Splitting of a sentence into words.

In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library.

Installation of NLTK

With Python 2.x, NLTK can be installed with the command shown below:

pip install nltk

With Python 3.x, NLTK can be installed with the command shown below:

pip3 install nltk
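
To confirm that the package is available, the import can be checked from the command line. This is an optional sanity check; whether the interpreter is invoked as python or python3 depends on the local setup:

# optional: verify that nltk can be imported and print its version
python3 -c "import nltk; print(nltk.__version__)"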

However, the installation is not yet complete. The code shown below needs to be run in a Python file:

import nltk
nltk.download()

When this code is executed, the NLTK downloader interface will pop up. Under the Collections heading, select “all” and then click “download” to finish the installation.
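
Alternatively, if only word_tokenize is needed, the full collection does not have to be downloaded. The sketch below is a minimal alternative that fetches just the tokenizer models programmatically; the punkt resource is the one word_tokenize has traditionally relied on, and the extra punkt_tab download is a precaution for newer NLTK releases:

import nltk

# download only the tokenizer models used by word_tokenize
nltk.download('punkt')

# some newer NLTK releases look for this resource instead; the extra
# download is harmless if it is not needed
nltk.download('punkt_tab')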

Example

The code below demonstrates how the word_tokenize function operates.

Some special characters, such as commas, are also treated as tokens.

  • In line 1, the word_tokenize function is imported from the nltk.tokenize module.
  • In line 3, the sentence to be tokenized is defined; the comma it contains will appear in the output as a separate token.
from nltk.tokenize import word_tokenize

data = "Hello, Awesome User"

# tokenization of the sentence into words
tokens = word_tokenize(data)

# printing the tokens
print(tokens)
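
Running this code prints ['Hello', ',', 'Awesome', 'User'], with the comma appearing as a separate token. To illustrate the earlier point that a token can also be a sentence in a paragraph, the short sketch below (reusing the same downloaded NLTK data and a made-up sample paragraph) applies sent_tokenize before word_tokenize:

from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = "Hello, Awesome User. Welcome to tokenization."

# tokenization of the paragraph into sentences
sentences = sent_tokenize(paragraph)
print(sentences)

# tokenization of each sentence into words
for sentence in sentences:
    print(word_tokenize(sentence))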
