How to use wordpunct_tokenize() in NLTK

Dian Us Suqlain

In this shot, we will learn what tokenization is and how to use wordpunct_tokenize() in a Python program.

Before moving on, let's understand the role of NLTK in tokenization.

What is NLTK?

NLTK, or Natural Language Toolkit, is an open-source library or platform used to build Python-based programs. It is a suite that contains libraries and programs for statistical language processing.

NLTK supports dozens of features and tasks. It can be used for a wide range of text-processing techniques, including stop word removal, tokenization, stemming, lemmatization, and much more.
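
For instance, here is a minimal sketch of stop word removal and stemming with NLTK. It assumes the stop word corpus has been downloaded with nltk.download('stopwords'); the stems shown in the comments come from the Porter stemmer and may differ with other stemmers.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')                        # one-time download of the stop word lists

words = ['the', 'runners', 'were', 'running', 'quickly']
stops = set(stopwords.words('english'))

# Stop word removal: drop very common words such as 'the' and 'were'
content_words = [w for w in words if w not in stops]

# Stemming: reduce each remaining word to its stem
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in content_words])   # ['runner', 'run', 'quickli']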

Tokenizing

By tokenizing a text, you can conveniently work with smaller pieces of it, called tokens, that are still relatively logical and meaningful.

Tokenizing can be done in either of the following two ways:

  1. tokenize by words
  2. tokenize by sentences

Importing nltk libraries in your project

Tokenizing by word

from nltk.tokenize import word_tokenize

Tokenizing by sentences

from nltk.tokenize import sent_tokenize
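
Here is a small sketch of both tokenizers in action. It assumes the Punkt models have already been downloaded (the download step is shown in the installation section below); the text variable is just an example.

from nltk.tokenize import word_tokenize, sent_tokenize

# Assumes nltk.download('punkt') has already been run (see the installation step below)
text = "NLTK makes tokenizing easy. It splits text into sentences and words."

print(sent_tokenize(text))   # ['NLTK makes tokenizing easy.', 'It splits text into sentences and words.']
print(word_tokenize(text))   # ['NLTK', 'makes', 'tokenizing', 'easy', '.', 'It', 'splits', ...]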

wordpunct_tokenize()

At this point, it should be quite clear why we use NLTK and which of its modules are related to tokenization.

wordpunct_tokenize() is another tokenizer function that can be imported from nltk.tokenize.

Unlike word_tokenize() and sent_tokenize(), which rely on the pre-trained Punkt sentence tokenization models, this particular tokenizer is purely regular-expression based, so it does not need any extra model downloads.
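
Under the hood, wordpunct_tokenize() is essentially a RegexpTokenizer with the pattern \w+|[^\w\s]+, which matches either a run of alphanumeric characters or a run of punctuation. Here is a minimal sketch of the equivalent tokenizer; the sample sentence is just an example.

from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

# A RegexpTokenizer with this pattern produces the same tokens as wordpunct_tokenize()
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')

s = "Don't hesitate - it costs $3.88!"
print(tokenizer.tokenize(s))    # ['Don', "'", 't', 'hesitate', '-', 'it', 'costs', '$', '3', '.', '88', '!']
print(wordpunct_tokenize(s))    # same output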

So, let’s install the packages we require as shown below:

pip install nltk

You can run this command on your machine as well as on Google Colab.
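
Note that pip install nltk only installs the library itself. If you also want to use word_tokenize() or sent_tokenize(), the Punkt models need a one-time download from inside Python; wordpunct_tokenize() itself does not need them:

import nltk

# One-time download of the Punkt sentence tokenization models
# (used by word_tokenize() and sent_tokenize(), not by wordpunct_tokenize())
nltk.download('punkt')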

Now, let's move on to the coding part of wordpunct_tokenize().

NLTK also provides a simpler, regular-expression based tokenizer that splits text on whitespace and punctuation:

>>> from nltk.tokenize import wordpunct_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

Explanation

As we can see, wordpunct_tokenize() splits the text on whitespace and on punctuation, so almost every special symbol is treated as a separate unit.

That is why $ and . appear as tokens of their own, and the price 3.88 is split into '3', '.', and '88'.
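
For comparison, word_tokenize() (which uses the Punkt models downloaded earlier) keeps a numeric value such as 3.88 together as a single token. The session below is a sketch on the same string; the exact output can vary slightly between NLTK versions.

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']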
