In this shot, we will learn what tokenization is and how to use wordpunct_tokenize()
in a Python program.
Before moving on, you need to understand the role of NLTK in tokenization.
NLTK, or Natural Language Toolkit, is an open-source platform used to build Python programs that work with human language data. It is a suite of libraries and programs for statistical natural language processing.
NLTK supports dozens of features and tasks. It can be used for a wide range of text-processing techniques, including stop word removal, tokenization, stemming, lemmatization, and much more.
By tokenizing a text, you can conveniently work with smaller units of the text, called tokens, instead of handling the whole text at once.
Tokenizing can be done in either of the following two ways, word tokenization or sentence tokenization, using the imports below (a short usage sketch follows them):
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
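As a quick sketch of how these two tokenizers behave (the sentence below is our own example, and it assumes NLTK and its Punkt models are already installed, which is covered next):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Good muffins cost $3.88 in New York. Please buy me two of them."

# Split the text into word and punctuation tokens
print(word_tokenize(text))
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.']

# Split the text into sentences
print(sent_tokenize(text))
# ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']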
wordpunct_tokenize()
At this point, it should be quite clear why we use NLTK and which of its functions relate to tokenization.
wordpunct_tokenize() is just another tokenizer that can be imported from nltk.tokenize.
Note that it is the word_tokenize() and sent_tokenize() functions shown above that rely on NLTK's Punkt sentence tokenization models; wordpunct_tokenize() itself is a simple regular-expression tokenizer and needs no extra model data.
So, let’s install the packages we require as shown below:
pip install nltk
You can run this command on your machine as well as on Google Colab.
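If you plan to use word_tokenize() or sent_tokenize(), you also need to download the Punkt models once. A minimal setup sketch (the resource name 'punkt' may vary slightly across NLTK versions):

import nltk

# Download the Punkt sentence tokenization models used by word_tokenize() and sent_tokenize();
# wordpunct_tokenize() itself is regex-based and needs no extra model data.
nltk.download('punkt')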
Moving on to the coding part of wordpunct_tokenize():
…
NLTK also provides a simpler, regular-expression based tokenizer that splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
As we can see, wordpunct_tokenize() splits almost all special symbols and treats them as separate tokens: it recognizes punctuation characters and separates them from the surrounding word characters, which is why '$3.88' becomes the four tokens '$', '3', '.', and '88'.
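The difference is easiest to see next to word_tokenize(). As a small comparison sketch (the input sentence is our own example), wordpunct_tokenize() splits strictly at every boundary between word characters and punctuation, while word_tokenize() applies Treebank-style rules that, for example, keep '3.88' together and split the contraction as "n't":

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> wordpunct_tokenize("Don't hesitate, it costs $3.88!")
['Don', "'", 't', 'hesitate', ',', 'it', 'costs', '$', '3', '.', '88', '!']
>>> word_tokenize("Don't hesitate, it costs $3.88!")
['Do', "n't", 'hesitate', ',', 'it', 'costs', '$', '3.88', '!']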