In this shot, we will learn what tokenization is and how to use wordpunct_tokenize()
in a Python program.
Before moving on, you need to understand the role of NLTK in tokenization.
NLTK, or Natural Language Toolkit, is an open-source platform used to build Python programs that work with human language data. It is a suite of libraries and programs for statistical natural language processing.
NLTK supports dozens of features and tasks. It can be used for a wide range of text-processing techniques, including stop word removal, tokenization, stemming, lemmatization, and much more.
By tokenizing a text, you can conveniently work with smaller units of the text, called tokens, instead of handling the whole text at once.
Tokenizing can be done in either of the following two ways, word tokenization or sentence tokenization, using the imports below (a short usage sketch follows them):
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
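As a quick sketch of how these two tokenizers behave (the sentence below is our own example, and it assumes NLTK and its Punkt models are already installed, which is covered next):

from nltk.tokenize import word_tokenize, sent_tokenize

text = "Good muffins cost $3.88 in New York. Please buy me two of them."

# Split the text into word and punctuation tokens
print(word_tokenize(text))
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.']

# Split the text into sentences
print(sent_tokenize(text))
# ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']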
wordpunct_tokenize()
At this point, it should be quite clear why we use NLTK and which of its functions relate to tokenization.
wordpunct_tokenize() is just another tokenizer that can be imported from nltk.tokenize.
Note that it is the word_tokenize() and sent_tokenize() functions shown above that rely on NLTK's Punkt sentence tokenization models; wordpunct_tokenize() itself is a simple regular-expression tokenizer and needs no extra model data.
So, let’s install the packages we require as shown below:
pip install nltk
You can run this command on your machine as well as on Google Colab.
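If you plan to use word_tokenize() or sent_tokenize(), you also need to download the Punkt models once. A minimal setup sketch (the resource name 'punkt' may vary slightly across NLTK versions):

import nltk

# Download the Punkt sentence tokenization models used by word_tokenize() and sent_tokenize();
# wordpunct_tokenize() itself is regex-based and needs no extra model data.
nltk.download('punkt')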
Moving on to the coding part of wordpunct_tokenize():
…
NLTK also provides a simpler, regular-expression based tokenizer that splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
As we can see, wordpunct_tokenize() splits almost all special symbols and treats them as separate tokens: it recognizes punctuation characters and separates them from the surrounding word characters, which is why '$3.88' becomes the four tokens '$', '3', '.', and '88'.
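The difference is easiest to see next to word_tokenize(). As a small comparison sketch (the input sentence is our own example), wordpunct_tokenize() splits strictly at every boundary between word characters and punctuation, while word_tokenize() applies Treebank-style rules that, for example, keep '3.88' together and split the contraction as "n't":

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> wordpunct_tokenize("Don't hesitate, it costs $3.88!")
['Don', "'", 't', 'hesitate', ',', 'it', 'costs', '$', '3', '.', '88', '!']
>>> word_tokenize("Don't hesitate, it costs $3.88!")
['Do', "n't", 'hesitate', ',', 'it', 'costs', '$', '3.88', '!']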