How to do tokenization in NLP

Natural language processing (NLP) is the field concerned with how computers interact with human language. It involves developing algorithms that enable computers to interpret human language and extract useful information from it.

What is tokenization?

Tokenization is the process of breaking text, such as a sentence or paragraph, into smaller chunks called tokens. These tokens may be words, characters, or parts of words (subwords). By tokenizing text, NLP algorithms can operate on small, well-defined units instead of one long string, which makes analysis and modeling of textual data more reliable.
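To make the three granularities concrete, here is a minimal sketch that uses only built-in Python. The sample sentence and the helper name char_ngrams are illustrative assumptions. Character n-grams are only a rough stand-in for the learned subword vocabularies that real models use.

```python
def char_ngrams(word, n=3):
    """Return overlapping character n-grams, a crude stand-in for subword tokens."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

text = "Tokenizers split text"

word_tokens = text.split()             # word-level tokens
char_tokens = list("text")             # character-level tokens for one word
subword_tokens = char_ngrams("split")  # parts of a word

print(word_tokens)     # ['Tokenizers', 'split', 'text']
print(char_tokens)     # ['t', 'e', 'x', 't']
print(subword_tokens)  # ['spl', 'pli', 'lit']
```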

Now that we know what tokenization is, let's look at some tokenization techniques.

Tokenization techniques

Text can be tokenized in several ways, for example at the word, sentence, character, or subword level. The right method depends on the language, the library, and the modeling goal.

Tokenization implementation

Tokenization is an essential step in most NLP pipelines. Let's look at the steps needed to implement it.

Choose a programming language or library

Popular choices for tokenization include Python libraries such as NLTK, spaCy, and scikit-learn, as well as Apache OpenNLP for Java.

Apply tokenization techniques

Once the text data is loaded and prepared, it's time to apply the chosen tokenization technique. The specific steps vary with the library or tool, but the general process is the same: call the tokenization function or method that the library provides and pass the text data as input.
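This general pattern can be sketched independently of any particular library: the tokenizer is simply a callable that takes a string and returns a list of tokens. The helper tokenize_corpus and the sample documents below are hypothetical, and str.split stands in for whatever function the chosen library provides.

```python
from typing import Callable, Iterable, List

def tokenize_corpus(texts: Iterable[str],
                    tokenizer: Callable[[str], List[str]]) -> List[List[str]]:
    """Apply one tokenizer function to every document in a collection."""
    return [tokenizer(text) for text in texts]

documents = ["Load the text.", "Then tokenize it."]

# str.split is a placeholder; a library function that maps a string
# to a list of tokens could be passed in the same way.
tokenized = tokenize_corpus(documents, str.split)
print(tokenized)  # [['Load', 'the', 'text.'], ['Then', 'tokenize', 'it.']]
```

Keeping the tokenizer as a parameter makes it easy to swap techniques later without touching the rest of the pipeline.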

Tokenization: Using Python's inbuilt method
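A minimal sketch of tokenization with nothing but Python's built-in str.split(). Splitting sentences on ". " is a simplifying assumption that only works for plainly punctuated text.

```python
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# With no argument, split() breaks on any run of whitespace.
# Punctuation stays attached to the neighboring word.
word_tokens = text.split()
print(word_tokens[:6])
# ['Tokenization', 'is', 'important', 'for', 'NLP.', 'It']

# Naive sentence split on ". "; the separator is consumed,
# so the first sentence loses its period.
sentence_tokens = text.split(". ")
print(sentence_tokens)
# ['Tokenization is important for NLP', 'It helps in breaking down text into individual units.']
```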

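Tokenization by using NLTK

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer models (only needed once).
# Newer NLTK releases may require nltk.download('punkt_tab') instead.
nltk.download('punkt')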
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."
# Word tokenization using NLTK
# Tokenize the text into individual words
word_tokens = word_tokenize(text)
print("Word tokens:")
print(word_tokens)
# Sentence tokenization using NLTK
# Tokenize the text into individual sentences
sentence_tokens = sent_tokenize(text)
print("\nSentence tokens:")
print(sentence_tokens)

Tokenization by using regular expressions (RegEx)

import re
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."
# Word tokenization using regular expressions
# Pattern explanation: \b\w+\b matches one or more word characters surrounded by word boundaries
word_tokens = re.findall(r'\b\w+\b', text)
print("Word tokens:")
print(word_tokens)
# Sentence tokenization using regular expressions
# Pattern explanation: (?<=\w\.)\s matches a whitespace character preceded by a word character followed by a period
sentence_tokens = re.split(r'(?<=\w\.)\s', text)
print("\nSentence tokens:")
print(sentence_tokens)

Code explanation

Line 1: Import the re module, which provides regular expression support.
Lines 2–3: Define the sample text.
Lines 4–8: The re.findall() function returns every match of the pattern \b\w+\b in the text. The pattern matches runs of one or more word characters bounded by word boundaries, so punctuation is left out of the word tokens.
Lines 9–13: The re.split() function splits the text wherever the pattern (?<=\w\.)\s matches. The pattern matches a whitespace character that is preceded by a word character and a period. Because the period sits inside the lookbehind, it stays attached to its sentence. The resulting sentences are stored in the sentence_tokens variable.
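These two patterns are deliberately simple, and it can help to see where they stop working. The sketch below reuses the same patterns on a made-up sentence (the sample text is an assumption, not part of the example above) that contains a contraction and an abbreviation.

```python
import re

text = "Dr. Smith doesn't agree. She left."

# \b\w+\b drops punctuation and splits the contraction at the apostrophe.
word_tokens = re.findall(r'\b\w+\b', text)
print(word_tokens)
# ['Dr', 'Smith', 'doesn', 't', 'agree', 'She', 'left']

# The lookbehind cannot tell an abbreviation from a sentence end,
# so "Dr." is split off as its own sentence.
sentence_tokens = re.split(r'(?<=\w\.)\s', text)
print(sentence_tokens)
# ['Dr.', "Smith doesn't agree.", 'She left.']
```

Library tokenizers are designed with cases like these in mind, which is one reason to prefer them outside quick experiments.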

Tokenization methods in Python libraries

As this Answer shows, there are various ways to tokenize text in NLP. Here's a summary of the word-tokenization entry points in popular Python libraries:

Library	Word-tokenization method
NLTK	nltk.word_tokenize
spaCy	nlp.tokenizer
Gensim	gensim.utils.tokenize
Keras	keras.preprocessing.text.Tokenizer
TextBlob	TextBlob(text).words
