How to do tokenization in NLP
Natural language processing (NLP) studies how computers and human language interact. It involves developing algorithms that enable computers to understand human language and extract useful information from it.
What is tokenization?
Tokenization is the process of breaking a sentence or paragraph into smaller chunks called tokens. These tokens may be words, characters, or parts of words (subwords). By tokenizing text, NLP algorithms can operate on small, well-defined units instead of one long raw string, which supports more accurate analysis, modeling, and understanding of textual data.
Now that we know what tokenization is, let's look at some tokenization techniques.
Tokenization techniques
We can tokenize text in various ways, most commonly at the word, character, subword, or sentence level. The right method depends on the language, the library, and the modeling goal.
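To make these granularities concrete, here is a minimal sketch in plain Python. The sample sentence, the `TOY_VOCAB` set, and the `subword_tokenize` helper are all assumptions made up for illustration. Real subword tokenizers learn their vocabulary from data, so treat the greedy longest-match logic below as a toy, not as how any particular library works.

```python
text = "Tokenizers handle unbelievable inputs"

# Word-level: split on whitespace.
word_tokens = text.split()

# Character-level: every character (spaces included) becomes a token.
char_tokens = list(text)

# Subword-level: greedy longest-match against a tiny, hand-written vocabulary.
# TOY_VOCAB is hypothetical; real tokenizers learn their vocabulary from a corpus.
TOY_VOCAB = {"token", "izers", "handle", "un", "believ", "able", "in", "puts"}

def subword_tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [
    piece
    for word in word_tokens
    for piece in subword_tokenize(word.lower(), TOY_VOCAB)
]

print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)
print("Subword tokens:", subword_tokens)
```

Word tokens keep the vocabulary readable but large, character tokens keep it tiny but lose word-level meaning, and subword tokens sit in between.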
Tokenization implementation
Tokenization is an essential preprocessing step in NLP. Let's look at the steps needed to implement it.
Choose a programming language or library
Popular choices for NLP tokenization include Python with libraries such as NLTK, spaCy, and scikit-learn, and Java with Apache OpenNLP.
Apply tokenization techniques
Once the text data is loaded and prepared, it's time to apply the chosen tokenization technique. The specific steps vary with the library or tool we're using, but the general process is the same: call the tokenization function or method provided by the library and pass the text data as input.
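The general process can be sketched as a small pipeline that prepares each document and then hands it to whichever tokenizer we chose. The `prepare` and `tokenize_corpus` helpers and the sample documents are hypothetical names introduced for this illustration, and the whitespace cleanup is only one possible preparation step.

```python
import re
from typing import Callable, Iterable, List

def prepare(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (one possible cleanup step)."""
    return re.sub(r"\s+", " ", text).strip()

def tokenize_corpus(
    texts: Iterable[str],
    tokenizer: Callable[[str], List[str]],
) -> List[List[str]]:
    """Prepare each document, then pass it to the chosen tokenizer."""
    return [tokenizer(prepare(t)) for t in texts]

documents = [
    "  Tokenization is   important for NLP. ",
    "It helps in breaking\ndown text.",
]

# str.split stands in for a library tokenizer here.
tokenized = tokenize_corpus(documents, str.split)
print(tokenized)
```

Because the tokenizer is passed in as a function, a library tokenizer such as NLTK's word_tokenize could be swapped in for str.split without changing the rest of the pipeline.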
Tokenization by using Python's built-in methods
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization
word_tokens = text.split()
print("Word tokens:")
print(word_tokens)

# Sentence tokenization
sentence_tokens = text.split(". ")
print("\nSentence tokens:")
print(sentence_tokens)
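The built-in split() method is convenient, but it knows nothing about punctuation or abbreviations. The short sketch below uses a sample sentence chosen for this illustration to show where it falls short.

```python
text = "Dr. Smith arrived late. He missed the talk!"

# Whitespace splitting leaves punctuation attached to the words.
word_tokens = text.split()
print(word_tokens)

# Splitting on ". " treats the abbreviation "Dr." as a sentence end
# and drops the periods it splits on.
sentence_tokens = text.split(". ")
print(sentence_tokens)
```

Here "late." and "talk!" stay glued to their punctuation, and the text is cut into three pieces even though it contains only two sentences. Library tokenizers and regular expressions are common ways to handle such cases.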
Tokenization by using NLTK
import nltk
nltk.download('punkt')  # newer NLTK releases (3.9+) may require 'punkt_tab' instead
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization using NLTK
# Tokenize the text into individual words
word_tokens = word_tokenize(text)
print("Word tokens:")
print(word_tokens)

# Sentence tokenization using NLTK
# Tokenize the text into individual sentences
sentence_tokens = sent_tokenize(text)
print("\nSentence tokens:")
print(sentence_tokens)
Code explanation
Lines 1–3: We import the NLTK library, download the Punkt tokenizer data, and import the word_tokenize() and sent_tokenize() functions.
Lines 5–12: We define the sample text and use the word_tokenize() function from NLTK to tokenize it into individual words.
Lines 14–18: We use the sent_tokenize() function from NLTK to tokenize the text into individual sentences.
Tokenization by using regular expressions (RegEx)
import re

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization using regular expressions
# Pattern explanation: \b\w+\b matches one or more word characters surrounded by word boundaries
word_tokens = re.findall(r'\b\w+\b', text)
print("Word tokens:")
print(word_tokens)

# Sentence tokenization using regular expressions
# Pattern explanation: (?<=\w\.)\s matches a whitespace character preceded by a word character followed by a period
sentence_tokens = re.split(r'(?<=\w\.)\s', text)
print("\nSentence tokens:")
print(sentence_tokens)
Code explanation
Line 1: We import re, Python's regular expression module.
Lines 3–10: We define the sample text. The re.findall() function then searches the given text for every match of the pattern \b\w+\b. This pattern matches one or more word characters surrounded by word boundaries, so punctuation is left out of the word tokens.
Lines 12–16: The re.split() function splits the given text wherever the pattern (?<=\w\.)\s matches. This pattern matches a whitespace character preceded by a word character followed by a period. The resulting sentence tokens are stored in the sentence_tokens variable.
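The \b\w+\b pattern discards punctuation and splits contractions such as "Don't" into two tokens. When that is not what we want, one possible alternative is a pattern that keeps internal apostrophes and emits each punctuation mark as its own token. The `TOKEN_PATTERN` name and the sample sentence below are assumptions made for this sketch, and the pattern is only one of many reasonable choices.

```python
import re

text = "Don't split contractions, please. Punctuation matters!"

# The \b\w+\b pattern drops punctuation and breaks "Don't" apart.
simple_tokens = re.findall(r"\b\w+\b", text)
print(simple_tokens)

# Alternative: a word with optional internal apostrophes,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
rich_tokens = TOKEN_PATTERN.findall(text)
print(rich_tokens)
```

Which behavior is preferable depends on the task: a bag-of-words model may not need punctuation, while a model that must reproduce or parse sentences usually does.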
As explained in this Answer, there are various ways to tokenize text in NLP. Here's a summary of the word-tokenization functions offered by popular Python libraries:
Tokenization methods in Python libraries
Library | Word-tokenization method
NLTK | nltk.word_tokenize
spaCy | nlp.tokenizer
Gensim | gensim.utils.tokenize
Keras | keras.preprocessing.text.Tokenizer
TextBlob | TextBlob(text).words