Preprocessing steps in Natural Language Processing (NLP)
Overview
Natural Language Processing (NLP) is the ability of a machine to read, write, understand, and derive meaning from human language.
Steps in NLP
- Tokenization
- Stemming
- Lemmatization
- Part-of-speech (POS) tagging
- Named entity recognition
- Chunking
Let’s try to understand them in more detail.
1. Tokenization: We break the text down into tokens. Check the example below to see how this is done.
Text: The cat sat on the bed.
Tokens: The, cat, sat, on, the, bed
2. Stemming: We remove the prefixes and suffixes to obtain the root word. Check the example below to see how it’s done.
List of words: Affection, Affects, Affecting, Affected
Root word: Affect
3. Lemmatization: We group together the different inflected forms of a word into its base word, called the lemma. Check the example below to see how it’s done.
List of words: going, gone, went
Lemma: go
4. Part-of-speech (POS) tagging: We identify the part of speech of each token. Check the example below to see how it’s done.
Sentence: The dog killed the bat.
Parts of speech: Definite article, noun, verb, definite article, noun.
5. Named entity recognition: We classify named entities mentioned in the text into categories such as “People,” “Locations,” “Organizations,” and so on. Check the example below to see how it’s done. (A code sketch also appears after the explanation further down.)
Text: Google CEO Sundar Pichai resides in New York.
Named entities:
Google — Organization
Sundar Pichai — Person
New York — Location
6. Chunking: We pick up individual pieces of information and group them into bigger pieces, such as grouping words into noun phrases. (See the chunking sketch after the explanation further down.)
Example
import nltk
# Download the NLTK data used by the steps below (tokenizers, WordNet, POS tagger)
nltk.download('all')
print("\n")
# Creating tokens of words
print("Creating tokens of words:")
from nltk.tokenize import word_tokenize
text = "My name is Adithya Challa I wrote this shot!"
tokenize_word = word_tokenize(text)
print(tokenize_word)
print("\n")
# Stemming
print("Stemming:")
from nltk.stem import PorterStemmer
words = ["light", "lighting", "lights"]
ps = PorterStemmer()
for w in words:
    # Each form is reduced to the stem "light"
    rootword = ps.stem(w)
    print(rootword)
print("\n")
# Lemmatization: converts the inflected form of a verb into its base word
print("Lemmatization: converts the inflected form of a verb into its base word:")
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb, so "playing" becomes "play"
print(lem.lemmatize("playing", pos="v"))
print("\n")
# POS Tag
print("POS Tag:")
from nltk import word_tokenize, pos_tag
text = "My name is Adithya Challa I wrote this shot!"
print(pos_tag(word_tokenize(text)))
Explanation
- We import the nltk module and download the data that the later steps need.
- Tokenization: We use nltk.tokenize by importing word_tokenize and divide the string of words into tokens.
- Stemming: We use nltk.stem by importing PorterStemmer and remove the prefixes and suffixes to obtain the root word of each entry in the list.
- Lemmatization: We convert the verb form "playing" into its base word by importing WordNetLemmatizer.
- POS tagging: We find the part of speech of each token by importing word_tokenize and pos_tag.
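The example above stops at POS tagging. Named entity recognition (step 5) can be layered on top of it with NLTK's ne_chunk, as in the minimal sketch below. It assumes the maxent_ne_chunker and words resources are available (they are included when everything is downloaded with nltk.download('all')), and note that NLTK reports labels such as PERSON, GPE, and ORGANIZATION rather than the friendly category names used earlier.
from nltk import word_tokenize, pos_tag, ne_chunk
text = "Google CEO Sundar Pichai resides in New York."
# Tag the tokens, then chunk them into a tree whose subtrees are named entities
tree = ne_chunk(pos_tag(word_tokenize(text)))
for subtree in tree:
    # Named entities appear as labelled subtrees; plain tokens are (word, tag) tuples
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(entity, "-", subtree.label())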
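Similarly, chunking (step 6) can be sketched with NLTK's RegexpParser. The grammar below is only an illustrative assumption: it groups an optional determiner, any adjectives, and a noun into a noun-phrase (NP) chunk; other chunk grammars are possible.
from nltk import word_tokenize, pos_tag, RegexpParser
sentence = "The dog killed the bat."
# NP chunk: optional determiner (DT), any number of adjectives (JJ), then a noun (NN)
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
# Parse the POS-tagged tokens into a chunk tree and print it
chunked = parser.parse(pos_tag(word_tokenize(sentence)))
print(chunked)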