Preprocessing steps in Natural Language Processing (NLP)
Overview
Natural Language Processing (NLP) is the ability of a machine to read, write, understand, and derive meaning from human language.
Steps in NLP
- Tokenization
- Stemming
- Lemmatization
- Part-of-speech (POS) tagging
- Named entity recognition
- Chunking
Let’s try to understand them in more detail.
1. Tokenization: We break the text down into tokens. Check the example below to see how this is done.
Text: The cat sat on the bed.
Tokens: The, cat, sat, on, the, bed
2. Stemming: We remove the prefixes and suffixes to obtain the root word. Check the example below to see how it’s done.
List of words: Affection, Affects, Affecting, Affected
Root word: Affect
3. Lemmatization: We group together the different inflected forms of a word into its base word, called the lemma. Check the example below to see how it’s done.
List of words: going, gone, went
Lemma: go
4. Part-of-speech (POS) tagging: We identify the part of speech of each token. Check the example below to see how it’s done.
Sentence: The dog killed the bat.
Parts of speech: Definite article, noun, verb, definite article, noun.
5. Named entity recognition: We classify named entities mentioned in the text into categories such as “People,” “Locations,” “Organizations,” and so on. Check the example below to see how it’s done. (A code sketch also appears after the explanation further down.)
Text: Google CEO Sundar Pichai resides in New York.
Named entities:
Google — Organization
Sundar Pichai — Person
New York — Location
6. Chunking: We pick up individual pieces of information and group them into bigger pieces, such as grouping words into noun phrases. (See the chunking sketch after the explanation further down.)
Example
import nltk
# Download the NLTK data used by the steps below (tokenizers, WordNet, POS tagger)
nltk.download('all')
print("\n")
# Creating tokens of words
print("Creating tokens of words:")
from nltk.tokenize import word_tokenize
text = "My name is Adithya Challa I wrote this shot!"
tokenize_word = word_tokenize(text)
print(tokenize_word)
print("\n")
# Stemming
print("Stemming:")
from nltk.stem import PorterStemmer
words = ["light", "lighting", "lights"]
ps = PorterStemmer()
for w in words:
    # Each form is reduced to the stem "light"
    rootword = ps.stem(w)
    print(rootword)
print("\n")
# Lemmatization: converts the inflected form of a verb into its base word
print("Lemmatization: converts the inflected form of a verb into its base word:")
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb, so "playing" becomes "play"
print(lem.lemmatize("playing", pos="v"))
print("\n")
# POS Tag
print("POS Tag:")
from nltk import word_tokenize, pos_tag
text = "My name is Adithya Challa I wrote this shot!"
print(pos_tag(word_tokenize(text)))
Explanation
- We import the nltk module and download the data that the later steps need.
- Tokenization: We use nltk.tokenize by importing word_tokenize and divide the string of words into tokens.
- Stemming: We use nltk.stem by importing PorterStemmer and remove the prefixes and suffixes to obtain the root word of each entry in the list.
- Lemmatization: We convert the verb form "playing" into its base word by importing WordNetLemmatizer.
- POS tagging: We find the part of speech of each token by importing word_tokenize and pos_tag.
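The example above stops at POS tagging. Named entity recognition (step 5) can be layered on top of it with NLTK's ne_chunk, as in the minimal sketch below. It assumes the maxent_ne_chunker and words resources are available (they are included when everything is downloaded with nltk.download('all')), and note that NLTK reports labels such as PERSON, GPE, and ORGANIZATION rather than the friendly category names used earlier.
from nltk import word_tokenize, pos_tag, ne_chunk
text = "Google CEO Sundar Pichai resides in New York."
# Tag the tokens, then chunk them into a tree whose subtrees are named entities
tree = ne_chunk(pos_tag(word_tokenize(text)))
for subtree in tree:
    # Named entities appear as labelled subtrees; plain tokens are (word, tag) tuples
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(entity, "-", subtree.label())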
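Similarly, chunking (step 6) can be sketched with NLTK's RegexpParser. The grammar below is only an illustrative assumption: it groups an optional determiner, any adjectives, and a noun into a noun-phrase (NP) chunk; other chunk grammars are possible.
from nltk import word_tokenize, pos_tag, RegexpParser
sentence = "The dog killed the bat."
# NP chunk: optional determiner (DT), any number of adjectives (JJ), then a noun (NN)
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
# Parse the POS-tagged tokens into a chunk tree and print it
chunked = parser.parse(pos_tag(word_tokenize(sentence)))
print(chunked)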