
Related Tags

communitycreator
nlp
machine learning
python

Preprocessing steps in Natural Language Processing (NLP)

Adithya Challa

Overview

Natural Language Processing (NLP) is the ability of a machine to read, write, understand, and derive meaning from human language.

Steps in NLP

  • Tokenization
  • Stemming
  • Lemmatization
  • Part-of-speech (POS) tagging
  • Named entity recognition
  • Chunking

Let’s try to understand them in more detail.

  1. Tokenization: We break down the text into tokens. Check the example below to see how this is done.

Text: The cat sat on the bed.
Tokens: The, cat, sat, on, the, bed
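
For instance, here is a minimal tokenization sketch with NLTK's word_tokenize (assuming the tokenizer data, such as punkt, has already been downloaded, as in the full example at the end of this answer):

from nltk.tokenize import word_tokenize

text = "The cat sat on the bed."
tokens = word_tokenize(text)   # split the sentence into individual tokens
print(tokens)                  # punctuation such as "." is also returned as a token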

  2. Stemming: We remove the prefixes and suffixes to obtain the root word. Check the example below to see how it’s done.

List of words: Affection, Affects, Affecting, Affected
Root word: Affect
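
As a quick sketch, NLTK's PorterStemmer reduces each of these words to a common stem (the stemmer lowercases its output, so the stem is printed as "affect"):

from nltk.stem import PorterStemmer

words = ["Affection", "Affects", "Affecting", "Affected"]
stemmer = PorterStemmer()
for word in words:
    print(stemmer.stem(word))   # every form is reduced to the stem "affect"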

  3. Lemmatization: We group together the different inflected forms of a word into its base form, called the lemma. Check the example below to see how it’s done.

List of words: going, gone, went
Lemma: go
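
Here is a minimal sketch with NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded; passing pos="v" tells the lemmatizer to treat each word as a verb):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["going", "gone", "went"]:
    print(lemmatizer.lemmatize(word, pos="v"))   # each inflected form maps to the lemma "go"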

  4. POS tagging: We identify the parts of speech for different tokens. Check the example below to see how it’s done.

Sentence: The dog killed the bat.
Parts of speech: Definite article, noun, verb, definite article, noun.
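
A minimal sketch with NLTK's pos_tag is shown below; it returns Penn Treebank tags such as DT (determiner), NN (noun), and VBD (past-tense verb), assuming the tagger data has been downloaded:

from nltk import word_tokenize, pos_tag

sentence = "The dog killed the bat."
print(pos_tag(word_tokenize(sentence)))
# e.g., [('The', 'DT'), ('dog', 'NN'), ('killed', 'VBD'), ('the', 'DT'), ('bat', 'NN'), ('.', '.')]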

  5. Named entity recognition: We classify named entities mentioned in the text into categories such as “People,” “Locations,” “Organizations,” and so on. Check the example below to see how it’s done.

Text: Google CEO Sundar Pichai resides in New York.
Named entity recognition:
Google — Organization
Sundar Pichai — Person
New York — Location
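
The combined code example at the end of this answer does not include named entity recognition, so here is a minimal sketch with NLTK's ne_chunk (assuming the required data packages, such as maxent_ne_chunker and words, have been downloaded; note that NLTK labels locations as GPE):

import nltk

text = "Google CEO Sundar Pichai resides in New York."
tagged = nltk.pos_tag(nltk.word_tokenize(text))   # tokenize and POS-tag the text
tree = nltk.ne_chunk(tagged)                      # group tagged tokens into named entities

# Print each detected entity with its label (PERSON, ORGANIZATION, GPE, ...)
for subtree in tree.subtrees():
    if subtree.label() != "S":
        entity = " ".join(word for word, tag in subtree.leaves())
        print(entity, "-", subtree.label())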

  6. Chunking: We pick up individual pieces of information and group them into bigger pieces, such as grouping individual words into phrases. Check the example below to see how it’s done.
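
Chunking is commonly done in NLTK with RegexpParser and a chunk grammar. The grammar below is an illustrative assumption that groups an optional determiner, any adjectives, and a noun into a noun phrase (NP) chunk:

import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"   # a simple noun-phrase chunk pattern
sentence = "The dog killed the bat."

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # chunking works on POS-tagged tokens
chunk_parser = nltk.RegexpParser(grammar)
print(chunk_parser.parse(tagged))   # "The dog" and "the bat" are grouped into NP chunks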

Example

import nltk
nltk.download('all')    # download the NLTK data packages (a one-time step)
print("\n")

# Creating tokens of words
print("Creating tokens of words:")
from nltk.tokenize import word_tokenize
text = "My name is Adithya Challa I wrote this shot!"
tokenized_words = word_tokenize(text)
print(tokenized_words)
print("\n")

# Stemming
print("Stemming:")
from nltk.stem import PorterStemmer
words = ["light", "lighting", "lights"]
ps = PorterStemmer()
for w in words:
    root_word = ps.stem(w)
    print(root_word)
print("\n")

# Lemmatization: converts verb forms into the root word
print("Lemmatization: Converts all verb forms into the root word:")
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize("playing", pos="v"))
print("\n")

# POS tagging
print("POS Tag:")
from nltk import word_tokenize, pos_tag
text = "My name is Adithya Challa I wrote this shot!"
print(pos_tag(word_tokenize(text)))

Explanation

  • Lines 1 and 2: We import the nltk module and download the NLTK data packages.

  • Lines 7–10: We import word_tokenize from nltk.tokenize, split the string of words into tokens, and print them.

  • Lines 15–20: We import PorterStemmer from nltk.stem and remove prefixes and suffixes from each word to obtain its root word.

  • Lines 25–27: We import WordNetLemmatizer and lemmatize "playing" as a verb (pos="v") to obtain its base form.

  • Lines 32–34: We import word_tokenize and pos_tag, then tokenize the text and print each token’s part of speech.
