Trusted answers to developer questions

Related Tags

nlp

What is stemming in NLP?

abhilash


Overview

Stemming is an approach that reduces inflected words to their root form, called the stem. It’s a natural language processing technique used to normalize text, words, and documents during preprocessing.

Word inflection is a process through which we alter words to convey a variety of grammatical categories, including tense, case, voice, aspect, person, number, gender, and mood. Because a single word can have many inflected forms, NLP tasks become more complicated when several of those forms appear in the same text.

Consider the words playing, played, and plays. The root word for all these words is play.
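To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy illustration only, not one of the real stemming algorithms covered later; the SUFFIXES rule list and the naive_stem() helper are hypothetical names invented for this example.

```python
# Toy illustration only: a naive suffix stripper, NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list chosen for this example


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ['playing', 'played', 'plays', 'play']:
    print(word, "->", naive_stem(word))
```

All four forms collapse to play here, but a rule list this small breaks quickly on other words, which is why established algorithms are normally used instead.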

Stemming is therefore used in information retrieval, text mining, SEO, web search, indexing, tagging systems, word analysis, and more.
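As a rough sketch of the indexing and search use case, the example below builds a small inverted index keyed by stems, so that a query for one form of a word also finds documents containing its other forms. It assumes NLTK's PorterStemmer; the documents dictionary and the build_index() and search() helpers are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus, invented for this example.
documents = {
    1: "she plays the piano",
    2: "they played football yesterday",
    3: "reading books is fun",
}


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that share a stem with the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


index = build_index(documents)
print(search(index, "playing"))  # documents containing some form of "play"
```

Splitting on whitespace is a simplification; a real pipeline would usually tokenize properly and handle punctuation before stemming.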

Below are some of the stemming algorithms available in Python's NLTK library:

  • Porter stemmer
  • Snowball stemmer
  • Lancaster stemmer

Porter stemmer

The Porter stemmer is well known for its simplicity and speed. It’s designed to remove and replace well-known suffixes of English words. The resulting stem is usually a shorter term with the same root meaning, although it is not always a dictionary word.

In NLTK, we use the PorterStemmer() class to implement the Porter stemmer algorithm.

We use the stem() method of the PorterStemmer() class to stem a given word.

Code

from nltk.stem import PorterStemmer

porter = PorterStemmer()

words = ['plays','playing','played','play']

for word in words:
    print(word, "->", porter.stem(word))
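The words above all reduce to a real word, but that is not guaranteed. The short sketch below runs the same PorterStemmer on a few other words, chosen here purely for illustration, to show that a stem is a normalized key rather than a dictionary entry.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Example words picked for illustration; the results are stems,
# not necessarily dictionary words.
words = ['happiness', 'ponies', 'relational']

for word in words:
    print(word, "->", porter.stem(word))
```

For instance, happiness is reduced to happi. This is usually fine for matching and indexing, where the stem is never shown to a reader.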

Snowball stemmer

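The Snowball stemmer, also known as Porter2, is a revised version of the Porter stemmer. It is generally regarded as more precise than the original and is often described as slightly faster. It’s also designed to remove and replace well-known suffixes of English words.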

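In NLTK, we use the SnowballStemmer() class to implement the Snowball stemmer algorithm. Besides English, it supports a number of other languages, such as French, German, and Spanish. The exact list depends on the installed NLTK version and is available in the SnowballStemmer.languages attribute.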

We use the stem() method of the SnowballStemmer() class to stem a given word.

Code

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

words = ['plays','playing','played','play']

for word in words:
    print(word, "->", snowball.stem(word))
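The following sketch shows two other things the SnowballStemmer() class can do: list the languages it accepts and stem a non-English word. It also compares the English rules with the 'porter' entry, which selects the original Porter algorithm. The example words are borrowed from the NLTK documentation.

```python
from nltk.stem import SnowballStemmer

# Names accepted by the `language` argument in the installed NLTK version.
print(SnowballStemmer.languages)

# Stem a German word.
german = SnowballStemmer(language='german')
print('Autobahnen', "->", german.stem('Autobahnen'))

# Compare the English (Porter2) rules with the original Porter rules.
english = SnowballStemmer(language='english')
original = SnowballStemmer(language='porter')
print('generously', "->", english.stem('generously'), "vs", original.stem('generously'))
```

On generously, the English Snowball rules stop at generous, while the original Porter rules cut further down to gener.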

Lancaster stemmer

The Lancaster stemmer is simple but aggressive, and it frequently over-stems. Over-stemming can leave stems that are unintelligible or that are not real words.

In NLTK, we use the LancasterStemmer() class to implement the Lancaster stemmer algorithm.

We use the stem() method of the LancasterStemmer() class to stem a given word.

Code

from nltk.stem import LancasterStemmer

lancast = LancasterStemmer()

words = ['plays','playing','played','play']

for word in words:
    print(word, "->", lancast.stem(word))
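To see how the three stemmers differ in aggressiveness, one option is to run them side by side on the same words. The sketch below does that and prints a small table; the word list is an arbitrary choice for illustration, and the stemmers dictionary is just a convenience for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer, keyed by a display name.
stemmers = {
    'porter': PorterStemmer(),
    'snowball': SnowballStemmer(language='english'),
    'lancaster': LancasterStemmer(),
}

words = ['maximum', 'presumably', 'provision', 'playing']

# Header row, then one row per word with each stemmer's result.
print(f"{'word':<12}" + "".join(f"{name:<12}" for name in stemmers))
for word in words:
    row = "".join(f"{s.stem(word):<12}" for s in stemmers.values())
    print(f"{word:<12}{row}")
```

In the Lancaster column, maximum becomes maxim, which is a typical example of its heavier trimming. Which stemmer fits best depends on how much precision the task needs.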
