How to remove duplicate words from text using Regex in Python

In this shot, we will use regular expressions (regex) to remove consecutive duplicate words from text.

What is regex?

A regular expression, or regex, is a pattern used to search for something in textual data. A single regex can often replace a dozen lines of hand-written string-handling code. Regex syntax is dense and takes some practice to read, but it pays off quickly in text processing and any other work with text data.
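As a small illustration of a pattern searching through text, the sketch below pulls every four-digit number out of a sentence in one call. The sample sentence and the variable names are made up for this example.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Orders shipped in 2021 and 2023, none in between."

# \d{4} matches exactly four consecutive digits.
years = re.findall(r"\d{4}", text)
print(years)  # ['2021', '2023']
```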

Implementing regex

We are going to use the following regex:

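regex = r"\b(\w+)(?:\W+\1\b)+"

Let’s break down the sections (the r prefix marks a Python raw string, so the backslashes are passed to the regex engine unchanged):

\b: This is a word boundary. It is needed because, in a text like “My thesis is great”, a search for “is” should not match inside “thesis”, which also contains the letters “is”. The boundary restricts the match to whole words.
\w+: \w denotes a word character, i.e., [a-zA-Z0-9_] for ASCII text, and the + makes it match one or more of them, so (\w+) captures a whole word.
(?:...): This is a non-capturing group. It groups the repeated part so that the final + can apply to it, without creating a second numbered group.
\W+: This matches one or more non-word characters, such as the spaces or punctuation between two words.
\1: This is a backreference. It matches the same text that was captured by the first group of parentheses, which in our case is (\w+).
+: This matches whatever is placed before it one or more times.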
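To see the pieces in action, here is a minimal sketch that first shows the effect of the word boundary on the “thesis” example and then inspects what the full pattern captures. The pattern is written as a Python raw string, and the second sample sentence is an assumption added for illustration.

```python
import re

sentence = "My thesis is great"

# Without boundaries, "is" is also found inside "thesis".
print(re.findall(r"is", sentence))      # ['is', 'is']

# With \b on both sides, only the standalone word matches.
print(re.findall(r"\bis\b", sentence))  # ['is']

# The full pattern: group 1 captures a word, and \1 requires the same word to repeat.
pattern = re.compile(r"\b(\w+)(?:\W+\1\b)+")
match = pattern.search("it is is fine")
print(match.group(0))  # the whole repeated run: 'is is'
print(match.group(1))  # the captured word: 'is'
```

Because group 1 holds a single copy of the word, replacing the whole match with group 1 is what collapses a run of repeats into one occurrence.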

Now, let’s take a look at the code:
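The listing is not reproduced here, so what follows is a minimal sketch reconstructed from the explanation, laid out to match the line numbers it mentions. The function name remove_duplicate_words, the samples list, and the sample sentences are assumptions chosen for illustration, and re.IGNORECASE is assumed to be the case-insensitivity flag the explanation refers to.

```python
import re

def remove_duplicate_words(text):
    regex = r"\b(\w+)(?:\W+\1\b)+"  # a word followed by one or more repeats of itself
    return re.sub(regex, r"\1", text, flags=re.IGNORECASE)  # keep only the first occurrence

samples = [
    "Hello hello world world",
    "This is is a test test test",
    "My thesis is great",
]

for sample in samples:
    print(remove_duplicate_words(sample))
```

With these samples, the sketch prints “Hello world”, “This is a test”, and “My thesis is great”. Note that only adjacent repeats are removed: a word that appears again later in the sentence, separated by other words, is left alone.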

Explanation:

In line 1, we import the re module, which allows us to use regex.
In line 3, we define a function that will return text after removing the duplicate words.
In line 4, we define our regex pattern.
In line 5, we use the sub() function of the re module, which returns a new string with every match of the pattern replaced. We pass the regex pattern, the replacement \1 (the first captured word, so each run of repeated words collapses to a single occurrence), the input text, and a flag (re.IGNORECASE in Python's re module) that makes the match case-insensitive.
In lines 7 to 14, we pass in some text data containing duplicate words; the output shows that the duplicates have been removed.

In this way, a single regex makes this text preprocessing step fairly straightforward.

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)