What is the clean() method of the clean-text package in Python?

What is the clean-text package?

clean-text is a third-party package that preprocesses text data to obtain a normalized text representation.

The package can be installed via pip with the following command:

pip install clean-text
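
Note that the name used with pip (clean-text) differs from the name used in import statements (cleantext). The following sketch, which uses only the standard library, checks whether the module can be found in the current environment. The helper name is_installed is made up for this illustration and is not part of the package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The pip distribution is "clean-text", but the importable module is "cleantext".
print("cleantext available:", is_installed("cleantext"))
```

If this prints False, running the pip command above in the same environment should make the module importable.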

clean() method

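The clean() method preprocesses the given text according to the options passed to it. Depending on those options, it can fix Unicode errors, convert characters to ASCII, lowercase the text, normalize whitespace, and replace or remove URLs, emails, phone numbers, numbers, digits, currency symbols, and punctuation.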

Method signature

clean(
    text,
    fix_unicode=True,
    to_ascii=True,
    lower=True,
    normalize_whitespace=True,
    no_line_breaks=False,
    strip_lines=True,
    keep_two_line_breaks=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_digits=False,
    no_currency_symbols=False,
    no_punct=False,
    no_emoji=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    replace_with_punct="",
    lang="en",
)

Parameters

  • text: This is the text to preprocess.
  • fix_unicode=True: This is a boolean value indicating whether to fix broken Unicode characters.
  • to_ascii=True: If this is True, it converts non-ASCII characters into their closest ASCII equivalents.
  • lower=True: If this is True, it converts the text to lowercase.
  • no_line_breaks=False: If this is True, it strips the line breaks from the text.
  • no_urls=False: This is a boolean value indicating whether to replace all the URLs in the text with a special URL token.
  • no_emails=False: This is a boolean value indicating whether to replace all the emails in the text with a special EMAIL token.
  • no_phone_numbers=False: This is a boolean value indicating whether to replace all the phone numbers in the text with a special PHONE token.
  • no_numbers=False: This is a boolean value indicating whether to replace all the numbers in the text with a special NUMBER token.
  • no_digits=False: This is a boolean value indicating whether to replace each digit in the text with a special DIGIT token (0 by default).
  • no_currency_symbols=False: This is a boolean value indicating whether to replace all the currency symbols in the text with a special CURRENCY token.
  • no_punct=False: This is a boolean value indicating whether to remove all the punctuation in the text.
  • replace_with_url="<URL>": This is the special URL token. The default value is <URL>.
  • replace_with_email="<EMAIL>": This is the special EMAIL token. The default value is <EMAIL>.
  • replace_with_phone_number="<PHONE>": This is the special PHONE token. The default value is <PHONE>.
  • replace_with_number="<NUMBER>": This is the special NUMBER token. The default value is <NUMBER>.
  • replace_with_digit="0": This is the special DIGIT token. The default value is 0.
  • replace_with_currency_symbol="<CUR>": This is the special CURRENCY token. The default value is <CUR>.
  • replace_with_punct="": Punctuation is replaced with this string when no_punct is True. The default value is an empty string, which removes the punctuation.
  • lang="en": This specifies the language of the text, which determines the language-specific preprocessing applied. The default value is English ('en'). Other than English, only German ('de') is supported.
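
Each no_* flag in the list above is paired with a replace_with_* token. As a rough sketch of how these might be used together, the example below collects a few of them in a dictionary and passes them to clean(). The keyword names come from the signature above; the dictionary name ANONYMIZE_OPTIONS, the helper anonymize, and the sample text are assumptions made for this illustration. The import is placed inside the function so that the snippet still loads when clean-text is not installed.

```python
# Keyword arguments taken from the clean() signature; grouping them into a
# reusable dictionary (and the name ANONYMIZE_OPTIONS) is only an illustration.
ANONYMIZE_OPTIONS = {
    "no_urls": True,
    "no_emails": True,
    "no_phone_numbers": True,
    "replace_with_url": "<URL>",
    "replace_with_email": "<EMAIL>",
    "replace_with_phone_number": "<PHONE>",
}


def anonymize(text: str) -> str:
    """Clean text with the options above; requires clean-text to be installed."""
    from cleantext import clean  # imported lazily so this file loads without it

    return clean(text, **ANONYMIZE_OPTIONS)


if __name__ == "__main__":
    try:
        print(anonymize("Mail me at jane@example.com or visit https://example.com"))
    except ImportError:
        print("clean-text is not installed; run: pip install clean-text")
```

Keeping the options in one place like this makes it easier to apply the same cleaning settings to every text in a dataset.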

Return value

The method returns the cleaned text as a string, processed according to the parameters passed.

Code example 1

import cleantext
txt = "Hello Educative!!! How are you?"
new_txt = cleantext.clean(txt, no_punct=True)
print("Original String - '" + txt + "'")
print("Modified String after removing punctuations - '" + new_txt + "'")

Code explanation

  • Line 1: We import the cleantext package.
  • Line 2: We define a string called txt that contains punctuation.
  • Line 3: We remove all the punctuation in txt by calling the clean method with no_punct set to True. The result is stored in new_txt. Because lower defaults to True, the result is also converted to lowercase.
  • Lines 4-5: We print the original and the modified strings.
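
To make the effect of no_punct more concrete, the following standard-library sketch approximates it by dropping every character whose Unicode category starts with P (punctuation) and then lowercasing the result, mirroring the default lower=True. This is an illustration only and not the package's actual implementation; the function name strip_punct is hypothetical.

```python
import unicodedata


def strip_punct(text: str, replace_with: str = "") -> str:
    """Replace every Unicode punctuation character with replace_with."""
    return "".join(
        replace_with if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


txt = "Hello Educative!!! How are you?"
approx = strip_punct(txt).lower()
print(approx)  # hello educative how are you
```

Passing a non-empty replace_with here plays the same role as the replace_with_punct parameter of clean().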

Code example 2

import cleantext
txt = "Hello Educative!!! 123 How are you? 456"
new_txt = cleantext.clean(txt, no_numbers=True)
print("Original String - '" + txt + "'")
print("Modified String after replacing numbers - '" + new_txt + "'")

Code explanation

  • Line 1: We import the cleantext package.
  • Line 2: We define a string called txt with numbers in it.
  • Line 3: We replace all the numbers in txt with the special NUMBER token by calling the clean method with no_numbers set to True. The result is stored in new_txt.
  • Lines 4-5: We print the original and the modified strings.
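
The no_numbers and no_digits parameters are easy to confuse: the first replaces each whole number with one token, while the second replaces every digit individually. The sketch below illustrates that difference with simple regular expressions from the standard library. It is an approximation under simplified assumptions, not the package's implementation; the function names replace_numbers and replace_digits are hypothetical, and the package's own patterns and processing order (including lowercasing) may differ.

```python
import re


def replace_numbers(text: str, token: str = "<NUMBER>") -> str:
    """Replace each run of digits (optionally with a decimal part) with one token."""
    return re.sub(r"\d+(?:\.\d+)?", token, text)


def replace_digits(text: str, token: str = "0") -> str:
    """Replace every individual digit with the token."""
    return re.sub(r"\d", token, text)


txt = "Hello Educative!!! 123 How are you? 456"
print(replace_numbers(txt))  # Hello Educative!!! <NUMBER> How are you? <NUMBER>
print(replace_digits(txt))   # Hello Educative!!! 000 How are you? 000
```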