What is the clean() method of the clean-text package in Python?
What is the clean-text package?
clean-text is a third-party package that preprocesses text data to obtain a normalized text representation.
The package can be installed via pip. Use the following command to install the clean-text package:
pip install clean-text
clean() method
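The clean() method preprocesses the given text according to the options passed to it. For example, it can fix broken Unicode, convert the text to lowercase, remove punctuation, and replace URLs, emails, phone numbers, numbers, and currency symbols with replacement strings.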
Method signature
clean(
    text,
    fix_unicode=True,
    to_ascii=True,
    lower=True,
    normalize_whitespace=True,
    no_line_breaks=False,
    strip_lines=True,
    keep_two_line_breaks=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_digits=False,
    no_currency_symbols=False,
    no_punct=False,
    no_emoji=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    replace_with_punct="",
    lang="en",
)
Parameters
- text: This is the text to preprocess.
- fix_unicode=True: A boolean value indicating whether to fix broken Unicode.
- to_ascii=True: If this is True, it converts non-ASCII characters into their closest ASCII equivalents.
- lower=True: If this is True, it converts the text to lowercase.
- no_line_breaks=False: If this is True, it strips the line breaks from the text.
- no_urls=False: A boolean value indicating whether to replace all the URLs in the text with a special URL token.
- no_emails=False: A boolean value indicating whether to replace all the emails in the text with a special EMAIL token.
- no_phone_numbers=False: A boolean value indicating whether to replace all the phone numbers in the text with a special PHONE token.
- no_numbers=False: A boolean value indicating whether to replace all the numbers in the text with a special NUMBER token.
- no_digits=False: A boolean value indicating whether to replace all the digits in the text with a special DIGIT token.
- no_currency_symbols=False: A boolean value indicating whether to replace all the currency symbols in the text with a special CURRENCY token.
- no_punct=False: A boolean value indicating whether to remove all the punctuation in the text.
- replace_with_url="<URL>": This is the special URL token. The default value is <URL>.
- replace_with_email="<EMAIL>": This is the special EMAIL token. The default value is <EMAIL>.
- replace_with_phone_number="<PHONE>": This is the special PHONE token. The default value is <PHONE>.
- replace_with_number="<NUMBER>": This is the special NUMBER token. The default value is <NUMBER>.
- replace_with_digit="0": This is the special DIGIT token. The default value is 0.
- replace_with_currency_symbol="<CUR>": This is the special CURRENCY token. The default value is <CUR>.
- replace_with_punct="": Punctuation is replaced with this string. The default value is an empty string.
- lang="en": This parameter specifies the language of the text, which determines the type of preprocessing applied. The default value is English ('en'). Other than English, only German ('de') is supported.
Return value
The method returns the cleaned text as a string, processed according to the parameters passed.
Code example 1
import cleantext

txt = "Hello Educative!!! How are you?"

new_txt = cleantext.clean(txt, no_punct=True)

print("Original String - '" + txt + "'")

print("Modified String after removing punctuations - '" + new_txt + "'")
Code explanation
- Line 1: We import the cleantext package.
- Line 3: We define a string called txt that contains punctuation.
- Line 5: We remove all the punctuation in txt using the clean method, passing no_punct as True. The result is stored in new_txt.
- Lines 7-9: We print the original and the modified strings.
Code example 2
import cleantext

txt = "Hello Educative!!! 123 How are you? 456"

new_txt = cleantext.clean(txt, no_numbers=True)

print("Original String - '" + txt + "'")

print("Modified String after replacing numbers - '" + new_txt + "'")
Code explanation
- Line 1: We import the cleantext package.
- Line 3: We define a string called txt with numbers in it.
- Line 5: We replace all the numbers in txt with the special NUMBER token using the clean method and passing no_numbers as True. The result is stored in new_txt.
- Lines 7-9: We print the original and the modified strings.