What is to_ascii_unicode() in the clean-text package in Python?
What is the clean-text package?
clean-text is a third-party package used to pre-process text data to obtain a normalized text representation.
The package can be installed via pip. Run the following command to install the clean-text package:
pip install clean-text
The to_ascii_unicode() function
The to_ascii_unicode() function converts Unicode text data to ASCII text data. The text data can also contain emojis.
Syntax
to_ascii_unicode(text, lang="en", no_emoji=False)
Parameters
text: The text data to convert from Unicode to ASCII.
lang: The language the text data is in. The default value is "en" (English).
no_emoji: A boolean value indicating whether to remove emojis from the text. By default, the value is False.
Return value
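The function returns a string in which the Unicode characters are replaced with their closest ASCII equivalents. Emojis are kept unless no_emoji is True.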
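Because emojis can be retained, the returned string may not always be pure ASCII. One way to check is a small helper like the one below. This is a minimal sketch that uses only built-in Python; the name non_ascii_chars is hypothetical and is not part of clean-text.

```python
def non_ascii_chars(text):
    """Return the characters in text that fall outside the ASCII range (0-127)."""
    return [ch for ch in text if ord(ch) > 127]


# A plain ASCII string has nothing to report.
print(non_ascii_chars("hello edpresso"))  # []

# "\u00e9" is an accented e and "\U0001F648" is the see-no-evil monkey emoji.
print(non_ascii_chars("h\u00e9llo \U0001F648"))  # ['é', '🙈']
```

Passing the return value of to_ascii_unicode() to this helper shows which characters, if any, remain outside ASCII after the conversion.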
Code
Let’s look at the code below:
import cleantext

unicode_str_1 = "ḧ̤ë̤l̤̈l̤̈ö̤ ë̤d̤̈p̤̈r̤̈ë̤s̤̈s̤̈ö̤"
ascii_converted_str_1 = cleantext.to_ascii_unicode(unicode_str_1)
print("Unicode String - " + unicode_str_1 + " ; ASCII String - " + ascii_converted_str_1)

unicode_str_2 = "🤔 h́éĺĺό éd́úćát́ív́é🙈"
ascii_converted_str_2 = cleantext.to_ascii_unicode(unicode_str_2, lang="en", no_emoji=False)
print("Unicode String - " + unicode_str_2 + " ; ASCII String (retain emojis) - " + ascii_converted_str_2)

ascii_converted_str_3 = cleantext.to_ascii_unicode(unicode_str_2, lang="en", no_emoji=True)
print("Unicode String - " + unicode_str_2 + " ; ASCII String (discard emojis) - " + ascii_converted_str_3)
Explanation
- Line 1: We import the cleantext package.
- Line 3: We define a Unicode string called unicode_str_1.
- Line 4: We convert unicode_str_1 to ASCII using the to_ascii_unicode() function. The ASCII string is called ascii_converted_str_1.
- Line 5: We print unicode_str_1 and ascii_converted_str_1.
- Line 7: We define a Unicode string with emojis called unicode_str_2.
- Line 8: We convert unicode_str_2 to ASCII using the to_ascii_unicode() function, keeping the emojis. The ASCII string is called ascii_converted_str_2.
- Line 9: We print unicode_str_2 and ascii_converted_str_2.
- Line 11: We convert unicode_str_2 to ASCII using the to_ascii_unicode() function with English as the language. We pass True as the value of the no_emoji parameter to remove emojis from the text. The ASCII string is called ascii_converted_str_3.
- Line 12: We print unicode_str_2 and ascii_converted_str_3.
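To build intuition for what a Unicode-to-ASCII conversion involves, here is a rough sketch that uses only Python's standard library. It is not how clean-text is implemented, and the helper name strip_accents is hypothetical. It only removes accents and discards everything else it cannot represent, so clean-text's own conversion is more thorough and may give different results.

```python
import unicodedata


def strip_accents(text):
    """Approximate an ASCII conversion by removing accents (illustrative only)."""
    # NFKD normalization splits an accented letter into its base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Discard anything that is still outside ASCII, such as emojis.
    return without_marks.encode("ascii", "ignore").decode("ascii")


# "\u00e9" is an accented e and "\u00f6" is an o with a diaeresis.
print(strip_accents("h\u00e9ll\u00f6"))  # hello
```

Unlike to_ascii_unicode() with no_emoji=False, this sketch always discards emojis, because the final encoding step drops every character that has no ASCII form.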