What is to_ascii_unicode() in the clean-text package in Python?

What is the clean-text package?

clean-text is a third-party package used to pre-process text data to obtain a normalized text representation.

The package can be installed with pip using the following command:

pip install clean-text
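
After installing, it can help to confirm that Python can locate the module. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of clean-text. Note that the pip name is clean-text, while the module imported in the code on this page is cleantext.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(module_name) is not None


# The pip distribution is "clean-text"; the import name is "cleantext".
print("cleantext available:", is_installed("cleantext"))
```

If this prints False, the install most likely went into a different Python environment than the one running the script.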

to_ascii_unicode() method

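The to_ascii_unicode() method converts Unicode text to its closest ASCII representation, for example by replacing accented letters with their unaccented forms. The input text can also contain emojis, which are retained or removed depending on the no_emoji parameter.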
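
To get a feel for what converting Unicode to ASCII means, here is a rough standard-library sketch that strips accents by decomposing characters. This is an illustration only and not the clean-text implementation; the function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() is non-zero for combining marks, so skip those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("héllo édpresso"))  # prints: hello edpresso
```

This approach only handles characters that decompose into a base letter plus marks. Letters without a decomposition, such as ø, pass through unchanged, which is one reason to use a dedicated package instead.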

Syntax

to_ascii_unicode(text, lang="en", no_emoji=False)

Parameters

  • text: The text data to convert from Unicode to ASCII.
  • lang: The language of the text data. The default value is "en" (English).
  • no_emoji: A boolean value indicating whether to remove emojis from the text. The default value is False, which retains emojis.

Return value

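The method returns the converted text as a string. Emojis retained with no_emoji=False remain in the output, so the result is pure ASCII only when the input contains no emojis or no_emoji is True.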
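
Since emojis may be retained in the result, it can be useful to check what a converted string actually contains. The sketch below uses only built-in Python features; the helper name non_ascii_chars and the sample strings are hypothetical stand-ins for real converted output.

```python
def non_ascii_chars(text: str) -> list:
    """Return the characters in text that fall outside the ASCII range."""
    return [ch for ch in text if ord(ch) > 127]


# Stand-in values for strings returned by a conversion step.
fully_converted = "hello educative"
with_emoji = "hello 🙈"

print(fully_converted.isascii())    # prints: True
print(with_emoji.isascii())         # prints: False
print(non_ascii_chars(with_emoji))  # prints: ['🙈']
```

str.isascii() is available from Python 3.7 onward.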

Code

Let’s look at the code below:

import cleantext
# Convert a string whose letters carry combining diacritics
unicode_str_1 = "ḧ̤ë̤l̤̈l̤̈ö̤ ë̤d̤̈p̤̈r̤̈ë̤s̤̈s̤̈ö̤"
ascii_converted_str_1 = cleantext.to_ascii_unicode(unicode_str_1)
print("Unicode String - " + unicode_str_1 + " ; ASCII String - " + ascii_converted_str_1)
# Convert a string with emojis, keeping the emojis (no_emoji=False)
unicode_str_2 = "🤔 h́éĺĺό éd́úćát́ív́é🙈"
ascii_converted_str_2 = cleantext.to_ascii_unicode(unicode_str_2, lang="en", no_emoji=False)
print("Unicode String - " + unicode_str_2 + " ; ASCII String (retain emojis) - " + ascii_converted_str_2)
# Convert the same string, discarding the emojis (no_emoji=True)
ascii_converted_str_3 = cleantext.to_ascii_unicode(unicode_str_2, lang="en", no_emoji=True)
print("Unicode String - " + unicode_str_2 + " ; ASCII String (discard emojis) - " + ascii_converted_str_3)

Explanation

  • Line 1: We import the cleantext package.

  • Line 3: We define a Unicode string called unicode_str_1.

  • Line 4: We convert unicode_str_1 to ASCII using the to_ascii_unicode() method. The ASCII string is called ascii_converted_str_1.

  • Line 5: We print ascii_converted_str_1 and unicode_str_1.

  • Line 7: We define a Unicode string with emojis called unicode_str_2.

  • Line 8: We convert unicode_str_2 to ASCII using the to_ascii_unicode() method. The ASCII string is called ascii_converted_str_2.

  • Line 9: We print ascii_converted_str_2 and unicode_str_2.

  • Line 11: We convert unicode_str_2 to ASCII using the to_ascii_unicode() method with English as the language. We pass True as a value to the no_emoji parameter to remove emojis from the text. The ASCII string is called ascii_converted_str_3.

  • Line 12: We print ascii_converted_str_3 and unicode_str_2.
