What is the fix_bad_unicode() method of the clean-text package in Python?

What is the clean-text package?

clean-text is a third-party package that is used to pre-process text data to obtain a normalized text representation.

The package can be installed via pip using the following command:

pip install clean-text

The fix_bad_unicode() method

The fix_bad_unicode() method repairs broken Unicode text with the help of the ftfy package. This includes fixing mojibake (text that was decoded with the wrong character encoding), HTML entities, other code cruft, and non-standard forms used for display purposes.
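To make the term mojibake concrete, here is a minimal standard-library sketch of how such text typically arises and why it can be reversed. It does not use clean-text or ftfy. The helper names make_mojibake and undo_mojibake are hypothetical, and the Windows-1252 mis-decoding is only one common cause, assumed here for illustration.

```python
import html


def make_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_mojibake(text: str) -> str:
    """Reverse the mistake above: re-encode as Windows-1252, decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "✔ No problems"
broken = make_mojibake(original)    # the check mark becomes three stray characters
repaired = undo_mojibake(broken)    # the round trip restores the original text

print(broken)
print(repaired)

# HTML entities are another kind of cruft; the standard library can decode them.
print(html.unescape("&lt;b&gt;bold&lt;/b&gt;"))
```

A fixed round trip like this only works when the wrong encoding is known in advance. A library such as ftfy is useful because it tries to detect which mistake was made before undoing it.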

Method signature

fix_bad_unicode(text, normalization="NFC")

Parameters

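  • text: The text to fix, as a string.
  • normalization: The Unicode normalization form to apply to the result. The default, "NFC", combines characters and diacritics written as separate code points into single characters without changing the meaning of the text. "NFKC" applies additional compatibility normalizations that can change meaning, for example, replacing an ellipsis character with three periods.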
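The effect of a normalization form can be seen with the standard library's unicodedata module, which implements the forms defined by the Unicode standard. This sketch does not call clean-text. It only illustrates what the "NFC" and "NFKC" forms do, and the two sample strings are chosen here for demonstration.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by a combining acute accent (two code points)
ellipsis = "\u2026"      # the single-character horizontal ellipsis

# NFC composes the base letter and the accent into one code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(nfc))   # 2 1

# NFC leaves the ellipsis character untouched...
print(unicodedata.normalize("NFC", ellipsis) == ellipsis)   # True

# ...while NFKC replaces it with three ASCII periods.
nfkc = unicodedata.normalize("NFKC", ellipsis)
print(nfkc)   # ...
```

"NFC" is usually the safer choice because it preserves meaning. "NFKC" is more aggressive and suits cases such as search indexing, where visually similar characters should compare equal.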

Return value

The method returns the text as a string with the broken Unicode repaired.

Code

import cleantext

string = "âœ” No problems"

fixed_string = cleantext.fix_bad_unicode(string)

print("Original String - '" + string + "'")
print("Fixed String - '" + fixed_string + "'")

Code explanation

  • Line 1: We import the cleantext module, which the clean-text package provides.
  • Line 3: We define a string with bad Unicode characters in it.
  • Line 5: We correct the bad Unicode characters in the string using the fix_bad_unicode() method.
  • Lines 7–8: We print the original and fixed strings.