What is the fix_bad_unicode() method of the clean-text package in Python?
What is the clean-text package?
clean-text is a third-party package that is used to pre-process text data to obtain a normalized text representation.
The package can be installed via pip with the following command:
pip install clean-text
The fix_bad_unicode() method
The fix_bad_unicode() method repairs broken Unicode text with the help of the ftfy package. The repairs include fixing mojibake (text that was decoded with the wrong encoding), HTML entities, other code cruft, and non-standard forms used for display purposes.
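To make the term mojibake concrete, here is a minimal sketch that uses only the Python standard library. It is not how ftfy or clean-text work internally. It assumes we already know the text was UTF-8 that got decoded as Windows-1252, which a real fixer has to detect on its own. The variable names are illustrative.

```python
import html

original = "✔ No problems"

# Simulate mojibake: the UTF-8 bytes are wrongly decoded as Windows-1252.
broken = original.encode("utf-8").decode("cp1252")
print(broken)

# Undo the mistake by hand. This works only because we know
# which wrong codec was applied (an assumption of this sketch).
repaired = broken.encode("cp1252").decode("utf-8")
print(repaired)

# HTML entities are another kind of cruft; the standard library decodes them.
unescaped = html.unescape("&lt;b&gt;bold&lt;/b&gt; &amp; more")
print(unescaped)
```

The manual round trip shows why a dedicated tool helps: in practice the wrong codec is not known in advance, and different parts of a text may be broken in different ways.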
Method signature
fix_bad_unicode(text, normalization="NFC")
Parameters
- text: The text data.
- normalization: The type of Unicode normalization to apply. The default is "NFC".
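The following sketch illustrates what a Unicode normalization form does, using the standard library's unicodedata module rather than clean-text itself. It assumes that the normalization parameter takes the standard form names ("NFC", "NFD", "NFKC", "NFKD"); only "NFC" appears in the signature above.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings look the same when displayed but are not equal.
print(composed == decomposed)

# NFC composes characters; NFD decomposes them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(len(nfc), len(nfd))

# The compatibility forms also replace display variants such as ligatures.
print(unicodedata.normalize("NFKC", "\ufb01"))  # the "fi" ligature
```

Normalizing to a single form means strings that look identical also compare as equal, which matters for deduplication and lookups.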
Return value
The method returns the repaired Unicode text as a string.
Code
import cleantext

string = "âœ” No problems"

fixed_string = cleantext.fix_bad_unicode(string)

print("Original String - '" + string + "'")
print("Fixed String - '" + fixed_string + "'")
Code explanation
- Line 1: The clean-text package is imported as cleantext.
- Line 3: We define a string with bad Unicode characters in it.
- Line 5: The bad Unicode characters in the string are corrected using the fix_bad_unicode() method.
- Lines 7–8: We print the original and modified strings.