How to sanitize user input in Python
Sanitizing user input is a critical step in development. Because our clients come from all over the world and any of them can submit arbitrary data, we cannot assume that input is trustworthy. It's the developer's responsibility to keep malicious input out of the program so that the service continues to function properly.
We must take the following two precautions to ensure that the input is valid and secure for our system:
Input validation: Ensuring that the input is well-formed and in the expected structure.
Input sanitization: Cleaning or transforming the input so that the data is semantically and logically correct and safe to use in the system's workflow.
Note: The implementation of the above measures differs based on the technology and context of the service. For example, input sanitization for web applications may involve stripping HTML and JavaScript tags from user input, while IoT devices may need to sanitize input data from sensors or other sources.
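To make the difference between the two precautions concrete, here is a minimal sketch using only the standard library. The function names validate_age and sanitize_comment, and the accepted age range, are assumptions made for illustration and are not part of any framework.

```python
import html


def validate_age(raw: str) -> int:
    """Validation: reject input that is not a whole number in a plausible range."""
    text = raw.strip()
    # isascii() guards against non-ASCII digit characters that isdigit() accepts
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"age must be a whole number, got {raw!r}")
    age = int(text)
    if not 0 < age < 130:  # assumed range, adjust to the use case
        raise ValueError(f"age out of range: {age}")
    return age


def sanitize_comment(raw: str) -> str:
    """Sanitization: transform the input so it is safe to embed in HTML."""
    return html.escape(raw.strip())


print(validate_age(" 42 "))                  # 42
print(sanitize_comment("<b>Nice post</b>"))  # &lt;b&gt;Nice post&lt;/b&gt;
```

Validation rejects bad input outright by raising an error, while sanitization accepts the input and rewrites it into a safe form.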
Let's see how to sanitize user input using different techniques available in Python.
Sanitizing user input in Python
We can remove unnecessary or malformed data from our input using different techniques, some of which are listed in the table below:
Technique | Description
Escape characters | Escape special characters in the input using html.escape()
Third-party libraries | Sanitize input using third-party libraries and frameworks, such as bleach
Regular expressions | Allow only expected data by blocklisting or allowlisting inputs, such as using the re module
Let's take a simple example from each of the above categories and implement it in code for better understanding.
Sanitizing using html.escape()
We use the html module of Python to escape characters that have special meanings in HTML.
import html

def sanitize_input(input_str):
    # Escaping special characters in HTML
    sanitized_str = html.escape(input_str)
    return sanitized_str

user_input = '<span>Hi!</span>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Explanation
Lines 3–6: We define sanitize_input, which returns the sanitized data after processing it.
Line 5: We call the html.escape() function to replace each character that has a special meaning in HTML with its escaped equivalent.
Line 8: We store the input data from the user in user_input.
Line 9: We call the sanitize_input function to process the input data.
Line 10: We display the sanitized output. The output contains &lt;span&gt;Hi!&lt;/span&gt; instead of <span>Hi!</span>, which means the special characters were replaced successfully.
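As a further illustration of what html.escape() replaces, the sketch below compares its default behavior with quote=False and reverses the escaping with html.unescape(). The sample string is made up for the example.

```python
import html

user_input = 'Tom & "Jerry" <b>it\'s</b>'

# Default (quote=True): &, <, >, " and ' are all escaped
escaped_all = html.escape(user_input)

# quote=False: only &, < and > are escaped; quotes are left as-is
escaped_no_quotes = html.escape(user_input, quote=False)

print(escaped_all)        # Tom &amp; &quot;Jerry&quot; &lt;b&gt;it&#x27;s&lt;/b&gt;
print(escaped_no_quotes)  # Tom &amp; "Jerry" &lt;b&gt;it's&lt;/b&gt;

# html.unescape() converts the entities back to the original characters
restored = html.unescape(escaped_all)
print(restored == user_input)  # True
```

Keeping the default quote=True is the safer choice when the escaped value may end up inside an HTML attribute, since quotes would otherwise close the attribute early.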
Sanitizing using bleach
We use the third-party bleach library to allow only allowlisted HTML tags in the user input.
import bleach

# List of allowed HTML tags
allowed_tags = ['span', 'b']

def sanitize_input(input_str):
    # Allowing only allowlisted tags using bleach library
    sanitized_str = bleach.clean(input_str, tags=allowed_tags)
    return sanitized_str

user_input = '<span>Hi!</span> <style> span { color: #ff0000; } </style>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Explanation
Line 1: We import the bleach library into our code.
Line 4: We define the list of tags that are allowed in the user input.
Lines 6–9: We define sanitize_input, which returns the sanitized data after processing it.
Line 8: We call the bleach.clean() function to allow only the allowlisted tags and replace the rest with their escaped equivalents.
Line 11: We store the input data from the user in user_input.
Line 12: We call the sanitize_input function to process the input data.
Line 13: We display the sanitized output. When the output is rendered in a browser, the text color does not change, because the style tag was escaped and only the allowlisted tags are interpreted as HTML.
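By default, bleach.clean() escapes tags that are not allowlisted. As a variation, the sketch below passes strip=True so that those tags are dropped instead. It assumes bleach is installed (pip install bleach). The fallback to html.escape() when the import fails, and the function name strip_disallowed, are assumptions added for this sketch only.

```python
import html

try:
    import bleach  # third-party package
except ImportError:
    bleach = None  # assumption: fall back to escaping everything

# Tags that may remain in the output
ALLOWED_TAGS = {"span", "b"}


def strip_disallowed(input_str: str) -> str:
    """Drop tags that are not allowlisted instead of escaping them."""
    if bleach is None:
        # Fallback: no tag is trusted, so escape the whole string
        return html.escape(input_str)
    # strip=True removes disallowed tags rather than escaping them
    return bleach.clean(input_str, tags=ALLOWED_TAGS, strip=True)


print(strip_disallowed("<span>Hi!</span> <i>there</i>"))
```

Stripping keeps the displayed text free of escaped markup, whereas escaping preserves exactly what the user typed. Which one fits depends on whether the original input needs to stay visible.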
Sanitizing using re
We use Python's re module to blocklist script tags, removing them from the user input using a regular expression.
import re

def sanitize_input(input_str):
    # Regular expression to blocklist script tags
    sanitized_str = re.sub(r'<script\b[^>]*>(.*?)</script>', '', input_str, flags=re.IGNORECASE)
    return sanitized_str

user_input = '<span>Hi!</span> <script>alert("Hello from script!");</script>'
sanitized_input = sanitize_input(user_input)
print(sanitized_input)
Explanation
Line 1: We import the re module into our code.
Lines 3–6: We define sanitize_input, which returns the sanitized data after processing it.
Line 5: We call the re.sub() function to remove any data that matches the regular expression, which is written as a raw string (the r prefix). The flags=re.IGNORECASE argument tells the re module to ignore the case of characters when matching the regular expression.
Line 8: We store the input data from the user in user_input.
Line 9: We call the sanitize_input function to process the input data.
Line 10: We display the sanitized output. The output only contains <span>Hi!</span>, which means the script tag was removed successfully.
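The example above blocklists one known-bad pattern. The allowlisting counterpart can be sketched with re.fullmatch(), which accepts the input only if the whole string matches an expected pattern. The username rule used here (3 to 20 ASCII letters, digits, or underscores) and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical rule: 3 to 20 ASCII letters, digits, or underscores
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,20}")


def is_allowed_username(input_str: str) -> bool:
    """Return True only if the entire input matches the allowlist pattern."""
    return USERNAME_PATTERN.fullmatch(input_str) is not None


print(is_allowed_username("jane_doe42"))        # True
print(is_allowed_username("<script>alert(1)"))  # False
print(is_allowed_username("ab"))                # False (too short)
```

An allowlist rejects anything that was not anticipated, so new attack patterns do not need to be enumerated the way they do with a blocklist.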