Data Cleaning

Learn how to perform data cleaning in Altair, with a focus on handling missing values, managing duplicates, and manipulating data.

Data cleaning means identifying and correcting inaccuracies and inconsistencies in data so that the data becomes more reliable and easier to work with.

Data cleaning involves the following main aspects:

  • Handling missing values

  • Managing duplicates

  • Manipulating data (formatting, normalization, and standardization)

Altair provides some functions to perform data cleaning. However, in most cases, it is better to clean the data before passing it to Altair, and to use Altair only to render the visualization.

Handling missing values

A missing value is simply a value that is not present in the data. There are many reasons why values might be missing from the data, such as errors in data collection, preprocessing, or intentional omission of values (e.g., for privacy reasons).

Missing values can cause problems when analyzing the data, so it is often desirable to deal with them in some way. One common approach is to remove all rows or columns that contain missing values. However, this can lead to loss of information and may not be appropriate in all cases. Another approach is to impute the missing values, which means replacing them with some estimated value.
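The two approaches above can be sketched with pandas, which fits the advice to clean data before handing it to Altair. This is a minimal illustration, not a recommended pipeline: the DataFrame, its column names, and the choice of the column mean as the estimated value are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: two of the four "value" entries are missing.
df = pd.DataFrame(
    {"year": [2019, 2020, 2021, 2022], "value": [10.0, None, 14.0, None]}
)

# Approach 1: remove every row whose "value" is missing.
dropped = df.dropna(subset=["value"])

# Approach 2: impute, here by replacing missing entries with the column mean.
# The mean is only one possible estimate; median or interpolation are others.
imputed = df.fillna({"value": df["value"].mean()})

print(dropped)
print(imputed)
```

Dropping rows shrinks the dataset, while imputing keeps every row but introduces values that were never observed, so the choice depends on how the data will be read.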

In terms of data storytelling, we can employ two strategies to handle missing values:

  • We can leave them as they are and add an annotation to the chart that explains why they are missing.

  • We can remove them, but only if they fall at the lower or upper boundary of the considered range of values. For example, if we have a dataset covering the period from 1990 to 2023, we’ll remove missing values only if they correspond to 1990 or 2023.
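The second strategy, trimming missing values only at the edges of the range, might look like the following in pandas. This is a sketch under assumptions: the years, the values, and the column names are invented for illustration, and the data is assumed to be sorted by year.

```python
import pandas as pd

# Hypothetical yearly series with gaps at both ends and one in the middle.
df = pd.DataFrame(
    {
        "year": [1990, 1991, 1992, 1993, 1994, 1995],
        "value": [None, 2.0, None, 4.0, 5.0, None],
    }
).set_index("year")

# Locate the first and last years that actually have a value.
first = df["value"].first_valid_index()
last = df["value"].last_valid_index()

# Keep everything between them, so only boundary gaps are removed.
trimmed = df.loc[first:last]

print(trimmed)
```

The missing value inside the range (1992 in this made-up data) is kept on purpose, so that it can be annotated in the chart rather than silently dropped.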

To deal with missing values, Altair provides three mechanisms:

  • The transform_filter() function

  • The impute argument via encodings

  • The transform_impute() function

The transform_filter() function

The transform_filter() function takes a filter expression, such as a condition, as input. Within the transform_filter() function, use alt.datum followed by a column name to refer to that column's value in the current row of the DataFrame. The following code shows how to use the transform_filter() function to drop missing values.
