Missing Data Representation

Learn the different ways in which missing data is represented in pandas.

Introduction

Dealing with missing data is an essential aspect of data analysis. The data we receive is often incomplete, with missing values that need to be managed. Given that missing data can significantly affect the outcomes of our analysis or models, it’s important that we know how to work with missing values so that their negative impact is minimized.

Over the next few lessons, we’ll discover how to leverage the robust methods in pandas to represent, detect, analyze, and manage missing data.

Representation of missing data

Let's start by exploring how missing data is represented and displayed in pandas.

General representations

The two common missing data representations in pandas are NaN (short for “not a number”) and None. NaN is the default missing value indicator because it is fast and convenient to compute with, but it’s important to understand both representations because they differ in their underlying data types.
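As a minimal sketch of that type difference, the snippet below compares the two representations and shows how each one behaves inside a Series. The variable names `numeric` and `text` are illustrative only, and `dtype=object` is passed explicitly so that None is stored unchanged.

```python
import numpy as np
import pandas as pd

# NaN is a float, while None is Python's NoneType singleton.
print(type(np.nan))
print(type(None))

# In a numeric Series, pandas converts None to NaN and uses a float dtype.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)

# In an object-dtype Series, None is stored as-is.
text = pd.Series(["a", None, "c"], dtype=object)
print(text[1] is None)

# isna() treats both representations as missing.
print(numeric.isna().tolist())
print(text.isna().tolist())
```

Both calls to `isna()` flag the middle element, so either representation is detected as missing even though the stored values differ.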

Here are some details about each missing data representation:

  • NaN:

    • A special floating-point value from NumPy that specifically represents missing numerical data.

    • The default missing value marker in pandas for real or floating-point values. It is based on the IEEE 754 floating-point standard.

    • It’s a floating-point value, whereas None is Python’s NoneType object.

    • NaN is contagious in computations, which means that almost any operation involving NaN also results in NaN. For example, adding, subtracting, multiplying, or dividing NaN and another number gives NaN. This behavior is known as the propagation of NaN in mathematical operations, which will be discussed in the next lesson.

    • The following code shows two ways we can generate NaN values:
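A minimal sketch of two common ways to create NaN, assuming the two intended are NumPy's `np.nan` constant and Python's built-in `float("nan")`. The variable names are illustrative only.

```python
import math
import numpy as np

# Way 1: NumPy's NaN constant.
nan_from_numpy = np.nan

# Way 2: Python's built-in float constructor.
nan_from_float = float("nan")

# Both are ordinary Python floats.
print(type(nan_from_numpy), type(nan_from_float))

# NaN never compares equal to anything, itself included,
# so test for it with isnan() rather than ==.
print(nan_from_numpy == nan_from_float)
print(math.isnan(nan_from_numpy), math.isnan(nan_from_float))

# Arithmetic with NaN yields NaN.
total = nan_from_numpy + 1
print(math.isnan(total))
```

Because an equality check against NaN is always false, `math.isnan()`, `np.isnan()`, or `pd.isna()` are the reliable ways to test for it.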
