Missing Data Representation
Learn the different ways in which missing data is represented in pandas.
Introduction
Dealing with missing data is an essential aspect of data analysis. The data we receive is often incomplete, with missing values that need to be managed. Given that missing data can significantly affect the outcomes of our analysis or models, it’s important that we know how to work with missing values so that their negative impact is minimized.
Over the next few lessons, we’ll discover how to leverage the robust methods in pandas to represent, detect, analyze, and manage missing data.
Representation of missing data
Let's start by exploring how missing data is represented and displayed in pandas.
General representations
The two common missing data representations in pandas are NaN (short for "not a number") and None. Although NaN is considered the default missing value indicator for reasons of computational speed and convenience, it’s important to understand both representations because they have some key differences in their underlying data types.
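As a minimal sketch of that data type difference, the snippet below compares the two markers directly and then inside a Series. The variable names are illustrative only, and the object-dtype Series is forced with `dtype=object` so that pandas keeps None as-is rather than converting it.

```python
import numpy as np
import pandas as pd

# NaN is a floating-point value; None is Python's NoneType singleton.
print(type(np.nan))
print(type(None))

# A Series of floats with NaN keeps a fast numeric dtype.
float_series = pd.Series([1.0, np.nan])

# Forcing dtype=object keeps None as a generic Python object.
object_series = pd.Series([1, None], dtype=object)

# Without an explicit dtype, pandas converts None to NaN in numeric data.
converted = pd.Series([1, None])

print(float_series.dtype)
print(object_series.dtype)
print(converted.dtype)

# isna() flags both representations as missing.
print(object_series.isna().tolist())
print(converted.isna().tolist())
```

The numeric dtype is what makes NaN the convenient default: operations on a float64 Series stay vectorized, whereas an object Series falls back to slower Python-level handling.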
Here are some details about each missing data representation:
NaN:
A special floating-point value from NumPy that specifically represents missing numerical data.
The default missing value marker in pandas for real or floating-point values. It is based on the IEEE 754 floating-point standard.
It’s of the floating-point type (rather than a Python object like None).
NaN is contagious in computations, which means that almost any operation involving NaN will also result in NaN. For example, if we perform an arithmetic operation with NaN and another number, the result is always NaN. This phenomenon is also known as the propagation of NaN in mathematical operations, which will be discussed in the next lesson.
The following code shows two ways we can generate NaN values:
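A minimal sketch, assuming the two ways meant here are NumPy's `np.nan` constant and Python's built-in `float("nan")`. The variable names are illustrative only. The last lines also show the propagation behavior described above.

```python
import numpy as np

# Way 1: the NaN constant provided by NumPy.
nan_from_numpy = np.nan

# Way 2: Python's built-in float constructor.
nan_from_float = float("nan")

# Both are ordinary floating-point values.
print(type(nan_from_numpy))
print(type(nan_from_float))

# NaN propagates through arithmetic with other numbers.
print(nan_from_numpy + 1)
print(nan_from_float * 0)

# NaN never compares equal to anything, including itself,
# so use np.isnan (or pandas' isna) to detect it.
print(nan_from_numpy == nan_from_numpy)
print(np.isnan(nan_from_float))
```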