Decode Your Data
Understand the data’s shape, patterns, and questions before analysis.
Before we create any charts or build reports, we need to pause and ask an essential question: “What does this data actually look like, and what might be worth exploring?”
In this lesson, we dive into exploratory data analysis (EDA), that is, our first real conversation with the data. It’s how we start making sense of things. We’re not cleaning anymore; we’ve done that work. Now, we’re here to notice what stands out, what connects, and what’s hiding beneath the surface.
Think of it like meeting the data for the first time. We’re curious, observant, and open to what it might reveal. As we explore, we begin asking:
What variables seem to move together?
What patterns are worth a second look?
What’s the overall shape and structure of the data?
We’re not rushing to conclusions. We’re training our eyes to spot what matters, so we can tell clearer, sharper stories later. Let’s begin exploring with intent, and see what the data starts to tell us.
What is EDA?
Every dataset holds a story, but that story isn’t always immediately clear. Before we create charts, run summaries, or share insights, we need to understand the data’s structure, patterns, and oddities. That’s where exploratory data analysis (EDA) comes in. It’s how data analysts get familiar with the data, uncover what’s worth highlighting, and spot anything that might affect the integrity of the analysis.
Think of EDA like opening the first chapter of a mystery novel. We’re not solving the case yet. We’re getting familiar with the characters (our variables), checking for surprises (like missing values or strange outliers), and trying to understand the setting (how the data is shaped).
Informational note: The term exploratory data analysis (EDA) was coined by statistician John Tukey in the 1970s. His idea? Don’t just crunch numbers, explore them visually and intuitively to spark questions before conclusions.
The key steps of exploratory data analysis
Let’s break down the key steps of exploratory data analysis: what we actually do when we explore data, from summarizing distributions to spotting relationships.
Get to know the data basics: We start by taking a quick look at the data’s structure (its rows and columns) and gathering summary statistics. This helps us understand what kinds of variables we have and how much data we’re dealing with.
Explore variables one at a time (univariate analysis): Next, we examine each variable individually. We want to understand its distribution, common values, and any oddities like outliers or missing data.
Look at relationships between variables (bivariate and multivariate analysis): Then, we study how variables interact. Are some variables correlated? Are there patterns when we group data by categories? This step uncovers connections that can be important for deeper analysis.
Visualize the data: Visualization plays a huge role in EDA. Charts like histograms, box plots, scatterplots, and bar plots help us see patterns, spot anomalies, and communicate findings clearly.
Iterate between exploring and cleaning: As we explore, we often find data issues, like missing values, inconsistencies, or errors. We then clean or transform the data, and revisit the exploration. This iterative cycle continues until the data is well understood and ready.
Use insights to guide deeper analysis: Finally, the insights from EDA help us ask better questions, select features, and build more effective analyses.
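As a preview, the univariate and bivariate steps above might look like the following sketch in pandas. The data is a small inline sample (values borrowed from the dataset used later in this lesson) so the snippet runs on its own:

```python
import pandas as pd

# Small inline sample; any DataFrame with categorical and numeric columns works the same way
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "income": [50000, 60000, 58000, 52000, 48000],
    "purchase_amount": [200, 150, 300, 250, 100],
})

# Univariate: look at one variable's distribution at a time
print(df["gender"].value_counts())
print(df["income"].describe())

# Bivariate: do numeric variables move together?
print(df[["income", "purchase_amount"]].corr())

# Grouped comparison: average purchase amount per category
print(df.groupby("gender")["purchase_amount"].mean())
```

Each call answers one of the questions above: frequencies for a categorical variable, summary statistics for a numeric one, a correlation matrix for pairs, and a grouped comparison across categories.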
Fun fact: Many data analysts say EDA is like a conversation with the data: the more we ask, the more it reveals!
With the key steps in mind, we begin where every good analyst begins: with a quick initial scan of the data.
Quick inspection
The first thing we do with any new dataset is take a look around. A few simple checks can tell us a lot: what kind of variables we’re dealing with, how the data is structured, and what might be worth visualizing.
We’re looking for answers to questions like:
Are we working with categories, numbers, or both?
What does the data structure look like?
Which columns might be worth comparing or breaking down visually?
Variable types
Understanding variable types in a dataset is a crucial first step in exploratory data analysis (EDA). It affects everything, from how we summarize values to how we visualize patterns or detect issues. Misclassify a variable, and we risk making incorrect assumptions. For instance, treating dates as text can break time-based plots or sort events out of order, like showing December before March.
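A tiny sketch of that failure mode: month names sorted as text come out alphabetically, while the same values parsed as real dates sort chronologically.

```python
import pandas as pd

months = ["March", "December", "January"]

# Sorted as plain strings: alphabetical order, so December lands before March
print(sorted(months))  # ['December', 'January', 'March']

# Parsed as dates (month names via the %B format), they sort chronologically
parsed = pd.to_datetime(months, format="%B")
print(parsed.sort_values().strftime("%B").tolist())  # ['January', 'March', 'December']
```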
Fun fact: Misinterpreting a variable type is like trying to measure temperature with a ruler: we’ll get numbers, but they won't make sense!
Here are the three most common variable types we typically encounter:
1. Categorical
Categorical variables group data into distinct categories or labels. To analyze them effectively, bar charts, count plots, and pie charts are commonly used. Correctly identifying categorical variables helps avoid mistakes like calculating meaningless averages and ensures accurate grouping and comparison.
They can be:
Nominal: No natural order (e.g., gender, payment method).
Ordinal: Have a logical order (e.g., education level: High School < Bachelor < Master).
| Variable Name | Example Values |
|---|---|
| gender | M, F |
| payment_method | Credit Card, Cash |
| education_level | High School, Bachelor, Master |
We can count how often each value appears, but we don’t calculate averages. These variables are essential for grouping, filtering, or comparing categories in the data.
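For example, counting category frequencies with `value_counts()`, using gender values from the small dataset introduced later in this lesson:

```python
import pandas as pd

# Sample categorical column (values taken from this lesson's dataset)
gender = pd.Series(["M", "F", "F", "M", "F"], name="gender")

# Frequencies, not averages: the right summary for categorical data
print(gender.value_counts())
# F    3
# M    2
```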
2. Numerical
Numerical variables represent measurable quantities that can be counted or measured. Common visualizations for numerical data include histograms, box plots, and scatter plots. Correctly identifying numerical variables ensures we apply appropriate statistical summaries like averages, and avoid treating them as categories.
Numerical variables can be:
Continuous: Can take any value within a range (e.g., income, height).
Discrete: Countable values, often integers (e.g., number of purchases, age in years).
| Variable Name | Example Values |
|---|---|
| income | 50000, 60000, 58000 |
| age | 25, 40, 35 |
| purchase_amount | 200, 150, 300 |
These variables are essential for calculations, trend analysis, and examining relationships between variables.
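A quick sketch of the numeric summaries that make sense here, using the income values from this lesson's sample data:

```python
import pandas as pd

income = pd.Series([50000, 60000, 58000, 52000, 48000], name="income")

# Averages and spread are meaningful for numerical variables
print(income.mean())               # 53600.0
print(income.median())             # 52000.0
print(income.min(), income.max())  # 48000 60000
```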
3. Datetime
Datetime variables represent dates and times, showing order, duration, or specific moments. Line charts and time series plots are best for visualizing datetime data. Correctly recognizing datetime variables helps us analyze trends over time and avoid errors like sorting dates as text. Here are some examples of datetime variables:
| Variable Name | Example Values |
|---|---|
| purchase_date | 2023-05-01, 2023-06-15 |
| last_login | 2023-05-01 09:30:00, 2023-05-02 14:45:12 |
Datetime variables track when events occur, durations, or timestamps. They enable time-based grouping and forecasting.
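Datetime handling might look like the following sketch; the column names and dates here are illustrative, not taken from the lesson's dataset:

```python
import pandas as pd

# Hypothetical purchase dates stored as strings
df = pd.DataFrame({
    "purchase_date": ["2023-05-01", "2023-06-15", "2023-05-20"],
    "purchase_amount": [200, 150, 300],
})

# Convert text to real datetimes so time-based operations work
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# Time-based grouping: total purchases per month
monthly = df.groupby(df["purchase_date"].dt.to_period("M"))["purchase_amount"].sum()
print(monthly)
```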
Scanning the dataset
Let’s explore five quick tools in pandas that help us scan any new dataset. These tools give us an immediate sense of structure, size, and variable types, all of which are critical for deciding how to explore and visualize the data.
1. `head()`—What’s in this data?

The `head()` function shows the first few rows of the dataset, providing a quick look at the values in each column. This helps us guess if a column holds categories (like “Male or M”), numbers (like “25”), or dates (like “2023-05-01”).
```
customer_id,age,income,gender,purchase_amount
1,25,50000,M,200
2,40,60000,F,150
3,35,58000,F,300
4,50,52000,M,250
5,23,48000,F,100
```
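In the lesson's environment this file would be loaded with `pd.read_csv("data.csv")`; in the sketch below the same contents are inlined so the snippet runs on its own:

```python
import pandas as pd
from io import StringIO

# Inline copy of data.csv so this snippet is self-contained
csv_text = """customer_id,age,income,gender,purchase_amount
1,25,50000,M,200
2,40,60000,F,150
3,35,58000,F,300
4,50,52000,M,250
5,23,48000,F,100
"""

df = pd.read_csv(StringIO(csv_text))

# First five rows
print(df.head())
```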
Calling `df.head()` displays the first five rows of the dataset, giving us a quick look at columns like `customer_id`, `age`, `income`, `gender`, and `purchase_amount`.
2. `shape`—How much data is there?

The `shape` attribute reveals the number of rows and columns in the dataset. Knowing the size helps decide if we can plot raw data or need to summarize or sample it first for clearer visuals.
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Rows and columns
print(df.shape)
```
Calling `df.shape` gives `(5, 5)`, indicating 5 records with 5 features each.
3. `dtypes`—What type is each variable?

The `dtypes` attribute shows the data type of each column (`object`, `int64`, `float64`, `datetime64`). This is key to understanding how pandas interprets each variable and whether conversions (e.g., strings to dates) are needed.
Understanding data types is essential because it tells us how pandas currently interprets the data and informs the types of visualizations that make sense. For example, `object` often means categorical data stored as strings, while `int64` and `float64` represent numerical values.

Datetime data might appear as `datetime64`, but sometimes it’s stored as an object type and requires conversion for proper time-based analysis.
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Data type of each column
print(df.dtypes)
```
Calling `df.dtypes` returns a pandas Series whose values are the data types of each DataFrame column; the Series itself is labeled with `dtype: object`.
Fun fact: `dtypes` helps us speak the data’s language; without it, we might try to do math with words!
4. `info()`—What are we working with?

The `info()` method summarizes column data types, counts non-null values, highlights missing data, and shows memory usage. This overview helps us assess data completeness, confirm variable types, and understand the dataset’s size before analysis and visualization.
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Data types and missing values (info() prints its report directly)
df.info()
```
For instance, a column with type `object` may represent labels or categories, while an `int64` column likely holds numeric values, which are candidates for histograms, box plots, or scatter plots.
5. `describe()`—What are the summary statistics?

The `describe()` method provides summary stats: for numeric data, it shows count, mean, standard deviation, min, max, and quartiles; for categorical data, frequency counts. This helps spot outliers, skewed distributions, and unusual value ranges at a glance.
```python
import pandas as pd

df = pd.read_csv("data.csv")

# Summary statistics for numeric columns
print(df.describe())
```
Calling `df.describe()` gives a statistical summary of numeric columns, including count, mean, standard deviation, min, and quartiles. For instance, it shows that the average income is 53,600.
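One caveat worth knowing: on a DataFrame with mixed column types, `describe()` defaults to numeric columns only, so the categorical frequency counts mentioned above need `include="object"`. A minimal sketch, using inline sample data rather than `data.csv` so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "income": [50000, 60000, 58000, 52000, 48000],
})

# Default: numeric columns only
print(df.describe())

# Categorical summary: count, unique, top (most frequent value), freq
print(df.describe(include="object"))
```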
Asking the right questions early
After identifying our variable types and scanning the dataset, the next step is to ask smart, focused questions. These questions guide our exploration, help uncover patterns, and shape the story the data wants to tell.
Informational note: Great dashboards or charts start with great questions. Analysts at top companies often run EDA sessions in teams to align on what to explore, before building any visual.
Here are some common questions we can ask:
Which variables could influence our outcome later?
Thinking ahead about which variables might explain patterns or differences in the data helps us ground our exploration. By identifying potentially influential factors early, we can examine them more closely, compare them across groups, and better understand their role in the bigger picture.

What relationships or comparisons might be worth plotting?
Looking for patterns across categories, time periods, or groups can reveal important trends and anomalies. Visualizing distributions, correlations, and differences between groups provides insights that raw data alone might miss. Correlations describe how two variables move in relation to each other; a strong correlation suggests that when one variable changes, the other tends to change in a predictable way.

Are there unexpected or inconsistent values?
During EDA, visualizations often reveal data quality issues, such as missing values where data should exist or inconsistencies in data types. Spotting these early helps prevent misleading conclusions and guides necessary cleaning steps before further analysis.

Which columns might be interesting to explore further?
Variables showing wide variation, unusual distributions, or unexpected values often highlight areas worth deeper investigation. These insights help focus the analysis on the most informative parts of the dataset.

Which values stand out as outliers or anomalies?
Outliers might represent important insights or data errors. Distinguishing between genuinely unusual points and errors that need removal is key to maintaining data integrity.
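A common, simple way to flag candidate outliers is the interquartile range (IQR) rule; here is a minimal sketch, assuming a purely numeric series with one made-up extreme value:

```python
import pandas as pd

values = pd.Series([200, 150, 300, 250, 100, 5000])  # one suspicious value

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged for review,
# not automatically deleted: they may be errors or genuine extremes
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [5000]
```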
These early questions aren’t just a checklist; they’re how we begin transforming raw data into insight. By asking them thoughtfully, we start noticing patterns, discovering outliers, and generating hypotheses. In essence, this is where data exploration becomes data storytelling.
Wrap-up
In this lesson, we learned that getting to know our data isn’t just a technical step: it’s a mindset. By understanding variable types, we choose the right tools to explore. By asking thoughtful early questions, we shape our path toward discovery.
Exploratory data analysis is like meeting a new teammate. The more time we spend understanding what the data says (and what it doesn’t), the better decisions we’ll make down the line. Patterns become clearer. Problems become solvable. Stories begin to take shape.
Technical Quiz
Which function in pandas gives us a quick preview of the first few rows of a DataFrame?
shape()
info()
head()
describe()