Decode Your Data

Understand the data’s shape, patterns, and questions before analysis.

Before we create any charts or build reports, we need to pause and ask an essential question: “What does this data actually look like, and what might be worth exploring?”

In this lesson, we dive into exploratory data analysis (EDA), that is, our first real conversation with the data. It’s how we start making sense of things. We’re not cleaning anymore; we’ve done that work. Now, we’re here to notice what stands out, what connects, and what’s hiding beneath the surface.

Think of it like meeting the data for the first time. We’re curious, observant, and open to what it might reveal. As we explore, we begin asking:

  • What variables seem to move together?

  • What patterns are worth a second look?

  • What’s the overall shape and structure of the data?

We’re not rushing to conclusions. We’re training our eyes to spot what matters, so we can tell clearer, sharper stories later. Let’s begin exploring with intent, and see what the data starts to tell us.

What is EDA?

Every dataset holds a story, but that story isn’t always immediately clear. Before we create charts, run summaries, or share insights, we need to understand the data’s structure, patterns, and oddities. That’s where exploratory data analysis (EDA) comes in. It’s how data analysts get familiar with the data, uncover what’s worth highlighting, and spot anything that might affect the integrity of the analysis.

Think of EDA like opening the first chapter of a mystery novel. We’re not solving the case yet. We’re getting familiar with the characters (our variables), checking for surprises (like missing values or strange outliers), and trying to understand the setting (how the data is shaped).

Informational note: The term exploratory data analysis (EDA) was coined by statistician John Tukey in the 1970s. His idea? Don’t just crunch numbers, explore them visually and intuitively to spark questions before conclusions.

The key steps of exploratory data analysis

Let’s break down the key steps of exploratory data analysis: what we actually do when we explore data, from summarizing distributions to spotting relationships.

  1. Get to know the data basics: We start by taking a quick look at the data’s structure (its rows and columns) and gathering summary statistics. This helps us understand what kinds of variables we have and how much data we’re dealing with.

  2. Explore variables one at a time (univariate analysis): Next, we examine each variable individually. We want to understand its distribution, common values, and any oddities like outliers or missing data.

  3. Look at relationships between variables (bivariate and multivariate analysis): Then, we study how variables interact. Are some variables correlated? Are there patterns when we group data by categories? This step uncovers connections that can be important for deeper analysis.

  4. Visualize the data: Visualization plays a huge role in EDA. Charts like histograms, box plots, scatterplots, and bar plots help us see patterns, spot anomalies, and communicate findings clearly.

  5. Iterate between exploring and cleaning: As we explore, we often find data issues, like missing values, inconsistencies, or errors. We then clean or transform the data, and revisit the exploration. This iterative cycle continues until the data is well understood and ready.

  6. Use insights to guide deeper analysis: Finally, the insights from EDA help us ask better questions, select features, and build more effective analyses.

Fun fact: Many data analysts say EDA is like a conversation with the data: the more we ask, the more it reveals!

With the key steps in mind, we begin where every good analyst begins: with a quick initial scan of the data.

Quick inspection

The first thing we do with any new dataset is take a look around. A few simple checks can tell us a lot: what kind of variables we’re dealing with, how the data is structured, and what might be worth visualizing.

We’re looking for answers to questions like:

  • Are we working with categories, numbers, or both?

  • What does the data structure look like?

  • Which columns might be worth comparing or breaking down visually?

Variable types

Understanding variable types in a dataset is a crucial first step in exploratory data analysis (EDA). It affects everything, from how we summarize values to how we visualize patterns or detect issues. Misclassify a variable, and we risk making incorrect assumptions. For instance, treating dates as text can break time-based plots or sort events out of order, like showing December before March.
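To make the December-before-March problem concrete, here is a small sketch using hypothetical month labels. It shows how sorting dates stored as text gives alphabetical rather than chronological order, and how converting them fixes it:

```python
import pandas as pd

# Hypothetical month labels stored as plain text
dates_as_text = pd.Series(["December 2023", "March 2024", "August 2024"])

# Alphabetical sort: August < December < March, so December lands before March
print(sorted(dates_as_text))

# Converting to real datetimes restores chronological order
dates = pd.to_datetime(dates_as_text, format="%B %Y")
print(dates.sort_values())
```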

Fun fact: Misinterpreting a variable type is like trying to measure temperature with a ruler: we’ll get numbers, but they won't make sense!

Here are the three most common variable types we typically encounter:

1. Categorical

Categorical variables group data into distinct categories or labels. To analyze them effectively, bar charts, count plots, and pie charts are commonly used. Correctly identifying categorical variables helps avoid mistakes like calculating meaningless averages and ensures accurate grouping and comparison.

They can be:

  • Nominal: No natural order (e.g., gender, payment method).

  • Ordinal: Have a logical order (e.g., education level: High School < Bachelor < Master).

| Variable Name   | Example Values                  |
|-----------------|---------------------------------|
| gender          | 'Male', 'Female', 'Other'       |
| payment_method  | 'Visa', 'PayPal', 'Cash'        |
| region          | 'North', 'South', 'East'        |
| education_level | 'High School', 'Master', 'PhD'  |

We can count how often each value appears, but we don’t calculate averages. These variables are essential for grouping, filtering, or comparing categories in the data.
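Counting is exactly what pandas offers for categorical columns. A minimal sketch, using a hypothetical payment_method column, shows how value_counts() summarizes categories without any averaging:

```python
import pandas as pd

# Hypothetical categorical column of payment methods
payment_method = pd.Series(["Visa", "PayPal", "Visa", "Cash", "Visa", "PayPal"])

# Count how often each category appears
counts = payment_method.value_counts()
print(counts)

# Share of each category, handy for comparisons
shares = payment_method.value_counts(normalize=True)
print(shares)
```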

2. Numerical

Numerical variables represent measurable quantities that can be counted or measured. Common visualizations for numerical data include histograms, box plots, and scatter plots. Correctly identifying numerical variables ensures we apply appropriate statistical summaries like averages, and avoid treating them as categories.

Numerical variables can be:

  • Continuous: Can take any value within a range (e.g., income, height).

  • Discrete: Countable values, often integers (e.g., number of purchases, age in years).

| Variable Name   | Example Values       |
|-----------------|----------------------|
| age             | 25, 40, 35, 50       |
| income          | 50000, 60000, 58000  |
| purchase_amount | 200, 150, 300        |

These variables are essential for calculations, trend analysis, and examining relationships between variables.
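As a quick sketch, here is the kind of calculation that only makes sense for numerical variables, using small hypothetical columns mirroring the table above:

```python
import pandas as pd

# Hypothetical numeric columns
df = pd.DataFrame({
    "age": [25, 40, 35, 50],
    "income": [50000, 60000, 58000, 52000],
    "purchase_amount": [200, 150, 300, 250],
})

# Averages and spread apply to numerical variables
print(df["income"].mean())
print(df["income"].std())

# Relationship between two numeric columns
print(df["age"].corr(df["purchase_amount"]))
```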

3. Datetime

Datetime variables represent dates and times, showing order, duration, or specific moments. Line charts and time series plots are best for visualizing datetime data. Correctly recognizing datetime variables helps us analyze trends over time and avoid errors like sorting dates as text. Here are some examples of datetime variables:

| Variable Name   | Example Values           |
|-----------------|--------------------------|
| order_date      | 2024-01-15, 2024-05-10   |
| login_timestamp | 2024-06-08 14:35:00      |
| event_time      | 09:30:00, 18:45:00       |

Datetime variables track when events occur, durations, or timestamps. They enable time-based grouping and forecasting.
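Time-based grouping is straightforward once a column has a true datetime type. A small sketch, with hypothetical order data, groups amounts by month:

```python
import pandas as pd

# Hypothetical order dates and amounts
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-20", "2024-05-10"]),
    "amount": [200, 150, 300],
})

# Datetime columns unlock time-based grouping, e.g., monthly totals
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```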

Scanning the dataset

Let’s explore five quick tools in pandas that help us scan any new dataset. They give us an immediate sense of structure, size, and variable types, all critical for deciding how to explore and visualize the data.

1. head()—What’s in this data?

The head() method shows the first few rows of the dataset, providing a quick look at the values in each column. This helps us guess whether a column holds categories (like “Male” or “M”), numbers (like “25”), or dates (like “2023-05-01”).

Suppose data.csv contains:

customer_id,age,income,gender,purchase_amount
1,25,50000,M,200
2,40,60000,F,150
3,35,58000,F,300
4,50,52000,M,250
5,23,48000,F,100

import pandas as pd
df = pd.read_csv("data.csv")
# First five rows
print(df.head())

Calling df.head() displays the first five rows of the dataset, giving us a quick look at columns like customer_id, age, income, gender, and purchase_amount.

2. shape—How much data is there?

The shape attribute reveals the number of rows and columns in the dataset. Knowing the size helps decide if we can plot raw data or need to summarize or sample it first for clearer visuals.

import pandas as pd
df = pd.read_csv("data.csv")
# Rows and columns
print(df.shape)

Calling df.shape gives (5, 5), indicating 5 records with 5 features each.

3. dtypes—What type is each variable?

The dtypes attribute shows the data type of each column (object, int64, float64, datetime64). This is key to understanding how pandas interprets each variable and whether conversions (e.g., strings to dates) are needed.

Understanding data types is essential because it tells us how pandas currently interprets the data and informs the types of visualizations that make sense. For example, object often means categorical data stored as strings, while int64 and float64 represent numerical values.

Datetime data might appear as datetime64, but sometimes it’s stored as an object type and requires conversion for proper time-based analysis.
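A short sketch of that conversion, using a hypothetical order_date column: the column arrives as object after reading a CSV, and pd.to_datetime turns it into a proper datetime type:

```python
import pandas as pd

# Dates read from a CSV often arrive as plain strings (dtype: object)
df = pd.DataFrame({"order_date": ["2024-01-15", "2024-05-10"]})
print(df.dtypes)  # order_date is object, not datetime64

# Convert before any time-based analysis
df["order_date"] = pd.to_datetime(df["order_date"])
print(df.dtypes)  # now datetime64[ns]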

import pandas as pd
df = pd.read_csv("data.csv")
# Data type of each column
print(df.dtypes)

Calling df.dtypes returns a pandas Series whose values are the data types of each DataFrame column, and the Series itself is labeled with dtype: object.

Fun fact: dtypes helps us speak the data’s language; without it, we might try to do math with words!

4. info()—What are we working with?

The info() method summarizes column data types, counts non-null values, highlights missing data, and shows memory usage. This overview helps us assess data completeness, confirm variable types, and understand the dataset’s size before analysis and visualization.

import pandas as pd
df = pd.read_csv("data.csv")
# Data types and missing values
print(df.info())

For instance, a column with type object may represent labels or categories, while an int64 column likely holds numeric values, which are candidates for histograms, box plots, or scatter plots.

5. describe()—What are the summary statistics?

The describe() method provides summary stats: for numeric data, it shows mean, min, max, and quartiles; for categorical data, frequency counts. This helps spot outliers, distribution shapes (how values spread across a range: normal, skewed, uniform, or bimodal), and data quality issues.

import pandas as pd
df = pd.read_csv("data.csv")
# Summary statistics
print(df.describe())

Calling df.describe() gives a statistical summary of numeric columns, including count, mean, standard deviation, min, and quartiles. For instance, it shows the average income is 53,600 and the average purchase amount is 200 across 5 customers.
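By default, describe() summarizes only numeric columns; passing include="object" summarizes the categorical ones. A minimal sketch with a hypothetical dataset resembling the lesson’s data.csv:

```python
import pandas as pd

# Hypothetical mixed dataset like the lesson's data.csv
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "income": [50000, 60000, 58000, 52000, 48000],
})

# Numeric columns: count, mean, std, min, quartiles, max
print(df.describe())

# Categorical (object) columns: count, unique, top value, frequency
cat_summary = df.describe(include="object")
print(cat_summary)
```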

Asking the right questions early

After identifying our variable types and scanning the dataset, the next step is to ask smart, focused questions. These questions guide our exploration, help uncover patterns, and shape the story the data wants to tell.

Informational note: Great dashboards or charts start with great questions. Analysts at top companies often run EDA sessions in teams to align on what to explore, before building any visual.

Here are some common questions we can ask:

  • Which variables could influence our outcome later?
    Thinking ahead about which variables might explain patterns or differences in the data helps us ground our exploration. By identifying potentially influential factors early, we can examine them more closely, compare them across groups, and better understand their role in the bigger picture.

  • What relationships or comparisons might be worth plotting?
Looking for patterns across categories, time periods, or groups can reveal important trends and anomalies. Visualizing distributions, correlations (how two variables move in relation to each other: a strong correlation means that when one changes, the other tends to change in a predictable way), and differences between groups provides insights that raw data alone might miss.

  • Are there unexpected or inconsistent values?
    During EDA, visualizations often reveal data quality issues such as missing values where data should exist or inconsistencies in data types. Spotting these early helps prevent misleading conclusions, and guides necessary cleaning steps before further analysis.

  • Which columns might be interesting to explore further?
    Variables indicating wide variation, unusual distributions, or unexpected values often highlight areas worth a deeper investigation. These insights help focus analysis on the most informative parts of the dataset.

  • Which values stand out as outliers or anomalies?
    Outliers might represent important insights or data errors. Distinguishing between intentional unusual points and errors that need removal is key for maintaining data integrity.

These early questions aren’t just a checklist; they’re how we begin transforming raw data into insight. By asking them thoughtfully, we start noticing patterns, discovering outliers, and generating hypotheses. In essence, this is where data exploration becomes data storytelling.

Wrap-up

In this lesson, we learned that getting to know our data isn’t just a technical step: it’s a mindset. By understanding variable types, we choose the right tools to explore. By asking thoughtful early questions, we shape our path toward discovery.

Exploratory data analysis is like meeting a new teammate. The more time we spend understanding what the data says (and what it doesn’t), the better decisions we’ll make down the line. Patterns become clearer. Problems become solvable. Stories begin to take shape.

Technical Quiz

1. Which function in pandas gives us a quick preview of the first few rows of a DataFrame?

A) shape()

B) info()

C) head()

D) describe()