
Scatterplots

Explore how to build scatterplots using the ggplot2 package in R's Tidyverse framework. Understand the components of the grammar of graphics, filter datasets for focused analysis, and interpret the relationships between numerical variables visually.

Needed packages

Let’s load all the packages needed for the upcoming programs:

library(nycflights13)
library(ggplot2)
library(dplyr)

Five named graphs: The 5NG

In order to keep things simple in this course, we’ll only focus on five different types of graphics, each with a common given name. We term these the five named graphs, or in abbreviated form, the 5NG:

  1. Scatterplots

  2. Line graphs

  3. Boxplots

  4. Histograms

  5. Barplots

Overview of scatterplots

The simplest of the 5NG are scatterplots, also called bivariate plots. They allow us to visualize the relationship between two numerical variables. While we might already be familiar with scatterplots, let’s view them through the lens of the grammar of graphics. Specifically, we’ll visualize the relationship between the following two numerical variables in the flights data frame included in the nycflights13 package:

  • dep_delay: This is the departure delay on the horizontal x-axis.

  • arr_delay: This is the arrival delay on the vertical y-axis.

We’ll do this only for Alaska Airlines flights that left NYC in 2013. This requires paring down the data from all 336,776 flights that left NYC in 2013 to only the 714 Alaska Airlines flights. We do this so that our scatterplot involves a manageable 714 points rather than an overwhelming 336,776. To achieve this, we’ll take the flights data frame and filter its rows so that only the 714 rows corresponding to Alaska Airlines flights are kept. Then, we’ll save the result in a new data frame called alaska_flights using the <- assignment operator:

R
alaska_flights <- flights %>% filter(carrier == "AS")

For now, we shouldn’t worry if we don’t fully understand this code. When covering data wrangling, we’ll learn how this code uses the dplyr package to achieve our goal. It takes the flights data frame and applies a filter to it, returning only the rows where carrier is equal to "AS", which is Alaska Airlines’ carrier code. Recall that testing for equality is specified with == and not =. The resulting alaska_flights data frame has 714 rows, one for each Alaska Airlines flight.
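As a quick check on the filtering step, the row counts before and after the filter can be compared. This is a minimal sketch; alaska_base is a hypothetical name, and the base R subsetting line is included only for comparison with the dplyr version.

```r
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>% filter(carrier == "AS")

nrow(flights)          # all flights; the text reports 336,776
nrow(alaska_flights)   # Alaska Airlines only; the text reports 714

# Every remaining row carries the "AS" carrier code
unique(alaska_flights$carrier)

# Base R equivalent of the same filter (shown for comparison only)
alaska_base <- flights[flights$carrier == "AS", ]
nrow(alaska_base)
```

Using = instead of == inside filter() would not test for equality, which is why the distinction matters here.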

Scatterplots via geom_point

Let’s now go over the code that will create the desired scatterplot while keeping in mind the grammar of graphics framework. Let’s take a look at the code and break it down piece by piece.

Within the ggplot() function, we specify two of the components of the grammar of graphics as arguments (i.e., inputs):

  • The data is the alaska_flights data frame via data = alaska_flights.

  • The aesthetic mapping is done by setting mapping = aes(x = dep_delay, y = arr_delay). Specifically, the variable dep_delay maps to the x position aesthetic, while the variable arr_delay maps to the y position.

We then add a layer to the ggplot() function call using the + sign. The added layer specifies the third component of the grammar: the geometric object. In this case, the geometric object is set to be points by specifying geom_point(). After running this code in our console, we’ll notice two outputs: a warning message and the graphic shown below.

Note: The warning message states that five rows containing missing values were removed by geom_point().

R
alaska_flights <- flights %>% filter(carrier == "AS")
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()


Let’s take a closer look at the graph. Observe that a positive relationship exists between dep_delay and arr_delay—as departure delays increase, arrival delays also tend to increase. Also observe the large mass of points clustered near (0, 0), the point indicating flights that neither departed nor arrived late.

Let’s turn our attention to the warning message. R is alerting us to the fact that five rows were ignored because they contain missing values. For these five rows, the value of dep_delay, arr_delay, or both is missing (recorded in R as NA), and thus, these rows were left out of our plot.
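To see which rows trigger the warning, one option is to count the Alaska Airlines rows where either delay is missing. The sketch below uses is.na() from base R with the same filter() function as before; the names missing_delays and scatter_quiet are illustrative choices, not part of the original code. It also shows the na.rm = TRUE argument of geom_point(), which drops those rows without emitting the warning.

```r
library(nycflights13)
library(ggplot2)
library(dplyr)

alaska_flights <- flights %>% filter(carrier == "AS")

# Rows where dep_delay, arr_delay, or both are NA
missing_delays <- alaska_flights %>%
  filter(is.na(dep_delay) | is.na(arr_delay))

# Should agree with the number of rows reported in the warning
nrow(missing_delays)

# na.rm = TRUE removes the same rows silently
scatter_quiet <- ggplot(data = alaska_flights,
                        mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(na.rm = TRUE)
scatter_quiet
```

Silencing the warning is a matter of taste; leaving it on can be a useful reminder that some observations are not shown.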

Before we continue, let’s make a few more observations about the code that created the scatterplot. Note that the + sign comes at the end of a line and not at the beginning. We’ll get an error in R if we put it at the beginning of a line. When adding layers to a plot, we’re encouraged to start a new line after the + (by pressing the “Enter” key on our keyboard) so that the code for each layer is on its own line. As we add more and more layers to our plots, we’ll see that this greatly improves the legibility of our code.
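The placement rule can be illustrated with a tiny made-up data frame. In this minimal sketch, toy_delays and its values are hypothetical and are not taken from flights; the incorrect form is left commented out so that the block runs cleanly.

```r
library(ggplot2)

# Hypothetical toy data, used only to illustrate + placement
toy_delays <- data.frame(
  dep_delay = c(-5, 0, 12, 45),
  arr_delay = c(-10, 2, 8, 50)
)

# Correct: the + ends the first line, so R knows the expression continues
toy_plot <- ggplot(data = toy_delays,
                   mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()
toy_plot

# Incorrect (left commented out): R treats the first line as a complete
# statement, then fails on the line that starts with +
# ggplot(data = toy_delays, mapping = aes(x = dep_delay, y = arr_delay))
# + geom_point()
```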

To stress the importance of adding the layer that specifies the geometric object, consider the next graph, where no layers are added. Because the geometric object isn’t specified, we’re left with a blank plot that isn’t very useful!

R
alaska_flights <- flights %>% filter(carrier == "AS")
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
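One way to confirm that the blank plot has no geometric object is to store it and inspect its list of layers. This is a sketch only; base_plot and scatter are hypothetical names, and it assumes the plot object exposes its layers through $layers, as ggplot objects usually do.

```r
library(nycflights13)
library(ggplot2)
library(dplyr)

alaska_flights <- flights %>% filter(carrier == "AS")

# Data and aesthetic mapping only: axes are drawn, but no points
base_plot <- ggplot(data = alaska_flights,
                    mapping = aes(x = dep_delay, y = arr_delay))
length(base_plot$layers)   # no layers have been added yet

# Adding geom_point() to the stored plot supplies the missing layer
scatter <- base_plot + geom_point(na.rm = TRUE)
length(scatter$layers)
```

Storing the base plot in a variable also makes it easy to try different geometric objects on the same data and mapping.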