Getting Started with the ggplot2 Package
Explore the fundamentals of ggplot2 in R, including the grammar of graphics and layered plot construction. Understand how ggplot2 differs from the base graphics package and why it is preferred for creating clear, customizable data visualizations using fewer lines of code.
Overview of the base graphics package in R
The base graphics system lives in the graphics package, which is distributed with base R. It is often used for creating data graphics, and it provides many global parameters that can be set and manipulated to build a chart.
Let’s quickly understand how to create plots using the base graphics package before introducing the ggplot2 package. The base graphics package in R offers several high-level functions, such as:
- plot(): This is a generic function that can draw many kinds of plots.
- boxplot(): This creates single and side-by-side boxplots.
- hist(): This function creates histograms.
- qqplot(), qqnorm(), and qqline(): These functions build quantile plots.
- dotchart() and stripchart(): These functions are for creating dot plots.
- image(), contour(), and persp(): These functions visualize three-dimensional data.
- pairs(): This function plots scatter plot matrices.
Additionally, the base graphics package offers low-level functions such as lines(), points(), symbols(), arrows(), and many more for adding elements to a plot.
Now, let’s try a few examples to understand how data visualizations can be created using the base graphics package.
First, we’ll prepare a dataset with random numbers using the following code:
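The original code for this step is not shown here, so the following is only a minimal sketch of what such a dataset might look like. The column names y and z come from the text below; the category column, the sample size, and the distributions are assumptions made for illustration.

```r
# Hypothetical reconstruction of the example dataset.
# Only the column names y and z are taken from the lesson text;
# "category", n = 100, and the distributions are assumptions.
set.seed(42)  # fix the seed so the random numbers are reproducible

n <- 100
data <- data.frame(
  y = rnorm(n, mean = 50, sd = 10),   # normally distributed values
  z = runif(n, min = 0, max = 100),   # uniformly distributed values
  category = factor(
    sample(c("A", "B", "C"), n, replace = TRUE),
    levels = c("A", "B", "C")
  )
)

head(data)  # inspect the first few rows
```

Setting the seed is optional, but it keeps the plots identical from one run to the next.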
Histogram
A simple histogram of the data dataset can be plotted using the hist() function in the graphics package.
Run the following code to draw the histogram:
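As a rough sketch, assuming the histogram is drawn for the y column (the lesson does not say which column is used) and using a stand-in dataset:

```r
# Stand-in dataset; the column choice and distribution are assumptions.
set.seed(42)
data <- data.frame(y = rnorm(100, mean = 50, sd = 10))

# hist() draws the histogram and invisibly returns the binning details.
h <- hist(
  data$y,
  main = "Histogram of y",  # plot title
  xlab = "y",               # x-axis label
  col  = "lightblue"        # fill color of the bars
)
```

The returned object h holds the break points and counts, which can be handy for checking how the values were binned.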
Scatter plot
Another example we can try with the data dataset is a scatter plot. The following example plots the data points in the y and z columns using the plot() function.
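A minimal sketch of such a scatter plot, assuming y goes on the x-axis and z on the y-axis, with a stand-in dataset:

```r
# Stand-in dataset; the distributions are assumptions.
set.seed(42)
data <- data.frame(
  y = rnorm(100, mean = 50, sd = 10),
  z = runif(100, min = 0, max = 100)
)

# plot() with two numeric vectors draws a scatter plot.
plot(
  data$y, data$z,
  main = "Scatter plot of z against y",
  xlab = "y",
  ylab = "z"
)
```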
The above scatter plot can be redrawn to show the different categories using colors. We can also add a legend to this plot using a separate line of code.
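One way this could look is sketched below. The category column and the color choices are assumptions made for illustration, not part of the original lesson code.

```r
# Stand-in dataset; "category" is a hypothetical grouping column.
set.seed(42)
data <- data.frame(
  y = rnorm(100, mean = 50, sd = 10),
  z = runif(100, min = 0, max = 100),
  category = factor(
    sample(c("A", "B", "C"), 100, replace = TRUE),
    levels = c("A", "B", "C")
  )
)

# One color per category level, then one color per data point.
palette_cols <- c("steelblue", "darkorange", "forestgreen")
point_cols   <- palette_cols[as.integer(data$category)]

plot(
  data$y, data$z,
  col  = point_cols,
  pch  = 19,          # solid circles
  xlab = "y",
  ylab = "z"
)

# The legend has to be added explicitly in base graphics.
legend(
  "topright",
  legend = levels(data$category),
  col    = palette_cols,
  pch    = 19,
  title  = "Category"
)
```

Note that the mapping from category to color and the legend are both built by hand, so they have to be kept in sync manually.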
The base graphics package builds the base plot in two phases:
- Initializing a new plot
- Annotating an existing plot
The generic plot() function in R accepts two vectors (one for the x-axis and the other for the y-axis coordinates).
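The two phases can be illustrated with a small sketch. The vectors x and y here are made-up example data.

```r
# Made-up example data.
x <- 1:10
y <- x^2

# Phase 1: initialize a new plot from two vectors.
plot(x, y, main = "Two-phase base plot")

# Phase 2: annotate the existing plot with low-level functions.
lines(x, y, lty = 2)                      # connect the points with a dashed line
abline(h = mean(y), col = "red")          # horizontal line at the mean of y
text(x = 3, y = 80, labels = "y = x^2")   # text label inside the plot region
```

Each annotation call draws on top of whatever plot is currently open, so the order of the calls matters.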
Next, let’s look at some important parameters used for base graphics plots:
- pch: This is the plotting symbol parameter (default = open circle).
- lty: This is the line type parameter (default = solid line but can be dashed, dotted, etc.).
- lwd: This is the line width parameter (a positive number, interpreted as a multiple of the default width).
- col: This is the plotting color parameter (defined as a number, string, or hex code).
- xlab: This is the character string for the x-axis label.
- ylab: This is the character string for the y-axis label.
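The parameters listed above can be tried out in a short sketch. The data and the specific values are made up for illustration.

```r
# Made-up example data: one period of a sine wave.
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

# Pass the parameters directly to a single plot() call.
plot(
  x, y,
  type = "b",             # draw both points and lines
  pch  = 17,              # filled triangles instead of open circles
  lty  = 2,               # dashed line
  lwd  = 2,               # twice the default line width
  col  = "#1F77B4",       # color given as a hex code
  xlab = "x (radians)",
  ylab = "sin(x)"
)

# pch, lty, lwd, and col can also be set globally with par();
# par() returns the previous values so they can be restored later.
old_par <- par(lwd = 2, col = "darkgreen")
plot(x, y, type = "l", xlab = "x (radians)", ylab = "sin(x)")
par(old_par)  # restore the earlier settings
```

Setting a value through par() affects every later plot on the device, which is why restoring the old settings is a good habit. The axis labels xlab and ylab are passed to the plotting call rather than set through par().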
Although the base graphics package helps to create simple visualizations, customizing them requires additional lines of code, which increases both the complexity and the time required. Here, the ggplot2 package can be a better alternative for beginners as well as for individuals already using base R graphics, since it typically needs fewer lines of code for the same type of graph.
We have seen the approach to building simple visualizations in the base graphics package in R.
Let’s explore the ggplot2 package and learn how to generate different data visualizations using the package.
Introduction to the tidyverse package
The tidyverse package is a collection of multiple R packages designed for data science that share the same design philosophy. The included packages serve different data science needs, such as data exploration, manipulation, visualization, and more. We’ll find the following well-known packages in the core tidyverse package:
- The readr package is for data importing and reading rectangular (tabular) data.
- The tidyr package is for tidying data.
- The dplyr package is for data transformation.
- The tibble package is for handling data frames.
- The stringr package is for working with strings.
- The purrr package is for working with functions and vectors.
- The forcats package is for working with factors and handling categorical variables in a dataset.
- The ggplot2 package is for data exploration and visualization.
Apart from the packages mentioned above, the tidyverse package offers several other packages that work well together. We can find more information on the tidyverse packages on the official tidyverse website.
This concludes the necessary background for data visualization in R. It brings us to the main hero of this course—the ggplot2 package.
What is the ggplot2 package?
The ggplot2 package is a popular data visualization package, considered an alternative to the base graphics package in R. It is an open-source declarative package developed by Hadley Wickham in 2005 for generating various types of data visualizations in R. The gg in ggplot2 stands for grammar of graphics, a concept that describes plots using a grammar of composable parts. The ggplot2 package is an application of the concepts in Leland Wilkinson’s book, “The Grammar of Graphics,” whose purpose was to lay down a set of broad unifying principles for data presentation.
The ggplot2 package includes a few fundamental functions, making it simple to understand and use. It is possible to combine these functions in various ways to create multiple visuals based on the grammar of graphics.
Why is the ggplot2 package so popular?
The ggplot2 package sets reasonable default values, such as automatically adding legends to plots, so users can create good-looking data visualizations with little effort. These defaults enable users to work with the ggplot2 package without knowing the underlying grammar.
Overall, this approach signifies that users can focus on creating the chart that best reveals the story in their data instead of worrying about how to make the charts pleasant and eye-catching.
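The automatic legend can be seen in a short sketch. It assumes the ggplot2 package is installed and uses a stand-in dataset whose category column is hypothetical.

```r
library(ggplot2)

# Stand-in dataset; "category" is a hypothetical grouping column.
set.seed(42)
data <- data.frame(
  y = rnorm(100, mean = 50, sd = 10),
  z = runif(100, min = 0, max = 100),
  category = factor(
    sample(c("A", "B", "C"), 100, replace = TRUE),
    levels = c("A", "B", "C")
  )
)

# Mapping category to colour is enough: ggplot2 picks the colors
# and draws the matching legend without a separate call.
p <- ggplot(data, aes(x = y, y = z, colour = category)) +
  geom_point()

print(p)
```

Compared with the base graphics version, there is no hand-built color vector and no explicit legend call.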
Nevertheless, some knowledge of the grammar of graphics helps us build charts from concepts rather than from memorized commands and options, which in turn leads to better charts.
The declarative style of the ggplot2 package means that data visualizations can be built iteratively, i.e., by adding one layer at a time. For example, we can start with one layer showing the raw data and then add more layers for annotations and so on. This approach is similar to the way we think about and analyze data. Therefore, the ggplot2 package makes it easy to build complex graphics iteratively.
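The layer-by-layer idea can be sketched as follows. It assumes the ggplot2 package is installed; the dataset, the reference line, and the label text are made up for illustration.

```r
library(ggplot2)

# Stand-in dataset; the distributions are assumptions.
set.seed(42)
data <- data.frame(
  y = rnorm(100, mean = 50, sd = 10),
  z = runif(100, min = 0, max = 100)
)

# Start with the data and the aesthetic mapping; no layers yet.
p <- ggplot(data, aes(x = y, y = z))

# Layer 1: the raw data as points.
p <- p + geom_point()

# Layer 2: a dashed reference line at the mean of z.
p <- p + geom_hline(yintercept = mean(data$z), linetype = "dashed")

# Layer 3: a text annotation next to the reference line.
p <- p + annotate(
  "text",
  x = min(data$y), y = mean(data$z),
  label = "mean of z", hjust = 0, vjust = -0.5
)

# Labels are not a layer, but they are added with the same + syntax.
p <- p + labs(title = "z against y", x = "y", y = "z")

print(p)
```

Because each step returns a complete plot object, any intermediate version of p can be printed to check the result before adding the next layer.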