Data Exploration and Error Checking

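Whenever we start working with a dataset in R, we should first devote substantial time to checking it for errors. Two questions are especially worth asking: did each column import as the data type we expect, and does each factor have the number of levels we expect?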

For example, if a column that’s supposed to be numerical shows up as a factor (or as character data), that likely indicates a typo where we accidentally have text in place of a number. Remember, each column in a data frame is a vector, and vectors can only have one mode. So, a vector with both numbers and characters is treated as if it’s all characters. Similarly, if we have a factor that should have three categories but imports with four, we likely have a typo (for example, “predator” versus “predtaor”), and the misspelled version is showing up as a separate category. These sorts of mistakes are very common!
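Both symptoms can be reproduced in a few lines. The following is a minimal sketch using made-up values (the vector names mass and pred and the level labels are hypothetical, not taken from the dataset). It shows how one stray text entry changes the mode of a whole vector and how a misspelled label adds a factor level.

```r
# A vector can hold only one mode, so a single text entry
# turns an otherwise numeric vector into character data.
mass <- c(0.31, 0.42, "0.5g", 0.38)   # "0.5g" is a made-up data-entry typo
class(mass)                           # "character", not "numeric"

# Converting to numeric turns the bad entry into NA, which lets us locate it.
mass_num <- suppressWarnings(as.numeric(mass))
bad_rows <- which(is.na(mass_num))
bad_rows                              # position of the problem value

# A misspelled label becomes its own factor level.
pred <- factor(c("control", "predator", "predtaor", "nonlethal"))
nlevels(pred)                         # 4 levels where we expected 3
levels(pred)                          # the typo appears in the list of levels
```

Checking class() and levels() right after import is usually enough to catch both kinds of problem.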

Because this dataset has been very thoroughly examined, these types of errors aren’t present. However, we may want to change the names of columns or remove outliers, which we’ll cover in the subsequent sections.

Data structure

We begin by examining the structure of the data frame with the str() function.
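As a minimal sketch of this call, assume a small simulated stand-in data frame. The name frogs, the values, and the level labels are hypothetical. Only the column names are borrowed from the lesson, so the counts reported here will differ from those of the real dataset.

```r
# Simulated stand-in data; NOT the lesson's dataset.
set.seed(1)
frogs <- data.frame(
  Tank        = rep(1:12, times = 2),            # tank number within a block
  Tank.Unique = 1:24,                            # unique tank number
  Pred        = factor(rep(c("Control", "Nonlethal", "Lethal"),
                           length.out = 24)),    # level labels are assumed
  Age.DPO     = sample(35:60, 24, replace = TRUE),
  Mass.final  = round(runif(24, 0.2, 0.6), 2)
)

# str() reports the number of observations and variables,
# then the type and first few values of every column.
str(frogs)

# The same information in pieces: dimensions and column classes.
dim(frogs)
sapply(frogs, class)
```

In the str() output, int marks integer columns, num marks numeric columns, and Factor lines list the number of levels followed by the first few level labels.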


We can see that our dataset has 2502 observations of 14 different variables, some of which are integers, some are factors, and some are numerical. The following are things to notice:

Several variables are listed twice but are coded in different ways. For example, there’s a column titled Tank and one titled Tank.Unique. As stated earlier, there are 12 tanks in each of the eight blocks. The variable Tank lists what number a tank is (1 through 12) in a given block, whereas Tank.Unique provides each tank with a unique number out of the entire 96.
Similarly, we have the columns Age.DPO and Age.FromEmergence. The first column, Age.DPO, is the age of the frogs at the time of their emergence from the water, measured in days post-oviposition (DPO). The Age.FromEmergence column instead counts the day the first animal crawled out of the water as day 1, so each animal’s age is recorded relative to when emergence began. Sometimes, it can be helpful to view the same data in two different ways.
We have three categorical predictors or factors: Hatching age, Predator treatment, and Resource level. Each factor has several levels or categories, which we can see in the str() output.
We have several response variables—for example, SVL or Mass—which are measured at the initial point when the froglets left the water, at the end of metamorphosis when the tail was fully resorbed, or both.

Data exploration and visualization

We begin by plotting the data to check for errors. The default plot() function creates a simple graphic based on the data we provide. To access a named variable within a data frame, we use the $ operator, as in data_frame$variable. The data frame always goes first, then the $, and then the name of the column we are interested in. For example, we may type in the following to look at our data:
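A minimal sketch of the $ syntax, assuming a tiny made-up data frame (the name frogs, the values, and the level labels are hypothetical; the column names follow the lesson):

```r
# Made-up stand-in data; NOT the lesson's dataset.
frogs <- data.frame(
  Pred = factor(c("Control", "Control", "Control",
                  "Nonlethal", "Nonlethal", "Lethal")),
  Mass.final = c(0.41, 0.38, 0.45, 0.36, 0.40, 0.52)
)

# Data frame first, then $, then the column name.
pred <- frogs$Pred
class(pred)          # the column comes back as a single vector (here a factor)

# Passing that column to plot() draws it with the default style for its type.
plot(frogs$Pred)
```

The expression frogs$Pred returns the same vector as frogs[["Pred"]], so either form can be handed to plot().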


The plot generated by the code given above shows that the number of individuals surviving metamorphosis in the three Predator treatments varies considerably, from around 1,200 froglets in the Control group to approximately 500 in the Nonlethal group.

Note: The plot style changes depending on whether we plot a continuous variable or a factor. For the continuous variable, the default is to plot the data in order, from the first row to the last. In the case of a factor, the default is to plot the number of observations in each group.
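The two default styles can be compared side by side. This is a sketch on made-up data (the name frogs, the values, and the level labels are hypothetical):

```r
# Made-up stand-in data; NOT the lesson's dataset.
frogs <- data.frame(
  Pred = factor(c("Control", "Control", "Control",
                  "Nonlethal", "Nonlethal", "Lethal")),
  Mass.final = c(0.41, 0.38, 0.45, 0.36, 0.40, 0.52)
)

# Continuous variable: values are plotted in row order (an index plot).
plot(frogs$Mass.final)

# Factor: a bar chart whose bar heights are the counts per level.
plot(frogs$Pred)

# table() returns the same counts that the bar chart displays.
counts <- table(frogs$Pred)
counts
```

An index plot is handy for spotting a single extreme value, while the bar chart quickly reveals an unexpected extra level.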

Further data exploration and identifying mistakes

Plotting a single variable by itself can be helpful, for instance when we want to check for outliers or find typos (such as a numeric variable that plots as a factor). However, it’s often more helpful to plot response data against an explanatory variable. For example, we may want to know how the final mass of metamorphs varies across predator treatments. Here, we use the ~ sign to separate our response variable from a predictor variable. Let’s examine the relationship of Mass.final and Predator treatment by plotting Mass.final~Pred.
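A minimal sketch of this formula-style plot, assuming simulated stand-in data (the name frogs, the group means, and the level labels are hypothetical; the column names follow the lesson):

```r
# Simulated stand-in data; NOT the lesson's dataset.
set.seed(42)
frogs <- data.frame(
  Pred = factor(rep(c("Control", "Nonlethal", "Lethal"), each = 20)),
  Mass.final = c(rnorm(20, mean = 0.40, sd = 0.05),
                 rnorm(20, mean = 0.38, sd = 0.05),
                 rnorm(20, mean = 0.50, sd = 0.08))
)

# response ~ predictor; the data argument tells plot() where to find the columns.
plot(Mass.final ~ Pred, data = frogs)

# The same plot using the $ syntax:
# plot(frogs$Mass.final ~ frogs$Pred)

# Numeric companion to the plot: the median mass in each treatment.
group_medians <- tapply(frogs$Mass.final, frogs$Pred, median)
group_medians
```

Note that factor levels are ordered alphabetically by default, which determines the order of the groups along the x-axis.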


Note: By providing a categorical variable as our predictor, R automatically knew to make a box-and-whisker plot, also known as a boxplot. There aren’t many instances when R will think for us, but this is one of them.

Looking at the plot generated by the code given above, there are several things to know about how R draws a boxplot.

First, the top and bottom of each box mark the third and first quartiles, so the box spans the interquartile range (IQR), that is, the middle 50% of our data. Thus, 25% of the metamorphs in each predator treatment are heavier than the top of their respective box, and 25% are lighter than the bottom.
Second, the heavy dark line in the middle of the box is the median, not the mean as many observers may initially think.
Third, each “whisker” extends to whichever of the following is closer to the box:
1. The maximum or minimum value of the data.
2. The most extreme data point that lies within 1.5 times the IQR of the edge of the box.
In the event of the second option, R plots every value that falls beyond 1.5 times the IQR as an individual point. So, what does that mean in practice? If we look at the plot generated by the code above, we can see that the bottom whiskers are all just that, a whisker. That means each has been drawn to the smallest value in its group, and that value falls within 1.5 times the IQR of the bottom of the box. The upper whiskers have many points above them, meaning that the whiskers stop at the last observation within 1.5 times the IQR of the top of the box, and the points plotted above them fall outside that range.
Lastly, notice that we’ve introduced a new syntax. We can use the ~ sign to denote a relationship between two vectors, usually thought of as response~predictor. This structure will be used later for defining statistical models and can be expanded to incorporate multiple predictors, for example, response~predictor1 + predictor2, and so on.
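The whisker rule can be checked numerically with boxplot.stats(), which returns the same five numbers that boxplot() draws. The values below are made up for illustration (the vector name mass is hypothetical). Note that R builds the box from “hinges”, which are close to, but not always identical to, the quartiles.

```r
# Ten made-up masses with one unusually large value.
mass <- c(0.30, 0.34, 0.35, 0.37, 0.38, 0.40, 0.41, 0.43, 0.45, 0.80)

b <- boxplot.stats(mass)   # default coef = 1.5, the same rule boxplot() uses

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Width of the box and the upper cutoff for the whisker.
box_width   <- b$stats[4] - b$stats[2]
upper_fence <- b$stats[4] + 1.5 * box_width

# Values beyond the cutoff are returned separately and drawn as points.
b$out

# The lower whisker is simply the minimum (no low outliers);
# the upper whisker is the largest value at or below the cutoff.
b$stats[1] == min(mass)
b$stats[5] == max(mass[mass <= upper_fence])
```

Here the lower whisker reaches the smallest observation, while the upper whisker stops short of the largest one, which is flagged in b$out instead.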

What happens if we plot two continuous variables against one another instead of a continuous response versus a categorical predictor? Maybe we want to see a relationship between mass at the end of metamorphosis and SVL at the end of metamorphosis. Since we have provided two continuous variables, R will know to make a scatterplot automatically.
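A minimal sketch of such a scatterplot, assuming simulated stand-in data (the name frogs, the value ranges, and the cubic length-to-mass rule are made-up assumptions; the column names follow the lesson):

```r
# Simulated stand-in data; NOT the lesson's dataset.
set.seed(7)
svl  <- runif(50, min = 14, max = 22)               # made-up lengths
mass <- 0.00005 * svl^3 * exp(rnorm(50, 0, 0.1))    # made-up curved relationship
frogs <- data.frame(SVL.final = svl, Mass.final = mass)

# Both variables are continuous, so plot() draws a scatterplot.
plot(Mass.final ~ SVL.final, data = frogs)
```

As before, the response goes on the left of the ~ and ends up on the y-axis, while the predictor goes on the right and ends up on the x-axis.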


The plot generated by the code given above tells us several things.

There appear to be several outliers. These individuals have a very small SVL but a large mass, or vice versa. These almost certainly represent mistakes made during data entry since they’re biologically unrealistic, maybe even impossible, and should therefore be removed.
The relationship between SVL and mass isn’t linear. It curves upward, which indicates that longer frogs (those with greater SVL.final values) have disproportionately larger masses. This is expected in many length-to-mass relationships in nature, and plotting the data on logarithmic axes may make the relationship linear.

Now, let’s see if plotting on log-log axes makes the length-to-mass relationship linear. The following code takes the log of each variable and plots them against one another. Thus, the values on the axes will be in terms of the logarithm of either SVL or mass.
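A minimal sketch of the log-log plot, assuming made-up data that follow an exact cubic rule (the name frogs, the values, and the cubic rule are hypothetical; the column names follow the lesson). In R, log() is the natural logarithm.

```r
# Made-up stand-in data following an exact power law; NOT the lesson's dataset.
svl  <- seq(14, 22, by = 1)
mass <- 0.00005 * svl^3
frogs <- data.frame(SVL.final = svl, Mass.final = mass)

# Functions can be applied directly inside the formula.
plot(log(Mass.final) ~ log(SVL.final), data = frogs)

# Correlation measures how close the points are to a straight line.
r_raw <- cor(frogs$SVL.final, frogs$Mass.final)
r_log <- cor(log(frogs$SVL.final), log(frogs$Mass.final))
c(raw = r_raw, log = r_log)
```

Because a power law becomes a straight line after taking logs of both variables, the log-log correlation here is essentially 1, while the correlation on the raw scale is slightly lower. Real data will be noisier, so this is only a rough check.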

