Data Exploration and Error Checking

Explore the data using R.

Whenever we start working with a dataset in R, we should first devote substantial time to checking it for errors. These are some questions we should ask ourselves:

  • Did the data import correctly?
  • Are the column names correct?
  • Are the types of data appropriate? (e.g., factor vs numerical)
  • Are the numbers of columns and rows appropriate?
  • Are there typos?
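The questions above map onto a handful of base R calls. The following is a minimal sketch on a small made-up data frame; `toy` and its columns are hypothetical stand-ins for illustration, not part of the dataset used in this lesson.

```r
# Hypothetical toy data standing in for a freshly imported file
toy <- data.frame(
  id        = 1:4,
  mass      = c(0.31, 0.28, 0.40, 0.35),
  treatment = factor(c("control", "predator", "control", "predator"))
)

head(toy)                              # did the data import correctly?
col_names <- names(toy)                # are the column names correct?
col_types <- sapply(toy, class)        # are the data types appropriate?
n_dims    <- dim(toy)                  # number of rows, then columns
level_counts <- table(toy$treatment)   # typos show up as unexpected categories

print(col_names)
print(col_types)
print(n_dims)
print(level_counts)
```

Running the same calls on a real data frame right after import gives a quick first pass before any analysis.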

For example, if a column that’s supposed to be numerical shows up as a factor, that likely indicates a typo where text was accidentally entered in place of a number. Remember, each column in a data frame is a vector, and a vector can have only one mode, so a vector containing both numbers and characters is treated as if it’s all characters. Similarly, if a factor that should have three categories imports with four, we likely have a typo (for example, “predator” versus “predtaor”), and the misspelled version is showing up as a separate category. These sorts of mistakes are widespread!
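Both kinds of typo can be reproduced on a tiny example. This is a sketch under stated assumptions: the CSV text, the `bad` data frame, and its column names are invented for illustration, and `stringsAsFactors = TRUE` is set explicitly because `read.csv()` in R 4.0 and later imports text columns as character by default.

```r
# Hypothetical CSV with two deliberate typos:
# the letter "O" in place of a zero, and "predtaor" for "predator"
csv_text <- "mass,treatment
0.31,control
0.28,predator
O.40,predtaor
0.35,nonlethal"

bad <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# One text entry turns the whole column into a factor instead of numeric
mass_class <- class(bad$mass)

# The misspelling appears as an extra category
trt_levels <- levels(bad$treatment)

# Find the offending rows: values that fail numeric conversion become NA
mass_num <- suppressWarnings(as.numeric(as.character(bad$mass)))
bad_rows <- which(is.na(mass_num))

print(mass_class)
print(trt_levels)
print(bad[bad_rows, ])
```

Converting through `as.character()` first matters here: calling `as.numeric()` directly on a factor returns the internal level codes rather than the original values.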

Because this dataset has been very thoroughly examined, these types of errors aren’t present. However, we may want to change the names of columns or remove outliers, which we’ll cover in the subsequent sections.

Data structure

We begin by examining the structure of the data frame with the str() function.

R
str(RxP)

We can see that our dataset has 2502 observations of 14 different variables: some are integers, some are factors, and some are numerical. The following are things to notice:

  1. Several variables are listed twice but are coded in different ways. For example, there’s a column titled Tank and one titled Tank.Unique. As stated earlier, there are 12 tanks in each of the eight blocks. The variable Tank ...