Purpose of Cleaning and Data Type Checks

Purpose of data cleaning

The purpose of data cleaning is to make sure that the data is correct. It’s rarely the case that once data is collected through the game and transferred to the server, it’s automatically ready for analysis. Often, the data is incomplete, has wrong entries, or contains outliers. Thus, it’s important to check and, when possible, correct data to prepare it for analysis.

For example, in VPAL data, each row follows a fixed format, with variables in a certain order and of certain types. Position and orientation variables are all expected to be integers within certain ranges, time stamps are expected to follow a specific order with increments of 0.2 seconds, and scores and health are supposed to follow certain ranges and order. To ensure that the data is correct, it needs to go through a series of checks. These checks can be performed while the data is being read and parsed. When an error is encountered, we can use NA (or NaN) to mark the affected value and continue. The same checks can also be done after the content has been read and parsed into a data table (or data frame). Below, we discuss different methods used for type checks, range checks, etc.

Data type checks

There are many ways to check data format. When parsing the data, we can check the type or coerce the data to a specific type, and if that fails, we can raise an exception. In R, we can use different functions to check types after we read the data. For example, is.numeric is a function that returns TRUE or FALSE depending on whether the variable is numeric. In most cases, we would want to introduce an NA for cells that do not contain the correct format or type. Suppose we are looking at numeric data. In that case, we can use the as.numeric function, which converts a variable or a column in the data frame to numeric values (i.e., real numbers). Any cell that cannot be converted becomes NA, and R issues a warning. There are also functions to convert to other data types: as.logical, as.factor, as.character, and as.integer.
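As a minimal sketch of the difference between checking and converting, the snippet below uses a made-up character vector, raw_health, standing in for a column read from a log file. The variable names and values are assumptions for illustration, not part of the VPAL data.

```r
# Hypothetical raw column read as text; "abc" and "" are not valid numbers.
raw_health <- c("100", "87.5", "abc", "", "42")

# Type checks only report the current type; they do not change the data.
is.numeric(raw_health)    # FALSE: the vector is character
is.character(raw_health)  # TRUE

# as.numeric() coerces each element. Entries that cannot be converted
# become NA; the warning is suppressed here because the NAs are expected.
health <- suppressWarnings(as.numeric(raw_health))

is.numeric(health)        # TRUE
which(is.na(health))      # positions of the entries that failed conversion
```

Counting or locating the NA values after conversion is a quick way to see how many cells did not match the expected type.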

Data format checks and conversions

In addition to the type issue discussed above, we may also have dirty data, meaning that the ranges or values may not be right. This is not just an issue of a quick type check but requires more involved checking on ranges, given the measurement type and actual variable. This also requires some knowledge from designers about the ranges for the different variables represented in our data.

Categorical data

For categorical data, the problem can be as simple as a value that does not belong to any valid category. When we have categorical data that we want to constrain to a specific list of categories, we can enforce that constraint. In R, we use factors to represent this type, and we can restrict a factor to a specific set of categories (its levels). If a variable contains a value that isn’t one of the allowed categories, an NA is generated in that cell.
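The following sketch shows one way to enforce a category list with factor and its levels argument. The weapon names and the valid_weapons list are hypothetical and only serve as an example.

```r
# Hypothetical list of valid categories (an assumption for illustration).
valid_weapons <- c("sword", "bow", "staff")

# Raw values as they might appear in a log; "laser" and "Sword" are invalid.
raw_weapon <- c("sword", "bow", "laser", "staff", "Sword")

# Values that are not among the given levels are converted to NA.
weapon <- factor(raw_weapon, levels = valid_weapons)

is.na(weapon)                   # TRUE for "laser" and "Sword"
table(weapon, useNA = "ifany")  # counts per category, including NA
```

Note that matching against levels is case-sensitive, so "Sword" is treated as invalid unless the raw values are normalized first (for example, with tolower).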

Numeric data

For numeric data, we need to write manual checks on values, based on logical constraints or on the ranges that designers specify for each variable. For example, health cannot be negative and cannot go above 100. As with categorical variables, if a value is not valid, an NA is introduced in that cell, and the process continues.
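A range check of this kind might look like the sketch below. The bounds of 0 and 100 follow the health example above; the vector itself and the names min_health and max_health are made up for illustration.

```r
# Hypothetical health values, including out-of-range and missing entries.
health <- c(100, 87.5, -5, 120, NA, 42)

# Assumed bounds; in practice these come from the game's designers.
min_health <- 0
max_health <- 100

# Flag values outside the allowed range, leaving existing NAs untouched.
out_of_range <- !is.na(health) & (health < min_health | health > max_health)

# Replace invalid values with NA so later steps can handle them uniformly.
health[out_of_range] <- NA
health
```

Guarding the comparison with !is.na keeps the logical index free of NA values, which makes the assignment safe even when the column already contains missing entries.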

Time stamps

Timestamps can be represented in several ways. One way is to store simulation time, as discussed above. This is, in essence, how it is represented in VPAL. However, most other games use standard time. The following lab shows examples of reading time and date values into a POSIXlt object. As with other type conversions, an NA is used if the value cannot be converted.
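As a small illustration of reading standard time, the sketch below converts a vector of text timestamps with as.POSIXlt. The timestamp strings, the format string, and the UTC time zone are assumptions chosen for the example.

```r
# Hypothetical timestamps as text; the last two are not valid date-times.
raw_time <- c("2021-03-14 09:26:53",
              "2021-03-14 09:27:05",
              "not a time",
              "2021-13-01 00:00:00")  # month 13 does not exist

# With an explicit format, entries that do not match are converted to NA.
ts <- as.POSIXlt(raw_time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

is.na(ts)  # TRUE for the entries that could not be converted

# Valid timestamps can be compared, e.g. to check the gap between two rows.
gap_secs <- as.numeric(difftime(ts[2], ts[1], units = "secs"))
gap_secs
```

Supplying the format explicitly avoids relying on R guessing the layout of the strings, which can fail with an error instead of producing NA.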

Once the data is checked for type and for consistency in terms of its range and values, the data is deemed technically correct and ready to be passed on to the next step.

