Filtering Datasets

Learn to analyze data subsets using filters in the tidyverse.


In data science work, we often need to filter or subset data. Frequently, we’ll want to analyze only the part of a dataset that meets some condition we can check within the dataset itself. For example, in a student dataset, we might want to look at average grades only for students in a particular year or a specific course. In that case, we need to filter the data so that only the relevant records remain.
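As a minimal sketch of the scenario described above, the code below filters a small table to one course and then averages the remaining grades. The tibble VAR_SampleGrades and its values are hypothetical, invented purely for illustration; they are not taken from the lesson's CSV files.

```r
# Load dplyr for filter(), summarise(), and the %>% pipe; tibble for tibble()
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical sample data (assumption: invented for illustration only)
VAR_SampleGrades <- tibble(
  StudentID = c(1, 2, 3, 1, 2),
  CourseID  = c("MATH101", "MATH101", "MATH101", "HIST200", "HIST200"),
  Grade     = c(80, 90, 70, 60, 85)
)

# Keep only the MATH101 records, then average the grades that remain
VAR_Math101Average <- VAR_SampleGrades %>%
  filter(CourseID == "MATH101") %>%
  summarise(AverageGrade = mean(Grade))

VAR_Math101Average
```

Because the filter runs before the summary, the HIST200 rows never contribute to the average.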

Using filter

Filters in the tidyverse are applied similarly to group_by statements: we pipe a dataset into the function. In the example below, we use filter to subset student grade data contained in the attached CSV files. The file StudentInformation.csv contains general information about students, while the file GradeData-ByCourse.csv contains the students’ grades (Grade) for each course (CourseID).

# Load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(stringr)
library(readr)
library(forcats)
# Load datasets directly to tibbles
VAR_StudentData <- read_csv("StudentInformation.csv",
                            col_names = TRUE,
                            skip = 0,
                            n_max = Inf,
                            show_col_types = FALSE)
VAR_GradeDataByCourse <- read_csv("GradeData-ByCourse.csv",
                                  col_names = TRUE,
                                  skip = 0,
                                  n_max = Inf,
                                  show_col_types = FALSE)
# Join the two data sets
VAR_CombinedStudentData <- VAR_StudentData %>%
  full_join(VAR_GradeDataByCourse,
            by = "StudentID", multiple = "all")
# Filter the combined data set to the MATH101 course
VAR_CombinedDataMath101 <- VAR_CombinedStudentData %>%
  filter(CourseID == "MATH101")
# Join and filter the data sets in a single command
VAR_CombinedDataMath101Piped <- VAR_StudentData %>%
  full_join(VAR_GradeDataByCourse,
            by = "StudentID", multiple = "all") %>%
  filter(CourseID == "MATH101")
# Output results
paste0("Multi-step combination of data")
VAR_CombinedDataMath101
paste0("Single step combination using pipe")
VAR_CombinedDataMath101Piped
  • Lines 25–31: We add a single line here, the filter command. ...
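To check that the multi-step and single-step versions produce the same result, a small self-contained sketch can help. The tibbles VAR_SampleStudents and VAR_SampleGrades below are hypothetical stand-ins for the CSV files, with invented values. The multiple argument is left out here on the assumption that this small one-to-many join does not need it.

```r
# Load dplyr for full_join(), filter(), and the %>% pipe; tibble for tibble()
suppressPackageStartupMessages(library(dplyr))
library(tibble)

# Hypothetical sample data (assumption: invented for illustration only)
VAR_SampleStudents <- tibble(
  StudentID = c(1, 2, 3),
  Name      = c("Ana", "Ben", "Cy")
)
VAR_SampleGrades <- tibble(
  StudentID = c(1, 1, 2),
  CourseID  = c("MATH101", "HIST200", "MATH101"),
  Grade     = c(80, 60, 90)
)

# Multi-step: join first, store the result, then filter it
VAR_SampleJoined <- VAR_SampleStudents %>%
  full_join(VAR_SampleGrades, by = "StudentID")
VAR_SampleMultiStep <- VAR_SampleJoined %>%
  filter(CourseID == "MATH101")

# Single step: join and filter in one piped command
VAR_SamplePiped <- VAR_SampleStudents %>%
  full_join(VAR_SampleGrades, by = "StudentID") %>%
  filter(CourseID == "MATH101")

# Compare the two results and the row counts before and after filtering
identical(VAR_SampleMultiStep, VAR_SamplePiped)
nrow(VAR_SampleJoined)
nrow(VAR_SamplePiped)
```

In this sample, student 3 has no grade records, so the full join gives that student a row with a missing CourseID. filter keeps only rows where the condition is TRUE, so that row is dropped along with the HIST200 row, leaving two of the four joined rows.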