Binary Data and The Wells Dataset

Let’s get a brief overview of binary data and the wells dataset.

R packages

We’ll use the following R packages in this chapter:

  • ggplot2
  • arm
  • ggfortify
  • Sleuth3
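Before going further, it can help to confirm that these packages are installed. The snippet below is a minimal sketch that assumes they come from CRAN; the variable names `pkgs` and `missing_pkgs` are ours, not part of the lesson.

```r
# Packages used in this chapter
pkgs <- c("ggplot2", "arm", "ggfortify", "Sleuth3")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Assumption: the packages are installed from CRAN.
  # install.packages(missing_pkgs)
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```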

Binary data

One of the most important uses of GLMs is the analysis of binary data. Binary data are an extreme form of binomial count data in which the binomial denominator equals one, so every trial produces a value of either 0 or 1. Binary data can therefore be analyzed in much the same way as binomial counts: we can use a GLM with a binomial distribution and the same choice of link functions, which keep predictions from falling below zero or rising above one. However, despite the shared distribution and link functions, the constrained nature of binary data means that the analysis differs in some respects from that of binomial counts.
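To make this concrete, here is a minimal sketch of a binary GLM fitted to simulated data. The data, the coefficients 0.5 and 1.2, and the object names are invented for illustration and have nothing to do with the wells dataset. The point is that predictions on the probability scale stay between zero and one, even far outside the observed range of the predictor.

```r
# Simulated binary data (hypothetical, not the wells dataset)
set.seed(42)
n <- 200
x <- runif(n, min = -3, max = 3)
p_true <- plogis(0.5 + 1.2 * x)          # inverse logit gives a probability
y <- rbinom(n, size = 1, prob = p_true)  # denominator of one: each value is 0 or 1

# Binary GLM: binomial distribution with a logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))

# Predictions on the response (probability) scale, including
# values of x well beyond the simulated range
new_x <- data.frame(x = c(-10, 0, 10))
p_hat <- predict(fit, newdata = new_x, type = "response")
print(p_hat)

# The fitted probabilities for the observed data are bounded too
print(range(fitted(fit)))
```

Other link functions that R accepts for the binomial family, such as `"probit"` or `"cloglog"`, can be swapped in through the `link` argument.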

For one thing, the ratio of the residual deviance to the residual degrees of freedom can’t be used to diagnose overdispersion or underdispersion. Given that R’s default set of residual-checking plots is also of little (if any) use when applied to a GLM for binary data, this leaves us without any means of model checking in base R. Luckily, the arm package (Gelman, A. & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006) provides a graphical approach with the binnedplot() function.
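As a rough illustration of the binned approach, the sketch below fits a binary GLM to simulated data (again invented for illustration) and passes the fitted probabilities and response residuals to binnedplot(). It assumes the arm package is installed, and it prints a message instead of plotting if it is not.

```r
# Simulated binary data (hypothetical, not the wells dataset)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.8 * x))

fit <- glm(y ~ x, family = binomial)

# This ratio can be computed, but for 0/1 data it is not a useful
# diagnostic of over- or underdispersion
print(deviance(fit) / df.residual(fit))

p_hat <- fitted(fit)                      # fitted probabilities
res <- residuals(fit, type = "response")  # observed minus fitted

# Binned residual plot: average residual within bins of fitted values
if (requireNamespace("arm", quietly = TRUE)) {
  arm::binnedplot(x = p_hat, y = res)
} else {
  message("The arm package is not installed, so the plot is skipped.")
}
```

Averaging the residuals within bins is what makes the plot readable: the raw residuals of a binary model can only take two values at any given fitted probability, so an unbinned scatter is hard to interpret.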

An example of the wells dataset

Our example dataset for a binary GLM comes from an environmental science analysis (Gelman & Hill 2006). The data aren’t included in the arm package, but they’re available in the file Data_Binary_Wells.txt, which the following code reads in:

wells <- read.table("Data_Binary_Wells.txt", header = TRUE)
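After reading in the data, a quick look at its structure is a sensible first step. The sketch below assumes that Data_Binary_Wells.txt sits in the working directory, and it makes no assumptions about the column names. The variable `path` is ours.

```r
# Assumption: the file is in the current working directory
path <- "Data_Binary_Wells.txt"

if (file.exists(path)) {
  wells <- read.table(path, header = TRUE)
  str(wells)             # column names and types
  print(head(wells))     # first six rows
  print(summary(wells))  # per-column summaries
} else {
  message("File not found: ", path)
}
```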

The example concerns an area of Bangladesh where many wells used for drinking water have been contaminated by naturally occurring arsenic.
