Introduction to GLMs for Count Data

Let’s get a brief overview of GLMs for count data.

R packages

We’ll use the following R packages in this section:

  • ggplot2
  • arm

Introduction

Count data are integers (whole numbers): for example, numbers of individuals, numbers of species, or numbers of times an event occurred. The starting point for a GLM analysis of count data is the Poisson distribution with a log link function. The log link ensures that all predicted counts are positive, because predictions are obtained by taking the exponential of the values generated by the linear predictor, and antilogs are always positive. For the Poisson distribution, the variance is equal to the mean. As always, this assumption needs to be examined, since count data won’t necessarily have this property. Count data for which the Poisson distribution provides a good model usually have many zeros and small values. Having count data does not mean that the Poisson distribution will necessarily provide a good model. Indeed, as the mean of the Poisson distribution increases, the distribution converges towards the normal distribution.
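The properties described above can be checked with a small simulation. This is only a sketch: the means (2 and 50), the sample size, and the object names are arbitrary choices for illustration and have nothing to do with the example data used later.

```r
# Simulated illustration only; these values are not from the course data.
set.seed(1)

# Draw Poisson counts with a small mean and with a large mean.
small_mean <- rpois(10000, lambda = 2)
large_mean <- rpois(10000, lambda = 50)

# For a Poisson variable, the sample variance should be close to the sample mean.
c(mean = mean(small_mean), variance = var(small_mean))
c(mean = mean(large_mean), variance = var(large_mean))

# With a small mean, zeros are common (the theoretical proportion is exp(-2)).
mean(small_mean == 0)

# The log link: any value of the linear predictor maps back to a positive mean.
linear_predictor <- c(-3, 0, 3)
exp(linear_predictor)
```

Plotting `hist(small_mean)` next to `hist(large_mean)` should show a right-skewed distribution piled up near zero in the first case and a roughly symmetric, bell-shaped one in the second.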

GLMs for count data

The example data are counts of grassland plant species in relation to levels of nitrogen deposition. The data were kindly contributed by Carly Stevens. Increasing nutrient inputs to grasslands usually results in a decline in their diversity. The factorial example is a controlled experiment that looks at the change in diversity following nutrient enrichment. Is the same true in surveys of grassland diversity in relation to the level of nitrogen pollution the grasslands receive? The data are in a file called Data_species_counts.txt:

Species <- read.table("Data_species_counts.txt", header = TRUE)

The data have just two columns, giving the level of nitrogen deposition (N_deposition) and the counts of the numbers of grassland plant species (Species_counts):

str(Species)

 ## 'data.frame': 74 obs. of 2 variables:
 ## $ N_deposition : num 8.56 7.7 8.28 8.14 10.99 ...
 ## $ Species_counts: int 20 17 25 18 20 10 13 14 15 15 ...

The N_deposition data are continuous and the Species_counts data are integers.

summary(Species)

 ##   N_deposition   Species_counts
 ##  Min.   : 7.70   Min.   : 6.00
 ##  1st Qu.:14.26   1st Qu.:10.00
 ##  Median :20.25   Median :13.00
 ##  Mean   :20.58   Mean   :13.91
 ##  3rd Qu.:27.11   3rd Qu.:15.00
 ##  Max.   :40.86   Max.   :27.00

A graph of Species_counts versus N_deposition shows a clear negative relationship, so we may be tempted to model it with a simple linear regression.
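A minimal sketch of that graph and the tempting straight-line fit might look like the following. Because Data_species_counts.txt is not reproduced here, the code uses simulated stand-in data with the same column names; the object name Species_sim and the coefficients used to generate it are assumptions made purely for illustration, not values taken from the real survey. The Poisson GLM with a log link is fitted alongside for comparison.

```r
library(ggplot2)

# Stand-in data: simulated, NOT the values in Data_species_counts.txt.
set.seed(42)
N_deposition <- runif(74, min = 8, max = 40)
Species_sim <- data.frame(
  N_deposition = N_deposition,
  Species_counts = rpois(74, lambda = exp(3.2 - 0.03 * N_deposition))
)

# Scatterplot with a fitted straight line (ordinary linear regression).
ggplot(Species_sim, aes(x = N_deposition, y = Species_counts)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The same straight-line fit as a model object.
linear_fit <- lm(Species_counts ~ N_deposition, data = Species_sim)
coef(linear_fit)

# The Poisson GLM with a log link, for comparison.
poisson_fit <- glm(Species_counts ~ N_deposition,
                   family = poisson(link = "log"), data = Species_sim)
coef(poisson_fit)
```

With the real data, Species_sim would be replaced by the Species data frame read in earlier. Note that the GLM coefficients are on the log scale, so they are not directly comparable with the slope of the straight line.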
