Generalized Linear Models

Learn about generalized linear models, also known as GLMs.

Understanding non-normal data

Normal data is data where we have a relatively even spread of values above and below the mean. These values can be any actual number, positive or negative.

Most values are close to the mean, with the probability of obtaining values decreasing as we get farther away from the mean. However, in the real world, data is often not normally distributed. In these cases, we can turn to Generalized Linear Models (GLMs), which extend the modeling framework of lm() to many other error structures. “error” here refers to the way the data is spread around the mean. The primary function for running a GLM is the glm() function.

Making non-normal data normal

GLMs work via a link function, which transforms the data to a normal scale. For example, with a binomial GLM (also called a logistic regression), we aren’t modeling the data (usually a 0 or 1). Instead, we’re modeling the log odds of an event happening (getting a 1) or not happening (getting a 0). In this case, the log odds of the event occurring are normally distributed. Some of the built-in error families available with the glm() function are as follows:

  • binomial(link = “logit”)
  • gaussian(link = “identity”)
  • poisson(link = “log”)
  • Gamma(link = “inverse”)
  • inverse.gaussian(link = “1/mu^2”)
  • quasi(link = “identity”, variance = “constant”)
  • quasibinomial(link = “logit”)
  • quasipoisson(link = “log”)

Next to the previous error families, the link functions are the defaults for each error family. They’re automatically assumed, so we don’t need to write them in our model specification. Several families have multiple possible link functions, some of which may be preferred in certain circumstances, so knowing all the different possible options is a good idea. That said, we have never needed to use a link function other than the default, so we probably won’t need them here either.

Understanding error families

So, what do these different error families mean? What does the data look like, and how do we know which error function to use? Here are some basic rules for the most commonly used error distributions, including one that’s very useful and not on the previous list:

  • Binomial: This is general data that falls into one of two categories. For example, animals living or dead, eggs hatched or unhatched, and frogs that are males or females. R is pretty clever, though, and we can also use a binomial error distribution to model proportional data, so long as the proportions are bounded at 0 and 1. We’ll discuss an example of this later in the chapter.

  • Gaussian: This is just a fancy statistics way of saying “normal distribution,” named for Johann Carl Friedrich Gauss. Thus, we can use the glm() function to run the same linear models we made in the Introduction to Plotting and Basic Statistical Analysis lessons if we want.

  • Poisson: Named after the French mathematician Siméon Denis Poisson, this data count of observations tends to have a tail out to the right. Since it’s generally a count of things, it can’t be negative and can only be whole integers. One of the defining features of a Poisson distribution is that the data variance is approximately the same as the mean. Also, R will let us use a Poisson distribution with non-integer values, but it’ll give us a warning message.

  • Negative binomial: This might sound like it’s related to the binomial error distribution, and it is, but it’s easier to think of this as an “overdispersed Poisson.” If a Poisson dataset is one where the mean and variance are the same, a negative binomial dataset is one where the variance is considerably larger than the mean. In other words, the tail of the data out to the right is much longer. There are more extreme values.

    Like a Poisson, the data can’t be negative and should only consist of integers, although R will once again let us use non-integer data. Note that the negative binomial distribution has its particular function called glm.nb().

  • Gamma: A Gamma distribution is similar in shape to a negative binomial but is continuous data—that is, not just integers.

  • Lastly, we can ignore the last three distributions mentioned under Making non-normal data normal above, which are quasi, quasibinomial, and quasipoisson. We’ve included them here because we’d see them if we looked at the help file for the glm() function, but no one uses them anymore. The quasi models were shortcuts for modeling overdispersed data implemented in the 2000s, but more advanced and statistically appropriate techniques have been developed since, rendering the quasi models moot. We can also ignore the inverse Gaussian distribution for the most part, at least in our experience.

Several other specific functions have been developed for certain error structures not incorporated in the original glm() function. These include the betabin() function in the aod package that’s used for analyzing beta-binomial data and the betareg() function in the betareg package that’s used for a beta distribution of proportional data.

GLMs

Let’s create two new variables: the number of tadpoles that survived metamorphosis and the number that died before metamorphosis. We can once again use the dplyr package to accomplish this. However, instead of using summarize() on line 3 to calculate the mean() of each group of data, we can use the length() function to calculate how many of each group we have.

In the next section, we’ve calculated the length, which is the number of values in a vector, of the summarized Ind variable. We could use any of the variables there since the group_by() function on line 2 essentially subsets the data by each tank. Since we’re taking the length of the vectors of how many tadpoles are in each tank at metamorphosis, any of the variables should give us the same answer.

First, let’s load all the packages we’ll need:

Press + to interact
library(dplyr)
library(ggplot2)
library(car)
library(MASS)
Press + to interact
temp <- RxP.clean %>%
group_by(Tank.Unique) %>%
summarize(N.alive = length(Ind))
temp

Tank.Unique is the unique number of each tank, and N.alive is the total number of metamorphs from each tank. Now that these numbers are calculated, we add this column to our existing RxP.byTank data frame. There are two ways to do this. First, if we can assume that our rows in the two objects are in the exact same order, we could create a new column in RxP.byTank:

Press + to interact
# When you call a column that does not exist,
# R will make it
RxP.byTank$N.alive<-temp$N.alive

Alternatively, we could modify our code from the Introduction to Plotting chapter when we made the RxP.byTank data frame to include the command to make the new N.alive variable. Since we should still have this code in a script somewhere, it’s simple to modify it as follows:

Press + to interact
RxP.byTank<-RxP.clean %>%
group_by(Tank.Unique, Pred, Res, Hatch, Block) %>%
summarize(Age.DPO = mean(Age.DPO),
Age.FromEmergence = mean(Age.FromEmergence),
SVL.initial = mean(SVL.initial),
Tail.initial = mean(Tail.initial),
SVL.final = mean(SVL.final),
Mass.final = mean(Mass.final),
Resorb.days = mean(Resorb.days),
N.alive = length(Ind))##This line is new!!
# Remember to reorder the Pred factor
RxP.byTank$Pred<-factor(RxP.byTank$Pred, levels=c("C","NL","L"))

It might also be helpful to calculate the number of tadpoles that died or were eaten before metamorphosis. Since we started with 50 tadpoles per tank, this is very simple to calculate. We’ve introduced a new function called glimpse() in the following code on line 2. It’s essentially the dplyr version of str(). If we use the str() function on a tibble, we often get a bunch of jargon at the bottom. We find that mess annoying, so the glimpse() function offers a cleaner way to look at our data frame when it’s a tibble.

Press + to interact
RxP.byTank$N.dead<-50-RxP.byTank$N.alive
glimpse(RxP.byTank)

We have two new variables representing the number of red-eyed tree frog tadpoles in each tank that survived metamorphosis or died before getting there. Let’s look at the distribution of the mortality data. Note that in the following histogram, we’ve specified how many bins the data should be placed in for plotting:

Press + to interact
qplot(data=RxP.byTank,
x=N.dead,
geom="histogram",
bins=10)

This shows us that most of the values are relatively small, which means low mortality, but some tanks had very high mortality. There’s a long tail of data out to the right. This data is count data (whole integers). Therefore, it can’t be negative, and it fits the general shape of what we expect Poisson or negative binomial data to look like.

How do we decide what error distribution fits best? Recall from an earlier chapter that the fitdistr() function from the MASS package allows us to evaluate the fit of different error distributions to a given set of data. We can use that here to see which error distribution works best for this data. Note that we need a little preliminary knowledge to know which distributions we might want to choose from. Don’t worry; that knowledge will come in time. Here, let’s evaluate four distributions:

  • Normal
  • Lognormal
  • Poisson
  • Negative binomial

We can see all possible distributions in the help file for the fitdistr() function:

Press + to interact
#Create 4 objects for evaluating the best
#error distribution for the N.dead variable
fit1<-fitdistr(RxP.byTank$N.dead, "normal")
fit2<-fitdistr(RxP.byTank$N.dead, "lognormal")
fit3<-fitdistr(RxP.byTank$N.dead, "Poisson")
fit4<-fitdistr(RxP.byTank$N.dead, "negative binomial")
#Use AIC() to compare the fit to each distribution
AIC(fit1,fit2,fit3,fit4)

Looking at the Akaike information criteria (AIC) scores, we can see that the Poisson is by far the worst fit (recall that smaller AIC scores are better) and that the lognormal and negative binomial are pretty similar.