## Transcribed Text

Generalized linear models allow for a number of specific error distributions other
than a normal distribution. In this way, they represent a flexible generalization of general
linear models. The purpose of this lab is to learn how to apply a generalized linear model
that assumes a Poisson distribution of error terms, which is often the most appropriate
model for count data. For this lab, we will revisit James Meidell’s Northern Alligator
Lizard pitfall trap data from last week. Remember that James created 17 pitfall trap
arrays to assess relative density of Northern Alligator Lizards on the east and west side of
a section of I-90. He collected data on 19 explanatory habitat variables at each array of
pitfall traps. He is interested in seeing what habitat variables best predict relative lizard
density. As a reminder, the variables are summarized in Table 1 below.
Table 1. Variables in Northern Alligator Lizard Dataset

| Variable name | Description | Type |
| --- | --- | --- |
| lizards | Number of lizards caught in each pitfall trap array | Discrete numerical response variable |
| location | East or west side of I-90 | Categorical explanatory variable |
| 11_30cm | Average percent of ground covered by rocks 11-30 cm in diameter | Continuous explanatory variable |
| 31_60cm | Average percent of ground covered by rocks 31-60 cm in diameter | Continuous explanatory variable |
| totalrock | Average percent of ground covered by rocks | Continuous explanatory variable |
| woodebris | Average percent of ground covered by woody debris | Continuous explanatory variable |
| leafdebris | Average percent of ground covered by leafy debris | Continuous explanatory variable |
| seedling | Average percent of canopy cover from seedlings | Continuous explanatory variable |
| sapling | Average percent of canopy cover from saplings | Continuous explanatory variable |
| maturetrees | Average percent of canopy cover from mature trees | Continuous explanatory variable |
Your task today is to select an appropriate generalized linear model (GLM) to analyze the
alligator lizard dataset and interpret the output. It will be interesting to compare your
output to the results from Lab 6. In the instructions below, I point out the differences
between the general linear model approach we took last week and the generalized linear
model we will apply today.
1. Data exploration
Last week, all of you noticed the highly skewed distribution of many of your predictor
variables and decided to use a log transformation. This is still a good idea. Copy your
code from last week, or else use the code below, to 1) log10+1 transform skewed predictor
variables, 2) create a new column containing the squared values of woodebris, 3) log10+1
transform the response variable for visualization, and 4) bind all of these together with
the untransformed predictor variables that were not skewed into a matrix called Z. Name
the columns of this matrix and convert it into a data frame. Note that I have placed the
response variable as the last column. This makes it easier to interpret any pairplots, since
the response variable will appear on the y-axis, as we are used to seeing it:
Data<-read.table("Jamesdata.txt",header=TRUE)
#transform predictor variables that need it
Data$L.X11_30cm <- log10(Data$X11_30cm+1)
Data$L.X31_60cm <- log10(Data$X31_60cm+1)
Data$L.totalrock <- log10(Data$totalrock+1)
Data$L.leafdebris <- log10(Data$leafdebris+1)
Data$L.seedling <- log10(Data$seedling+1)
Data$L.sapling <- log10(Data$sapling+1)
Data$L.snag_seed <- log10(Data$snag_seed+1)
#prepare to try a quadratic regression with woodebris
Data$woodebris2 <- (Data$woodebris)^2
#log transform the response variable for visualization only
Data$L.lizards <- log10(Data$lizards+1)
#bind the transformed and untransformed variables into a matrix Z
Z<-cbind(Data$location, Data$L.X11_30cm, Data$L.X31_60cm,
Data$L.totalrock, Data$woodebris, Data$woodebris2,
Data$L.leafdebris, Data$L.seedling, Data$L.sapling,
Data$maturetrees, Data$L.lizards, Data$lizards)
colnames(Z)<-c("Location", "L.X11_30cm", "L.X31_60cm",
"L.totalrock", "woodebris","woodebris2","L.leafdebris",
"L.seedling", "L.sapling",
"maturetrees","L.lizards","lizards")
DataT <- as.data.frame(Z)
Now take a look at the data, just as you did last week, but this time keep the Poisson
distribution in mind:
source("AEDgraphingfunctions.R")
pairs(Z, lower.panel=panel.smooth2,upper.panel=panel.cor,
diag.panel=panel.hist)
First, as many of you noticed last week, there is a high degree of collinearity among all
of the rock variables. Let's reduce that to some degree by eliminating totalrock from the
analysis. This will take out the two highest correlations without losing much
information, since the two different rock size categories are maintained. Now scan the
bottom row of plots. Note the classic Poisson distribution apparent in the plot of
maturetrees vs. lizards, as well as the general tendency of the variance in lizards to scale
with the mean. This indicates that a Poisson-distributed error structure is likely to be
appropriate. Also scan the second to bottom row of plots and note how several of the
stronger relationships are linearized to some extent by the log transformation of lizards.
This indicates that the default log link function of Poisson regression is likely to be
appropriate as well. Because the Poisson GLM builds the log link and the mean-variance
relationship directly into the model, we no longer need to worry about nonlinearity,
heteroscedasticity, or non-normality of the raw residuals.
Finally, note the apparently quadratic relationship between woodebris and L.lizards. This
indicates that it may be appropriate to include the woodebris^2 term in your model.
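To see why count data behave this way, here is a small simulation (illustrative only, not the lizard data) in which the mean of a Poisson count depends log-linearly on a predictor. The variance tracks the mean, and a log transform of the counts roughly linearizes the trend, just as in the pairplot:

```r
# Sketch: Poisson counts whose mean depends log-linearly on a predictor.
set.seed(42)
x  <- runif(200, 0, 10)
mu <- exp(0.2 + 0.25 * x)   # log link: log(mu) is linear in x
y  <- rpois(200, mu)

# Variance scales with the mean: compare the low-x and high-x halves
low  <- y[x < 5]
high <- y[x >= 5]
c(mean(low), var(low))      # smaller mean, smaller variance
c(mean(high), var(high))    # larger mean, larger variance

# Log-transforming the counts roughly linearizes the relationship
par(mfrow = c(1, 2))
plot(x, y, main = "raw counts")
plot(x, log10(y + 1), main = "log10(y + 1)")
```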
2. Model selection
The lm function we have been using for general linear models cannot handle generalized
linear models. Instead, we will need to use the glm function, which is part of the stats
package. Calls to this function look very similar to those to lm, except that you have the
option of specifying a specific non-normal distribution of error terms. Start by fitting a
model that predicts raw number of lizards as a function of all explanatory variables
except for totalrock. Do not include woodebris2 yet. Note that adding the family =
poisson term is all you need to do to switch from the default Gaussian distribution of
error terms to a Poisson distribution:
fit0.P <- glm(lizards ~ Location + L.X11_30cm + L.X31_60cm +
woodebris + L.leafdebris + L.seedling + L.sapling +
maturetrees, family=poisson, data=DataT)
Last week taught us that model selection is tricky in this case, likely due to collinearity,
as well as the lack of strong univariate relationships between each explanatory variable
and the response variable. Let’s use a combination of forward and backwards stepwise
regression for the most thorough model selection possible. For this, you will need to
install and load the MASS package. You can then use the following function:
step <- stepAIC(fit0.P, direction="both")
Note that, at each step, variables currently in the model are considered for removal
one by one, and previously excluded variables are considered for re-entry one by one.
The single change that most lowers the AIC score is made, and the process repeats until
no change improves the AIC.
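As a quick illustration of what stepAIC does (on simulated data, not the lizard dataset), a pure-noise predictor is typically eliminated while an informative one is retained:

```r
# Sketch: stepAIC on a Poisson GLM with one real and one noise predictor.
library(MASS)
set.seed(3)
x1 <- runif(100)                      # informative predictor
x2 <- runif(100)                      # pure noise
y  <- rpois(100, exp(1 + 1.5 * x1))

full <- glm(y ~ x1 + x2, family = poisson)
step <- stepAIC(full, direction = "both", trace = FALSE)
formula(step)   # x1 is retained; the noise variable x2 is typically dropped
```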
3. Model validation
Fit a new model that includes only the variables identified by your stepwise model
selection:
fit1.P <- glm(lizards ~ L.X11_30cm + L.X31_60cm + maturetrees,
family=poisson, data=DataT)
Now validate this model by examining the residuals:
par(mfrow=c(2,2))
plot(fit1.P)
These are not horrendous, but the bow shape in the left two plots is reminiscent of the
nonlinear relationship we noted between woodebris and lizards. Let’s add the quadratic
term into our model, repeat the selection procedure, and see if it is retained and improves
the residuals:
fit2.P <- glm(lizards ~ Location + L.X11_30cm + L.X31_60cm +
woodebris + woodebris2 + L.leafdebris + L.seedling + L.sapling
+ maturetrees, family=poisson, data=DataT)
step <- stepAIC(fit2.P, direction="both")
This does, indeed, yield a different model that includes the quadratic term. Validate the
new model by looking at the error terms:
fit3.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
family = poisson, data=DataT)
par(mfrow=c(2,2))
plot(fit3.P)
The bowed shape in the residuals is now gone, which is good. The raw residuals show
some heteroscedasticity, with the variance increasing with the fitted value, but this is to
be expected with a Poisson distribution and should not therefore cause alarm. We also
don’t necessarily expect the residuals to be normally distributed, so we can pretty much
ignore the Q-Q plot. The plot of residuals vs leverage identifies one potentially
problematic data point, in row number 13. In the interpretation phase, we should
probably try analyzing the data with and without point 13 in order to assess the
robustness of our conclusions with respect to this one point. Before moving on to
interpretation, however, we should also try plotting the residuals vs. any other
explanatory variable (both those included and excluded from the model) to assess them
for any other patterns, which might indicate the need to include additional variables or
interaction terms in the model. You can use code such as this to accomplish that task:
par(mfrow=c(3,2))
E <- rstandard(fit3.P)
plot(y = E, x = DataT$L.X11_30cm, xlab = "L.X11_30cm", ylab =
"Residuals")
abline(0,0)
Finally, it is important to check a Poisson regression for evidence of overdispersion (a
variance greater than the mean). In order to do this, fit your model using a quasipoisson
distribution and use the summary function to look at the output:
fit4.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
family = quasipoisson, data=DataT)
summary(fit4.P)
Near the bottom of the output, you will see a line that says “Dispersion parameter for
quasipoisson family taken to be …”. This is the estimated dispersion parameter, which
describes the mean-variance relationship of your residuals. Specifically, the dispersion
parameter ρ is defined such that, if E[Yi] = μi, the variance of Yi is modeled as ρμi. A
perfect Poisson distribution will have a dispersion parameter of 1 (meaning that the mean
= the variance). A value of ρ greater than 1 indicates overdispersion. If you use a
"quasipoisson" model, the standard errors of your estimates will be multiplied by the
square root of ρ. Overdispersion thus decreases precision and power, and wrongly
ignoring overdispersion will bias your p-values downward, leading to inflated type I error
rates. Overdispersion can be caused by model misspecification, and this possibility
should be investigated first (see page 87 of your text). If ρ > 1, it is also possible that
your data are truly overdispersed. As a rule of thumb, a ρ larger than 1.5 means that
some action (such as using a "quasipoisson" model) should be taken. A ρ larger than 15
or 20 indicates that you should also consider other methods, such as the negative
binomial distribution or zero-inflated models (Zuur et al. 2009). In this case, the value of
1.06 revealed by the summary function indicates that there is no strong evidence for
overdispersion, so it is safe to go back to using a poisson distribution of error terms with
no correction.
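For reference, the dispersion estimate that summary() reports for a quasipoisson fit can be reproduced by hand from the Pearson residuals. A minimal sketch on simulated Poisson data (not the lizard dataset), where the estimate should land near 1:

```r
# Sketch: estimate the dispersion parameter by hand from Pearson residuals.
set.seed(1)
x <- runif(300)
y <- rpois(300, exp(1 + 2 * x))       # true Poisson, so dispersion ~ 1

fit <- glm(y ~ x, family = poisson)
phi <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
phi   # close to 1 for well-behaved Poisson data

# A quasipoisson refit reports the same estimate in its summary
qfit <- glm(y ~ x, family = quasipoisson)
summary(qfit)$dispersion
```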
4. Model interpretation
Once you are happy that your model is valid, you may finally turn your attention to
hypothesis testing and interpretation. Once again, there are several different ways to
obtain p-values associated with either your overall model or your individual explanatory
variables. The summary command will give parameter estimates as well as p-values
from a z-test on each explanatory variable. (The z-test is used rather than a t-test because
the variance does not have to be estimated separately from the mean). Be careful in
interpreting the estimates of the slope due to the log-link function, however; you must
remember that the specified relationship is between the linear function of explanatory
variables and the mean of the log-transformed response variable:
summary(fit3.P)
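One way to make log-link coefficients concrete is to exponentiate them: exp(coefficient) is the multiplicative change in the expected count per one-unit increase in that predictor. A small sketch on simulated data (the variable names here are illustrative, not from the lizard dataset):

```r
# Sketch: with a log link, exp(coefficient) acts as a rate ratio.
set.seed(7)
cover  <- runif(200, 0, 10)
counts <- rpois(200, exp(0.5 + 0.1 * cover))

fit <- glm(counts ~ cover, family = poisson)
exp(coef(fit))  # each additional unit of 'cover' multiplies the expected
                # count by roughly exp(0.1), i.e. about a 10% increase
```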
The summary command also gives the null deviance and residual deviance values
(analogous to the total sum of squares and the residual sum of squares in a general linear
model). Because the difference in deviance values is approximately Chi-square
distributed, these values can be used to test the significance of the overall model,
although the nature of the p-value is approximate and should be interpreted with caution.
Perform a deviance test on the overall model using this line of code:
1 - pchisq(nulldeviance - residualdeviance, nulldf - residualdf)
For example, with these particular results, you can use the line of code:
1 - pchisq(52.936 - 17.055, 3)
A deviance test can also be used to provide p-values for each individual explanatory
variable. The deviance test is preferable to the z-test for small datasets (Zuur 2010).
Perform this deviance test using the following code:
drop1(modeloutput, test = "Chi")
Note that, if you used a "quasipoisson" distribution, you would need to use test = "F"
instead.
Because point number 13 was identified as being particularly influential, you can repeat
your model fit without that point and assess the effect:
fit5.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
family = poisson, data=DataT[-13,])
summary(fit5.P)
The results do not change substantially; all variables are still highly significant, and the
regression coefficients fall within 1 se of the original estimates. It is okay to forge ahead
with your original dataset.
The final step in model interpretation is visualization. One particularly handy command
gives partial regressions of each of the variables in your model (Review section 5.3 in
Zuur for an explanation of partial regressions). Try this in order to visualize the
relationship between each of your explanatory variables and your response variable:
termplot(modeloutput, partial.resid=TRUE, smooth=panel.smooth,
col.res=1)
Final visualization is particularly difficult in this case because your model contains
more than one explanatory variable. We can gain some insight by visualizing just the
quadratic relationship between woodebris and lizards, however. Spend a few minutes
unpacking the following lines of code. The first one simply creates a scatterplot between
woodebris and lizards. The second creates a vector of evenly spaced values that span the
width of values for woodebris. The third line implements the model fit by using the
coefficient estimates revealed by the summary command. It plots x-values (contained in
vector x) vs. the predicted number of lizards. Note that the number of lizards is predicted
by taking e to the power of the linear predictor. For the linear predictor, we are using an
intercept that depends on the intercept of the overall model, plus an offset associated with
the mean value of maturetrees.
plot(DataT$woodebris, DataT$lizards, xlab="Woody Debris", ylab =
"Number of Lizards")
x <- 3:35
lines(x, exp(-0.607286 + 0.330792*x - 0.008818*x^2 -
             0.618493*mean(DataT$maturetrees)))
Color code the points on the plot according to their respective value of maturetrees.
This would make an excellent final visualization.
Lab 7 Assignment
Spend some time exploring the effect of potential interaction terms to see if you can
further improve the fit of your model. In a Word document, include the summary output
of your best model (including the formula at the top), the drop1 output of this model,
and your best attempt at a final visualization.


#1. Data exploration

# 1) log10(x + 1) transform the skewed predictor variables
data <- read.table("Jamesdata.csv", header=TRUE, sep=",")
data$L.X11_30cm <- log10(data$X11_30cm+1)
data$L.X31_60cm <- log10(data$X31_60cm+1)
data$L.totalrock <- log10(data$totalrock+1)
data$L.leafdebris <- log10(data$leafdebris+1)
data$L.seedling <- log10(data$seedling+1)
data$L.sapling <- log10(data$sapling+1)
data$L.maturetrees <- log10(data$maturetrees+1)

# 2) prepare to try a quadratic regression with woodebris
data$woodebris2 <- (data$woodebris)^2

# 3) log transform the response variable for visualization only
data$L.lizards <- log10(data$lizards+1)

# 4) bind the transformed and untransformed variables into a matrix Z,
#    with the response variable in the last column (so it appears on the
#    y-axis of any pairplots)
Z <- cbind(data$location, data$L.X11_30cm, data$L.X31_60cm,
           data$L.totalrock, data$woodebris, data$woodebris2,
           data$L.leafdebris, data$L.seedling, data$L.sapling,
           data$maturetrees, data$L.lizards, data$lizards)
dim(Z)

# set names for these new variables
colnames(Z)<-c("Location",...