Question

Transcribed Text

Generalized linear models allow for a number of specific error distributions other than a normal distribution. In this way, they represent a flexible generalization of general linear models. The purpose of this lab is to learn how to apply a generalized linear model that assumes a Poisson distribution of error terms, which is often the most appropriate model for count data.

For this lab, we will revisit James Meidell's Northern Alligator Lizard pitfall trap data from last week. Remember that James created 17 pitfall trap arrays to assess relative density of Northern Alligator Lizards on the east and west sides of a section of I-90. He collected data on 19 explanatory habitat variables at each array of pitfall traps. He is interested in seeing which habitat variables best predict relative lizard density. As a reminder, the variables are summarized in Table 1 below.

Table 1. Variables in the Northern Alligator Lizard dataset

  Variable Name   Description                                                        Type
  lizards         Number of lizards caught in each pitfall trap array                Discrete numerical response variable
  location        East or west side of I-90                                          Categorical explanatory variable
  11_30cm         Average percent of ground covered by rocks 11-30 cm in diameter    Continuous explanatory variable
  31_60cm         Average percent of ground covered by rocks 31-60 cm in diameter    Continuous explanatory variable
  totalrock       Average percent of ground covered by rocks                         Continuous explanatory variable
  woodebris       Average percent of ground covered by woody debris                  Continuous explanatory variable
  leafdebris      Average percent of ground covered by leafy debris                  Continuous explanatory variable
  seedling        Average percent of canopy cover from seedlings                     Continuous explanatory variable
  sapling         Average percent of canopy cover from saplings                      Continuous explanatory variable
  maturetrees     Average percent of canopy cover from mature trees                  Continuous explanatory variable

Your task today is to select an appropriate generalized linear model (GLM) to analyze the alligator lizard dataset and interpret the output. It will be interesting to compare your output to the results from Lab 6. In the instructions below, I point out the differences between the general linear model approach we took last week and the generalized linear model we will apply today.

1. Data exploration

Last week, all of you noticed the highly skewed distribution of many of your predictor variables and decided to use a log transformation. This is still a good idea. Copy your code from last week, or else use the code below, to 1) log10+1 transform skewed predictor variables, 2) create a new column containing the squared values of woodebris, 3) log10+1 transform the response variable for visualization, and 4) bind all of these together with the untransformed predictor variables that were not skewed into a matrix called Z. Name the columns of this matrix and convert it into a data frame. Note that I have placed the response variable as the last column.
This makes it easier to interpret any pairplots, since the response variable will appear on the y-axis, as we are used to seeing it:

Data <- read.table("Jamesdata.txt", header=TRUE)

# transform the predictor variables that need it
Data$L.X11_30cm   <- log10(Data$X11_30cm+1)
Data$L.X31_60cm   <- log10(Data$X31_60cm+1)
Data$L.totalrock  <- log10(Data$totalrock+1)
Data$L.leafdebris <- log10(Data$leafdebris+1)
Data$L.seedling   <- log10(Data$seedling+1)
Data$L.sapling    <- log10(Data$sapling+1)
Data$L.snag_seed  <- log10(Data$snag_seed+1)

# prepare to try a quadratic regression with woodebris
Data$woodebris2 <- (Data$woodebris)^2

# log transform the response variable for visualization only
Data$L.lizards <- log10(Data$lizards+1)

# bind the transformed and untransformed variables into a matrix called Z,
# then convert it into a data frame called DataT
Z <- cbind(Data$location, Data$L.X11_30cm, Data$L.X31_60cm, Data$L.totalrock,
           Data$woodebris, Data$woodebris2, Data$L.leafdebris, Data$L.seedling,
           Data$L.sapling, Data$maturetrees, Data$L.lizards, Data$lizards)
colnames(Z) <- c("Location", "L.X11_30cm", "L.X31_60cm", "L.totalrock",
                 "woodebris", "woodebris2", "L.leafdebris", "L.seedling",
                 "L.sapling", "maturetrees", "L.lizards", "lizards")
DataT <- as.data.frame(Z)

Now take a look at the data, just as you did last week, but this time keep the Poisson distribution in mind:

source("AEDgraphingfunctions.R")
pairs(Z, lower.panel=panel.smooth2, upper.panel=panel.cor, diag.panel=panel.hist)

First, as many of you noticed last week, there is a high degree of collinearity among all of the rock variables. Let's reduce that to some degree by eliminating totalrock from the analysis. This removes the two highest correlations without losing much information, since the two different rock size categories are retained.

Now scan the bottom row of plots. Note the classic Poisson distribution apparent in the plot of maturetrees vs. lizards, as well as the general tendency of the variance in lizards to scale with the mean. This indicates that a Poisson-distributed error structure is likely to be appropriate. Also scan the second-to-bottom row of plots and note how several of the stronger relationships are linearized to some extent by the log transformation of lizards. This indicates that the default log link function of Poisson regression is likely to be appropriate as well. Because the data fit what we expect from a Poisson distribution, a Poisson GLM accommodates the nonlinearity, heteroscedasticity, and non-normality that would trouble a general linear model, so we no longer need to worry about them.

Finally, note the apparently quadratic relationship between woodebris and L.lizards. This indicates that it may be appropriate to include the woodebris^2 term in your model.
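If you would like to redraw the pairplot with totalrock removed before moving on, one option is to drop that column by name. This is a minimal sketch, assuming the matrix Z built above; the object name Z2 is an arbitrary choice:

# keep every column of Z except L.totalrock, then redraw the pairplot
Z2 <- Z[, colnames(Z) != "L.totalrock"]
pairs(Z2, lower.panel=panel.smooth2, upper.panel=panel.cor, diag.panel=panel.hist)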
2. Model selection

The lm function we have been using for general linear models cannot handle generalized linear models. Instead, we will need to use the glm function, which is part of the stats package. Calls to glm look very similar to calls to lm, except that you have the option of specifying a specific non-normal distribution of error terms. Start by fitting a model that predicts the raw number of lizards as a function of all explanatory variables except totalrock. Do not include woodebris2 yet. Note that adding the family = poisson argument is all you need to do to switch from the default Gaussian distribution of error terms to a Poisson distribution:

fit0.P <- glm(lizards ~ Location + L.X11_30cm + L.X31_60cm + woodebris +
              L.leafdebris + L.seedling + L.sapling + maturetrees,
              family=poisson, data=DataT)

Last week taught us that model selection is tricky in this case, likely due to collinearity as well as the lack of strong univariate relationships between each explanatory variable and the response variable. Let's use a combination of forward and backward stepwise regression for the most thorough model selection possible. For this, you will need to install and load the MASS package. You can then use the following function:

step <- stepAIC(fit0.P, direction="both")

Note that, at each step, remaining variables are dropped one by one from the model, and previously excluded variables are added one by one. The change that yields the greatest improvement in AIC score is retained.

3. Model validation

Fit a new model that includes only the variables identified by your stepwise model selection:

fit1.P <- glm(lizards ~ L.X11_30cm + L.X31_60cm + maturetrees,
              family=poisson, data=DataT)

Now validate this model by examining the residuals:

par(mfrow=c(2,2))
plot(fit1.P)

These are not horrendous, but the bow shape in the left two plots is reminiscent of the nonlinear relationship we noted between woodebris and lizards. Let's add the quadratic term to our model, repeat the selection procedure, and see if it is retained and improves the residuals:

fit2.P <- glm(lizards ~ Location + L.X11_30cm + L.X31_60cm + woodebris +
              woodebris2 + L.leafdebris + L.seedling + L.sapling + maturetrees,
              family=poisson, data=DataT)
step <- stepAIC(fit2.P, direction="both")

This does, indeed, yield a different model that includes the quadratic term. Validate the new model by looking at the residuals:

fit3.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
              family=poisson, data=DataT)
par(mfrow=c(2,2))
plot(fit3.P)

The bowed shape in the residuals is now gone, which is good. The raw residuals show some heteroscedasticity, with the variance increasing with the fitted value, but this is to be expected with a Poisson distribution and should not cause alarm. We also don't necessarily expect the residuals to be normally distributed, so we can largely ignore the Q-Q plot. The plot of residuals vs. leverage identifies one potentially problematic data point, in row 13. In the interpretation phase, we should try analyzing the data with and without point 13 in order to assess the robustness of our conclusions with respect to this one point.

Before moving on to interpretation, however, we should also plot the residuals against every other explanatory variable (both those included in and excluded from the model) to check for any remaining patterns, which might indicate the need to include additional variables or interaction terms in the model. You can use code such as this to accomplish that task:

par(mfrow=c(3,2))
E <- rstandard(fit3.P)
plot(y = E, x = DataT$L.X11_30cm, xlab = "L.X11_30cm", ylab = "Residuals")
abline(0,0)
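Rather than copying that plotting block once for each variable, you can loop over the explanatory variables. This is a minimal sketch, assuming fit3.P and DataT from above; the vars vector and the 3x3 panel layout are illustrative choices:

# standardized residuals from the selected model
E <- rstandard(fit3.P)

# check residuals against every explanatory variable, in or out of the model
vars <- c("Location", "L.X11_30cm", "L.X31_60cm", "woodebris",
          "L.leafdebris", "L.seedling", "L.sapling", "maturetrees")

par(mfrow=c(3,3))
for (v in vars) {
  plot(y = E, x = DataT[[v]], xlab = v, ylab = "Residuals")
  abline(0, 0)
}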
Finally, it is important to check a Poisson regression for evidence of overdispersion (a variance greater than the mean). To do this, fit your model using a quasipoisson distribution and use the summary function to look at the output:

fit4.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
              family=quasipoisson, data=DataT)
summary(fit4.P)

Near the bottom of the output, you will see a line that says "Dispersion parameter for quasipoisson family taken to be ...". This is the estimated dispersion parameter, which describes the mean-variance relationship of your residuals. Specifically, the dispersion parameter ρ is defined such that, if E[Yi] = μi, the variance of Yi is modeled as ρμi. A perfect Poisson distribution has a dispersion parameter of 1 (meaning that the mean equals the variance), and a value of ρ greater than 1 indicates overdispersion. If you use a quasipoisson model, the standard errors of your estimates will be multiplied by the square root of ρ. Overdispersion thus decreases precision and power, and wrongly ignoring overdispersion will bias your p-values downward, leading to inflated type I error rates. Overdispersion can be caused by model misspecification, and this possibility should be investigated first (see page 87 of your text). If ρ > 1, it is also possible that your data are truly overdispersed. As a rule of thumb, a ρ larger than 1.5 means that some action (such as using a quasipoisson model) should be taken, while a ρ larger than 15 or 20 indicates that you should also consider other methods, such as the negative binomial distribution or zero-inflated models (Zuur et al. 2009). In this case, the value of 1.06 reported by the summary function indicates no strong evidence of overdispersion, so it is safe to go back to a Poisson distribution of error terms with no correction.

4. Model interpretation

Once you are happy that your model is valid, you may finally turn your attention to hypothesis testing and interpretation. Once again, there are several different ways to obtain p-values for either your overall model or your individual explanatory variables. The summary command gives parameter estimates as well as p-values from a z-test on each explanatory variable. (The z-test is used rather than a t-test because the variance does not have to be estimated separately from the mean.) Be careful in interpreting the slope estimates, however: because of the log link, the linear function of explanatory variables predicts the natural log of the mean response, not the mean response itself.

summary(fit3.P)

The summary command also gives the null deviance and residual deviance values (analogous to the total sum of squares and the residual sum of squares in a general linear model). Because the difference in deviance values is approximately Chi-square distributed, these values can be used to test the significance of the overall model, although the p-value is approximate and should be interpreted with caution. Perform a deviance test on the overall model using this line of code:

1 - pchisq(nulldeviance - residualdeviance, nulldf - residualdf)

For example, with these particular results, you can use the line of code:

1 - pchisq(52.936 - 17.055, 3)

A deviance test can also be used to provide p-values for each individual explanatory variable, and it is preferable to the z-test for small datasets (Zuur 2010). Perform this deviance test using the following code:

drop1(modeloutput, test = "Chi")

Note that, if you used a quasipoisson distribution, you would need to use test = "F" instead.
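If you prefer not to copy the deviance values by hand, the same quantities can be pulled directly from the fitted object. A minimal sketch, assuming fit3.P from above:

# overall deviance test, equivalent to typing the values from summary() by hand
with(fit3.P, 1 - pchisq(null.deviance - deviance, df.null - df.residual))

# per-variable deviance tests (use test = "F" for a quasipoisson fit)
drop1(fit3.P, test = "Chi")

# quick overdispersion check without refitting: the Pearson chi-square statistic
# divided by the residual degrees of freedom estimates the dispersion parameter
sum(residuals(fit3.P, type = "pearson")^2) / df.residual(fit3.P)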
Because point number 13 was identified as being particularly influential, you can repeat your model fit without that point and assess the effect:

fit5.P <- glm(lizards ~ woodebris + woodebris2 + maturetrees,
              family=poisson, data=DataT[-13,])
summary(fit5.P)

The results do not change substantially; all variables are still highly significant, and the regression coefficients fall within one standard error of the original estimates. It is okay to forge ahead with your original dataset.

The final step in model interpretation is visualization. One particularly handy command gives partial regressions of each of the variables in your model (review section 5.3 in Zuur for an explanation of partial regressions). Try this in order to visualize the relationship between each of your explanatory variables and your response variable:

termplot(modeloutput, partial.resid=TRUE, smooth=panel.smooth, col.res=1)

A final visualization is particularly difficult in this case because you have two different explanatory variables in your model. We can gain some insight, however, by visualizing just the quadratic relationship between woodebris and lizards. Spend a few minutes unpacking the following lines of code. The first simply creates a scatterplot of woodebris vs. lizards. The second creates a vector of evenly spaced values spanning the range of woodebris. The third implements the model fit using the coefficient estimates from the summary command: it plots the x-values (contained in vector x) against the predicted number of lizards. Note that the number of lizards is predicted by raising e to the power of the linear predictor, and that the linear predictor uses the intercept of the overall model plus an offset associated with the mean value of maturetrees.

plot(DataT$woodebris, DataT$lizards, xlab="Woody Debris", ylab="Number of Lizards")
x <- 3:35
lines(x, exp(-0.607286 + 0.330792*x - 0.008818*x^2 -
             0.618493*mean(DataT$maturetrees)))

Color code the points on the plot according to their respective values of maturetrees, as sketched below. This would make an excellent final visualization.
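One way to implement that color-coding is to bin maturetrees and map each bin to a shade. A minimal sketch, assuming fit3.P and DataT from above; the four grey-scale bins are an arbitrary choice, and the coefficients are the ones used in the lines() call above:

# bin maturetrees into four classes and assign each point a grey level
bins <- cut(DataT$maturetrees, breaks = 4)
pal  <- gray.colors(4, start = 0.8, end = 0.1)   # light = few mature trees
plot(DataT$woodebris, DataT$lizards, pch = 19, col = pal[as.integer(bins)],
     xlab = "Woody Debris", ylab = "Number of Lizards")

# fitted quadratic curve, evaluated at the mean value of maturetrees
x <- 3:35
lines(x, exp(-0.607286 + 0.330792*x - 0.008818*x^2 -
             0.618493*mean(DataT$maturetrees)))
legend("topleft", legend = levels(bins), pch = 19, col = pal,
       title = "maturetrees")

If you would rather not hard-code the coefficients, predict(fit3.P, newdata = data.frame(woodebris = x, woodebris2 = x^2, maturetrees = mean(DataT$maturetrees)), type = "response") traces the same curve from the fitted object.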
Lab 7 Assignment

Spend some time exploring the effect of potential interaction terms to see if you can further improve the fit of your model. In a Word document, include the summary output of your best model (with the model formula at the top), the drop1 output of this model, and your best attempt at a final visualization.
Solution Preview


#1. Data exploration
# Last week, all of you noticed the highly skewed distribution of many of your
# predictor variables and decided to use a log transformation. This is still a good idea.
# Copy your code from last week, or else use the code below, to
# 1) log10+1 transform skewed predictor variables,
# 2) create a new column containing the squared values of woodebris,
# 3) log10+1 transform the response variable for visualization, and
# 4) bind all of these together with the untransformed predictor variables
#    that were not skewed into a matrix called Z. Name the columns of this matrix
#    and convert it into a data frame. Note that I have placed the response variable
#    as the last column. This makes it easier to interpret any pairplots,
#    since the response variable will appear on the y-axis, as we are used to seeing it.

# 1)
# read in the data (the lab handout reads "Jamesdata.txt"; adjust the file
# name and separator to match your copy of the data)
data <- read.table("Jamesdata.csv", header=TRUE, sep=",")
data$L.X11_30cm <- log10(data$X11_30cm+1)
data$L.X31_60cm <- log10(data$X31_60cm+1)
data$L.totalrock <- log10(data$totalrock+1)
data$L.leafdebris <- log10(data$leafdebris+1)
data$L.seedling <- log10(data$seedling+1)
data$L.sapling <- log10(data$sapling+1)
data$L.snag_seed <- log10(data$snag_seed+1)


# 2)
# prepare to try a quadratic regression with woodebris
data$woodebris2 <- (data$woodebris)^2
# 3)
# log transform the response variable for visualization only
data$L.lizards <- log10(data$lizards+1)

# 4)
# create a new data.frame containing these transformations

Z <- cbind(data$location,
           data$L.X11_30cm,
           data$L.X31_60cm,
           data$L.totalrock,
           data$woodebris,
           data$woodebris2,
           data$L.leafdebris,
           data$L.seedling,
           data$L.sapling,
           data$maturetrees,
           data$L.lizards,
           data$lizards)

# quick check: Z should have 17 rows (one per trap array) and 12 columns
dim(Z)
# set names for these new variables (matching the lab handout)
colnames(Z) <- c("Location", "L.X11_30cm", "L.X31_60cm", "L.totalrock",
                 "woodebris", "woodebris2", "L.leafdebris", "L.seedling",
                 "L.sapling", "maturetrees", "L.lizards", "lizards")
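Following the lab handout above, the remaining data-exploration steps would be as follows. This sketch assumes the AEDgraphingfunctions.R helper file that the handout sources:

# convert Z to a data frame for use in model fitting
DataT <- as.data.frame(Z)

# pairplot with smoothers, correlations, and histograms (panel functions
# come from the AEDgraphingfunctions.R file sourced in the lab handout)
source("AEDgraphingfunctions.R")
pairs(Z, lower.panel=panel.smooth2, upper.panel=panel.cor, diag.panel=panel.hist)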
