## Transcribed Text

Multiple Regression
Many problems in science involve the analysis of multi-variable data sets. For data sets in which there is a single continuous dependent variable, but several continuous and potentially discrete independent variables, multiple regression is used. Multiple regression is a method of fitting linear models of the form:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} + \varepsilon_i, \quad \text{where } \varepsilon_i \sim \mathrm{Norm}(0, \sigma^2)$$

$Y_i$ is the response of subject $i$.
$\beta_0$ is the y-intercept of the model, one of the model's coefficients.
$\beta_j$ are the model coefficients or weights that relate the $X_{ji}$ to the experimental treatments or explanatory variables.
$X_{ji}$ are the values of the measured independent variables, or codes used to classify a subject as to group membership. For example, 1 for receiving medicine and 0 for placebo might be the codes we use in a two-group design.
$\varepsilon_i$ are random errors or residuals that arise from the deviation of the observed values of the responses from the model's predictions.
The values of the regression coefficients are determined by minimizing the sum of squares of the residuals (errors), i.e., minimizing

$$\min \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \left(Y_i - \left[\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi}\right]\right)^2$$

where $\hat{Y}_i$ is the predicted value from the model associated with observation $Y_i$.
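In R, a linear model of this form is fit by least squares with the lm function. A minimal sketch, where the data.frame mydata and the variables Y, X1, and X2 are hypothetical placeholders:

# hypothetical example: fit Y = b0 + b1*X1 + b2*X2 + error by least squares
fit=lm(Y ~ X1 + X2, data=mydata)
coef(fit)    # the estimated beta coefficients
resid(fit)   # the residuals e_i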
Hypothesis tests about the regression coefficients, or about the contribution of particular terms or groups of terms to the fit of the model, are then performed to determine the utility of the model. In many studies, regression programs are used to generate a series of models; the "best" of these models is then chosen on the basis of a variety of criteria. These models can be generated using algorithms in which the addition of variables to the model is done in a stepwise fashion, or by examining a large number of, or even all, possible regression models given the data.
Stepwise Regression
Stepwise regression builds a model by adding or removing variables one at a time. Stepwise methods have lost much of their popularity because it has been shown that they are not guaranteed to select the "best" model.
In a forward stepwise regression, new independent variables are added to the model if they meet a set significance criterion for inclusion (often p < 0.05 for the partial F-test for the inclusion of the term in the model). The variable with the lowest p-value is added to the model at each step, and the algorithm stops when no new variable meets the significance criterion.
In a backwards stepwise regression, all independent variables are initially entered into the model. They are then sequentially removed if they do not meet a set significance criterion for retention (often p > 0.1 or p > 0.05 for the partial F-test for removal of a term). The variable with the highest p-value is removed from the model at each step until no variable meeting the removal criterion remains.
Stepwise regression proper combines both techniques, with variables added or removed at each step of the process. A variable is entered if it meets the p-value to enter. After each variable is added to the equation, all other variables in the equation are tested against the p-value to remove a term, and if necessary a variable is removed from the model.
SPSS, SAS, MINITAB, SYSTAT, BMDP, and other statistical packages include these routines. The computer output generated by these routines consists of a series of models for estimating the value of Y, along with goodness-of-fit statistics for each model. Each model estimates the value of Y as a linear combination of the values of the predictor variables included in that model. In R, no stepwise regression routine using a partial F-test is available in the base installation or as a user-contributed package. R does offer a stepwise function called "stepAIC" (in the MASS package) that uses Akaike's Information Criterion as the basis for stepwise model selection. We will avoid this information-theoretic approach to model selection for now. However, a former student (Joe Hill Warren) wrote a function called "StepF" which we can use to examine the behavior of these algorithms. Note that StepF does not perform the combined stepwise algorithm, only the forward and backward selection algorithms.
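For reference only, a minimal sketch of that AIC-based routine; stepAIC is in the MASS package, and dd is the data.frame of candidate variables assembled later in this lab:

# sketch of the AIC-based alternative (not used in this lab)
library(MASS)
full=lm(logspec ~ ., data=dd)              # start from the full model
step.mod=stepAIC(full, direction="both")   # add/drop terms to minimize AIC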
Later in this Lab we will address issues in building and assessing regression models. We could at this point use a number of techniques to examine our data before beginning the process of model selection, or we could use those same techniques after developing a set of candidate models to assess. In this demonstration I will take the latter approach, postponing a detailed assessment of whether a model meets the assumptions of regression until later.
To demonstrate multiple regression we will examine data on the species richness of plants in the Galapagos Islands. The data file Galapagos-plants.txt contains species richness and the number of endemic species for plants on 29 islands, along with data about the physical characteristics of the islands (island name, island area, maximum elevation, distance to nearest island, area of nearest island, distance to Santa Cruz island, and the number of botanical collecting trips to each island).
# read in data file on Galapagos plants
dat=read.table("K:/Biometry/Biometry-Fall-2015/Lab10/galapagos-plants.txt", header=TRUE)
head(dat)
##   Isla Spec  Area Elev DisN DisS   AreA Coll Endm
## 1 Balt   57 25.09  150  0.6  0.6 903.82   10   23
## 2 Bart   31  1.24  109  0.6 26.3 572.33    6   21
## 3 Cald    3  0.21  114  2.8 58.7 170.92    1    3
## 4 Cham   25  0.10   46  1.9 47.4   0.18    4    9
## 5 Coam    2  0.05   10  1.9  1.9 903.82    1    1
## 6 Daph   18  0.34   10  8.0  8.0 903.82    6   11
It is traditional to examine the relationship between the log(number of species) and log(area), so I will create these variables and a new data.frame to hold them along with the other original variables for the analysis.
# create variable to be used in regression and put in new data.frame
logarea=log(dat$Area)
logspec=log(dat$Spec)
elev=dat$Elev
diss=dat$DisS
disn=dat$DisN
coll=dat$Coll
areA=dat$AreA
dd=data.frame(logspec,logarea,elev,diss,disn,areA,coll)
head(dd)
##     logspec    logarea elev diss disn   areA coll
## 1 4.0430513  3.2224694  150  0.6  0.6 903.82   10
## 2 3.4339872  0.2151114  109 26.3  0.6 572.33    6
## 3 1.0986123 -1.5606477  114 58.7  2.8 170.92    1
## 4 3.2188758 -2.3025851   46 47.4  1.9   0.18    4
## 5 0.6931472 -2.9957323   10  1.9  1.9 903.82    1
## 6 2.8903718 -1.0788097   10  8.0  8.0 903.82    6
As an initial diagnostic step, I obtain the correlation matrix of the variables. I set options(digits=4) to control how many digits are printed so the matrix will not wrap around.
# obtain correlation matrix of variables
options(digits=4)
cor(dd)
##         logspec logarea   elev    diss    disn    areA    coll
## logspec  1.0000  0.8906 0.6736 -0.0842  0.1500 -0.0134  0.8522
## logarea  0.8906  1.0000 0.8171  0.0568  0.2471  0.0613  0.8294
## elev     0.6736  0.8171 1.0000  0.0068  0.0082  0.2537  0.6899
## diss    -0.0842  0.0568 0.0068  1.0000  0.6144 -0.1069 -0.1327
## disn     0.1500  0.2471 0.0082  0.6144  1.0000 -0.1722  0.0890
## areA    -0.0134  0.0613 0.2537 -0.1069 -0.1722  1.0000 -0.1425
## coll     0.8522  0.8294 0.6899 -0.1327  0.0890 -0.1425  1.0000
Note that we can already see that logspec is strongly associated with logarea and coll (the number of collecting trips), and less strongly associated with elev, so we might expect these to be the variables that are entered into the regression models.
Now we will source the StepF.R code file. Note that in Rmarkdown you need to give the full path and name of the file to be sourced.
Now let's use the "forward" stepwise approach to select a model. Note that the output records a multi-step process. At each step, partial F-tests are reported that test whether the reduction in the residual sum of squares from adding each variable to the model individually would be statistically significant. The variable that causes the greatest reduction in the residual sum of squares is added to the model. Note on iteration 1 that with only the grand mean in the model the RSS (residual sum of squares) is 70.6, but that if logarea is added to the model the RSS will be reduced to 14.6. Logarea also has the smallest p-value, so logarea is added to the model first. On iteration 2, after logarea has been added, we see that only the addition of coll would result in a statistically significant reduction in the RSS at α = 0.05 (p = 0.017). Therefore, coll is added to the model. However, on iteration 3 none of the remaining variables has p < 0.05, so the algorithm stops after adding logarea and coll to the model.
source("K:/Biometry/Biometry-Fall-2015/Lab10/StepF.R")
# perform forward stepwise regression
mod.7=StepF(dataTable=dd,response="logspec", level=0.05, direction="forward")
## ==================== Iteration #1 ====================
## Single term additions
##
## Model:
## logspec ~ 1
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               70.6  27.81
## logarea  1      56.0 14.6 -15.89  103.54 9.8e-11 ***
## elev     1      32.0 38.6  12.27   22.43 6.2e-05 ***
## diss     1       0.5 70.1  29.60    0.19    0.66
## disn     1       1.6 69.0  29.15    0.62    0.44
## areA     1       0.0 70.6  29.80    0.00    0.94
## coll     1      51.3 19.3  -7.76   71.61 4.5e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with lowest p-value: logarea 9.76507623932042e-11
## Updating model formula: .~. +logarea
##
## ==================== Iteration #2 ====================
## Single term additions
##
## Model:
## logspec ~ logarea
##        Df Sum of Sq  RSS   AIC F value Pr(>F)
## <none>             14.6 -15.9
## elev    1     0.621 14.0 -15.2    1.15  0.292
## diss    1     1.287 13.3 -16.6    2.51  0.125
## disn    1     0.369 14.2 -14.6    0.67  0.419
## areA    1     0.328 14.3 -14.6    0.60  0.447
## coll    1     2.916 11.7 -20.4    6.49  0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with lowest p-value: coll 0.0171409159531924
## Updating model formula: .~. +coll
##
## ==================== Iteration #3 ====================
## Single term additions
##
## Model:
## logspec ~ logarea + coll
##        Df Sum of Sq  RSS   AIC F value Pr(>F)
## <none>             11.7 -20.4
## elev    1     0.728 11.0 -20.2    1.66   0.21
## diss    1     0.381 11.3 -19.3    0.84   0.37
## disn    1     0.061 11.6 -18.5    0.13   0.72
## areA    1     0.000 11.7 -18.4    0.00   0.98
## ========== No further variables significant at 0.05 ==========
## Final Model:
##
## Call:
## lm(formula = logspec ~ logarea + coll)
##
## Coefficients:
## (Intercept)      logarea         coll
##      2.4490       0.2691       0.0593
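As an aside, the partial F-tests reported at each iteration can be reproduced by hand with anova() on nested models. A minimal sketch that should reproduce the logarea line of iteration 1 (F = 103.54, p = 9.8e-11):

# partial F-test for adding logarea to the intercept-only model
m0=lm(logspec ~ 1, data=dd)
m1=lm(logspec ~ logarea, data=dd)
anova(m0, m1)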
We can use the "backwards" stepwise approach as well. This algorithm will not always converge on the same model as the forward approach, but in this instance it does. In the backwards approach, all variables are initially put into the model, and those whose removal causes the smallest increase in the RSS are sequentially removed. It takes a couple more iterations than the forward approach, but it converges on the same "best model."
# perform backwards stepwise regression
mod.8=StepF(dataTable=dd,response="logspec",level=0.05, direction="backward")
## ==================== Iteration #1 ====================
## Single term deletions
##
## Model:
## logspec ~ logarea + elev + diss + disn + areA + coll
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               10.4 -15.69
## logarea  1      6.94 17.4  -2.87   14.66 0.00091 ***
## elev     1      0.85 11.3 -15.40    1.80 0.19312
## diss     1      0.12 10.5 -17.34    0.26 0.61364
## disn     1      0.04 10.5 -17.57    0.09 0.76936
## areA     1      0.05 10.5 -17.55    0.10 0.75309
## coll     1      1.90 12.3 -12.82    4.02 0.05749 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with highest p-value: disn 0.76935945292974
## Updating model formula: .~. -disn
##
## ==================== Iteration #2 ====================
## Single term deletions
##
## Model:
## logspec ~ logarea + elev + diss + areA + coll
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               10.5 -17.57
## logarea  1      7.59 18.1  -3.75   16.69 0.00046 ***
## elev     1      0.83 11.3 -17.36    1.82 0.19039
## diss     1      0.35 10.8 -18.61    0.77 0.38880
## areA     1      0.05 10.5 -19.43    0.11 0.73924
## coll     1      1.92 12.4 -14.68    4.22 0.05140 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with highest p-value: areA 0.739236282590088
## Updating model formula: .~. -areA
##
## ==================== Iteration #3 ====================
## Single term deletions
##
## Model:
## logspec ~ logarea + elev + diss + coll
##         Df Sum of Sq  RSS   AIC F value Pr(>F)
## <none>               10.5 -19.4
## logarea  1      7.82 18.3  -5.3   17.85 0.0003 ***
## elev     1      0.80 11.3 -19.3    1.82 0.1904
## diss     1      0.45 11.0 -20.2    1.02 0.3219
## coll     1      2.05 12.6 -16.3    4.68 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with highest p-value: diss 0.321891770767111
## Updating model formula: .~. -diss
##
## ==================== Iteration #4 ====================
## Single term deletions
##
## Model:
## logspec ~ logarea + elev + coll
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               11.0 -20.22
## logarea  1      7.38 18.3  -7.28   16.84 0.00038 ***
## elev     1      0.73 11.7 -20.35    1.66 0.20926
## coll     1      3.02 14.0 -15.15    6.89 0.01454 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with highest p-value: elev 0.209259986113194
## Updating model formula: .~. -elev
##
## ==================== Iteration #5 ====================
## Single term deletions
##
## Model:
## logspec ~ logarea + coll
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               11.7 -20.35
## logarea  1      7.64 19.3  -7.76   17.00 0.00034 ***
## coll     1      2.92 14.6 -15.89    6.49 0.01714 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ========== All variables significant at 0.05 ==========
## Final Model:
##
## Call:
## lm(formula = logspec ~ logarea + coll)
##
## Coefficients:
## (Intercept)      logarea         coll
##      2.4490       0.2691       0.0593
There are other ways in which the StepF function can be used. One approach is to build a model that contains variables you wish to "force" into the model, and then check whether any other variables will be added after those forced in. For example, suppose we wanted to force coll into the model, and wanted to know whether other variables explain residual variation in logspec after accounting for coll. We could think of coll as a nuisance variable that measures the differences in sampling effort across the islands. Perhaps we want to know which other variables are still useful in explaining variation among the islands in the species richness of plants after accounting for the variable sampling effort.
To do this, we first build a linear model with coll only and save it in a model object. Then we call StepF, specifying the model object name as our initial model, and then the "scope" argument with coll and any other variable we wish to assess. Note that this process indicates that, even after accounting for the variability in sampling effort among islands, logarea still explains residual variation in logspec.
# to determine if any variables would be added to a model with only coll as the predictor variable
# first build linear model with coll as the only predictor variable
lm1=lm(formula=logspec~coll)
# then use StepF specifying the model with coll and a "scope" argument listing coll and the other candidate variables
StepF(model=lm1, scope=formula(~ coll+logarea+elev+disn+diss+areA), level=0.05, direction="forward")
## ==================== Iteration #1 ====================
## Single term additions
##
## Model:
## logspec ~ coll
##         Df Sum of Sq  RSS    AIC F value  Pr(>F)
## <none>               19.3  -7.76
## logarea  1      7.64 11.7 -20.35   17.00 0.00034 ***
## elev     1      0.99 18.3  -7.28    1.40 0.24676
## disn     1      0.39 18.9  -6.35    0.54 0.46998
## diss     1      0.06 19.3  -5.85    0.08 0.77851
## areA     1      0.84 18.5  -7.05    1.18 0.28692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Variable with lowest p-value: logarea 0.000338442216827803
## Updating model formula: .~. +logarea
##
## ==================== Iteration #2 ====================
## Single term additions
##
## Model:
## logspec ~ coll + logarea
##        Df Sum of Sq  RSS   AIC F value Pr(>F)
## <none>             11.7 -20.4
## elev    1     0.728 11.0 -20.2    1.66   0.21
## disn    1     0.061 11.6 -18.5    0.13   0.72
## diss    1     0.381 11.3 -19.3    0.84   0.37
## areA    1     0.000 11.7 -18.4    0.00   0.98
## ========== No further variables significant at 0.05 ==========
## Final Model:
##
## Call:
## lm(formula = logspec ~ coll + logarea)
##
## Coefficients:
## (Intercept)         coll      logarea
##      2.4490       0.0593       0.2691
The StepF pdf file explains other ways in which you might use the StepF function.
Note that the StepF function does not report the regression coefficients nor does it compute the residuals and other fit statistics for the final model. After using StepF to select models, one then must use the lm function to fit the selected models and evaluate their adequacy.
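For example, a minimal follow-up for the model selected above might look like this:

# refit the selected model with lm and examine it
mod.best=lm(logspec ~ logarea + coll, data=dd)
summary(mod.best)     # coefficients, R^2, overall F-test
par(mfrow=c(2,2))
plot(mod.best)        # residual and influence diagnostic plots
par(mfrow=c(1,1))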
Best Subsets Regression
An alternative approach to model selection is to compute all possible regressions given a set of candidate explanatory variables, or at least the best subset of models for various levels of model complexity. By model complexity, I mean the number of predictor variables included in the model.
To calculate the number of possible models with 6 predictor variables, we can compute the number of permutations of 6 variables taken 1, 2, 3, 4, or 5 at a time, add them up, and add one for the full six-variable model. A permutation is like a combination except that we consider the case AB different from the case BA. The number of permutations of n things taken k at a time is:

$$_nP_k = \frac{n!}{(n-k)!}$$

The calculation of the number of permutations is similar to the calculation of the number of combinations of n things taken k at a time except that it lacks a factor of k! in the denominator. R code to calculate the number of possible models is given below. Do you want to generate, fit, and assess all possible 1237 models? Remember that in regression we use Type I (sequential) sums of squares, so in models with different orderings of the variables the individual variables may explain different amounts of variation in the response variable.
# calculate the number of permutations of 6 variables for models with 1 to 5 predictor variables
perm=rep(0,5)
n=6
for (k in 1:(n-1)){
  perm[k]=factorial(n)/factorial(n-k)
}
perm

## [1]   6  30 120 360 720

totperm=sum(perm)+1   # add 1 for the full six-variable model
totperm

## [1] 1237
Rather than tackling the daunting task of examining 1237 models, we will use the regsubsets function from the package leaps to select the best k models with 1 predictor, 2 predictors, etc.
In leaps we will use the regsubsets function and generate the 3 best models for each level of complexity. You could choose to do more, but the graphical display of the results becomes problematic with large subset sizes. Running the regsubsets function requires the model formula and the specification of the subset size. Performing a summary of the regsubsets object results in a tabulation of the models ranked in order of best fit. An "*" indicates that the variable is included in the model.
# load package leaps
library(leaps)
# to get k best regression models for each size
k=3
mm=regsubsets(logspec~logarea+elev+disn+diss+areA+coll,data=dd,nbest=k)
summary(mm)
## Subset selection object
## Call: regsubsets.formula(logspec ~ logarea + elev + disn + diss + areA +
##     coll, data = dd, nbest = k)
## 6 Variables (and intercept)
##         Forced in Forced out
## logarea     FALSE      FALSE
## elev        FALSE      FALSE
## disn        FALSE      FALSE
## diss        FALSE      FALSE
## areA        FALSE      FALSE
## coll        FALSE      FALSE
## 3 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          logarea elev disn diss areA coll
## 1  ( 1 ) "*"     " "  " "  " "  " "  " "
## 1  ( 2 ) " "     " "  " "  " "  " "  "*"
## 1  ( 3 ) " "     "*"  " "  " "  " "  " "
## 2  ( 1 ) "*"     " "  " "  " "  " "  "*"
## 2  ( 2 ) "*"     " "  " "  "*"  " "  " "
## 2  ( 3 ) "*"     "*"  " "  " "  " "  " "
## 3  ( 1 ) "*"     "*"  " "  " "  " "  "*"
## 3  ( 2 ) "*"     " "  " "  "*"  " "  "*"
## 3  ( 3 ) "*"     " "  "*"  " "  " "  "*"
## 4  ( 1 ) "*"     "*"  " "  "*"  " "  "*"
## 4  ( 2 ) "*"     "*"  "*"  " "  " "  "*"
## 4  ( 3 ) "*"     "*"  " "  " "  "*"  "*"
## 5  ( 1 ) "*"     "*"  " "  "*"  "*"  "*"
## 5  ( 2 ) "*"     "*"  "*"  "*"  " "  "*"
## 5  ( 3 ) "*"     "*"  "*"  " "  "*"  "*"
## 6  ( 1 ) "*"     "*"  "*"  "*"  "*"  "*"
Although not printed as part of the summary, the summary of the regsubsets object contains much more information, including the R² values for each model. I demonstrate below one way to extract those R² values from the summary. You can learn more about the information in the summary by using str() on the summary.
# to get rsq values from regsubsets object
nummods=k*(n-1)+1      # 3 models each for sizes 1-5, plus the single size-6 model
m=rep(0,nummods)
a=summary(mm)
for (i in 1:nummods){
  m[i]=a[[2]][[i]]     # component 2 of the summary holds the R^2 values
}
m
##  [1] 0.7932 0.7262 0.4538 0.8345 0.8114 0.8020 0.8448 0.8399 0.8353 0.8511
## [11] 0.8495 0.8469 0.8518 0.8518 0.8507 0.8524
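The same values can also be pulled from the summary by component name, which is less fragile than indexing by position; assuming the usual leaps summary structure:

# equivalent: extract the R^2 values by name
a$rsq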
There are other options to graphically display the results from regsubsets. For example, one kind of plot shows the model R² on the y-axis and indicates which variables are in each model by shading on the x-axis. White indicates that the variable is not in the model, while shading indicates that it is. Variables that are mostly white contribute to few models, while variables that are mostly black contribute to many models. In our example, as you might expect, logarea, coll, and elev are mostly black while the other variables are mostly white.
# to plot a summary of which variables are in each model ranked by r2
plot(mm,scale="r2")
plot(mm,scale="adjr2")
A similar plot can be generated for other goodness-of-fit statistics such as the adjusted R². Remember that

$$R^2 = 1 - \frac{SS_{error}}{SS_{total}}$$

and the adjusted R², which penalizes the model for its complexity, is

$$R^2_{adjusted} = 1 - \frac{SS_{error}/df_{error}}{SS_{total}/df_{total}}.$$
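As a quick hand-check of these formulas, a sketch using the logarea + coll model selected earlier; the results can be compared against summary(fit2):

# compute R^2 and adjusted R^2 by hand for the two-predictor model
fit2=lm(logspec ~ logarea + coll, data=dd)
SSerr=sum(resid(fit2)^2)
SStot=sum((dd$logspec - mean(dd$logspec))^2)
1 - SSerr/SStot                                        # R^2
1 - (SSerr/df.residual(fit2))/(SStot/(nrow(dd) - 1))   # adjusted R^2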
Finally, there is another graphical display of the results available in the car package. The function "subsets" in car will plot the results of a call to regsubsets from leaps.
# plot a summary of results
library(car)
subsets(mm,statistic = "rsq", abbrev=1, legend=TRUE, cex=0.7)
##         Abbreviation
## logarea            l
## elev               e
## disn             dsn
## diss             dss
## areA               a
## coll               c
This summary is similar to the previous one, but labels the models with the included variables. For models with many variables or for larger subset sizes, the labels overlap, making the plot unreadable.
You can also use the regsubsets function with the argument "method" specified as "forward", "backward", or "seqrep" (for sequential replacement) rather than the default, which is "exhaustive".
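A sketch of such a call, using the same formula as before:

# forward selection instead of the default exhaustive search
mm.fwd=regsubsets(logspec~logarea+elev+disn+diss+areA+coll, data=dd, nbest=3, method="forward")
summary(mm.fwd)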
At this point, after using one of the stepwise algorithms or an exhaustive search, one has a set of candidate models to examine in more detail. Assessing model fit involves all the same procedures used in bivariate regression, since the same assumptions apply. The dependent variable should be normally distributed, scatter plots should indicate linear relationships between the dependent and independent variables, and residual plots should show homoscedasticity (equality of variances in the residuals throughout the regression line). In addition to these issues, one also needs to check for outliers or overly influential data points, and for high inter-correlations between pairs of independent variables (called multicollinearity). If two independent variables are highly correlated (r > 0.9), then including both variables in the model causes problems in parameter estimation. You can pre-screen your independent variables by obtaining a correlation matrix prior to performing the regression and allowing only one variable of each highly correlated pair to serve as a candidate variable for model building at a time.
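A minimal sketch of that pre-screening with base R (no pair exceeds the threshold among the Galapagos predictors):

# flag predictor pairs with |r| > 0.9 before model building
cc=cor(dd[,-1])                     # correlations among the predictors only
cc[upper.tri(cc, diag=TRUE)]=NA     # keep each pair once
which(abs(cc) > 0.9, arr.ind=TRUE)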
Remember that the tools outlined in Lab 9 for assessing model fit are also applicable to multiple regression models. The norm function from QuantPsyc, plot(modelobj), and plot(data.frame) can provide much useful diagnostic information. Other diagnostic procedures are available in the car package.
Advice on Building and Assessing Regression Models
Building
1. Choose the set of candidate predictor variables to potentially be included in the model.
2. Examine the distribution of the response variable to determine if it meets the assumption of normality. Transform if necessary.
3. Examine scatter plots of the relationships between the response variable y and the predictor or independent variables x to determine if the relationships are linear. Potentially transform x or y, or both, to achieve linearity.
4. Examine the correlations between the predictor variables. High correlations (values of r >> 0.9) might indicate linear dependencies among the predictor variables, which can make the estimates of the regression coefficients unstable and inflate the variance of the estimates. Consider deleting one member of each such pair, since the two variables are essentially redundant.
5. Choose the algorithmic approach to fitting a model: in blocks (chunkwise), by forcing entry of variables into the model in a particular sequence, by backwards elimination, by forwards addition of variables to the model, etc.
6. Decide on the criteria you will use for retaining variables in the model (significant partial t or F statistics at a specified α). Build the model.
Assessing
1. Obtain a plot of the standardized residuals against the standardized predicted values. Examine this plot for heterogeneity in the distribution of the residuals. A desirable pattern would have both negative and positive residuals of equal magnitude throughout the length of the predicted regression. The envelope of residuals around the regression line should appear rectangular and be centered on the regression line.
2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If r >> 0.9 for any pair, then try alternative models that eliminate one member of the pair.
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have Cook's D values greater than 1 (see the sketch after this list).
4. Compare alternative models to determine if one or more models fit the data equally well.
5. The model with the best residual pattern, that is not beset with collinearity and influential data points, and that has the highest R² is the best model. Note that R² is the last criterion to use in choosing a model, not the first.
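A minimal sketch of the Cook's D check from item 3, for any fitted model object:

# flag overly influential observations (Cook's D > 1)
fit=lm(logspec ~ logarea + coll, data=dd)
cd=cooks.distance(fit)
which(cd > 1)        # observations with Cook's D above 1, if any
plot(fit, which=4)   # index plot of Cook's distance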
The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the individual "best" model or set of best models if 2 or more models seem to be equally good. You must discuss in detail the reasons for choosing the models that you have selected, including showing plots of residuals, information about the distribution of the response variable, examining outliers, and other metrics to demonstrate goodness-of-fit.
DESCRIPTION OF DATA
The data are stored in the file multr2.csv. The variables are as follows (they are in the same order in the data set):
VARIABLE (UNITS)
______________________________________
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (miles²)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (dependent variable)
The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. The dependent variable is the annual water yield. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.


R Code is given below:

multr2 <- read.csv("C:/Users/oster/Downloads/SR6898998/multr2.csv")
attach(multr2)   # make the column names (annwatyield, meanelev, ...) directly accessible

# examine the distribution of the response variable
hist(annwatyield)
qqnorm(annwatyield)

# log-transform the response
logy = log(annwatyield)

# assemble a data.frame of the candidate variables
dd1=data.frame(meanelev,meantemp,mannprecip,vegden,area,lat,long,elevtemp,precipinten,annwatyield,logy)
head(dd1)

# correlation matrix of the variables
options(digits=4)
cor(dd1)

# scatter plots of each predictor against the log response
plot(meanelev,logy)
plot(meantemp,logy)
plot(mannprecip,logy)
plot(vegden,logy)
plot(area,logy)
plot(lat,logy)
plot(long,logy)
plot(elevtemp,logy)
plot(precipinten,logy)

# log-transform skewed predictors
logvegden = log(vegden)
logarea = log(area)
loglat = log(lat)
loglong = log(long)
logelevtemp = log(elevtemp)
logprecipinten = log(precipinten)

hist(logy)...