## Question

> install.packages(“ISLR”)

> library (ISLR)

> auto=read.csv("Auto.csv")

> attach(auto)

> set.seed (1)

> train=sample (392 ,196)

We then use the subset option in lm() to fit a linear regression using only the observations corresponding to the training set.

> lm.fit =lm(mpg∼horsepower ,data=auto ,subset =train )

We now use the predict() function to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.

> attach (auto)

> mean((mpg -predict (lm.fit ,auto))[-train ]^2) [1] 26.14142

We can use the poly() function to estimate the test error for the quadratic and cubic regressions.

> lm.fit2=lm(mpg∼poly(horsepower ,2) ,data=auto ,subset =train )

> mean((mpg -predict (lm.fit2 ,auto))[-train ]^2) [1] 19.82259

> lm.fit3=lm(mpg∼poly(horsepower ,3) ,data=auto ,subset =train )

> mean((mpg -predict (lm.fit3 ,auto))[-train ]^2) [1] 19.78252

Questions 1-3:

Implement the same validation test for different split of the data into training and validation subsets. Do this by running the random generation number 100 times.

The MSE for the linear model is

a. 22.134 b. 22.132 c. 26.134 d. None of above

The MSE for the quadratic model is

a. 21.617 b. 21.364 c. 21.123 d. None of above

The MSE for the cubic model is

a. 21.701 b. 21.034 c. 20.272 d. None of above

2. Leave-One-Out Cross-Validation.

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. If we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. For instance,

> glm.fit=glm(mpg∼horsepower ,data=auto)

> coef(glm.fit) (Intercept) horsepower

39.9358610 -0.1578447

> lm.fit =lm(mpg∼horsepower ,data=auto)

> coef(lm.fit) (Intercept) horsepower

39.9358610 -0.1578447

In this lab, we will perform linear regression using the glm() function rather than the lm() function Because the former can be used together with cv.glm(). The cv.glm() function is part of the boot library.

> library (boot)

> glm.fit=glm(mpg∼horsepower ,data=auto)

> cv.err =cv.glm(auto ,glm.fit)

> cv.err$delta

[1] 24.23151 24.23114

The two numbers in the delta vector contain the cross-validation results. In this case the numbers are identical (up to two decimal places) and correspond to the MSE computed for LOOCV (see formula

5.1 of the text book).

We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we use the for () function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the ith element of the vector cv.error. We begin by initializing the vector. Do not forget to read Auto.csv file if your R studio session is not saved.

> cv.error=rep (0,5)

> for (i in 1:5){

+ glm.fit=glm(mpg∼poly(horsepower ,i),data=auto)

+ cv.error[i]=cv.glm (auto ,glm.fit)$delta [1]

+ }

>

> cv.error

[1] 24.23151 19.24821 19.33498 19.42443 19.03321

>

Question 4:

Run this model for polynomial function of order 6.

The MSE for this model is

a. 20.034 b. 18.979 c. 19.134 d. None of above

3. k-Fold Cross-Validation.

The cv.glm() function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, o n the auto data set. We once again set a random seed and initialize a vector in which we will

store the CV errors corresponding to the polynomial fits of orders one to ten.

> set.seed (17)

> cv.error.10= rep (0 ,10)

> for (i in 1:10) {

+ glm.fit=glm(mpg∼poly(horsepower,i),data=auto)

+ cv.error.10[i]=cv.glm (auto,glm.fit ,K=10) $delta [1]

+ }

> cv.error.10

[1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609 19.71201 18.95140 19.50196

Notice that the computation time is much shorter than that of LOOCV. We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.

Questions 5-6:

Implement the k-fold cross validation for k=5 for polynomial models up to 5th degree. Use set.seed(25).

The MSE for the linear model is

a. 14.73 b. 23.132 c. 24.21 d. None of above

The MSE for the quadric model is

a. 19.13 b. 19.14 c. 20.13 d. None of above

4. The Bootstrap

We illustrate the use of the bootstrap in the simple example of Section 5.2 using Portfolio data set, as well as on an example involving estimating the accuracy of the linear regression model on the auto data set. One of the great advantages of the bootstrap approach is that it can be applied in almost all situations.

No complicated mathematical calculations are required. Performing a bootstrap analysis in R entails only two steps: First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

To use Portfolio data you need to install boot library (we did this before).

> library(boot)

> names(Portfolio) [1] "X" "Y"

To illustrate the use of the bootstrap on this data, we must first create a function, alpha.fn(), which takes as inpu t the (X, Y) data as well as a vector indicating which observations should be used to estimate α. The function th en outputs the estimate for α based on the selected observations.

> alpha.fn=function (data ,index){

+ X=data$X [index]

+ Y=data$Y [index]

+ return ((var(Y)-cov (X,Y))/(var(X)+var(Y) -2* cov(X,Y)))

+ }

This function returns, or outputs, an estimate for α based on applying (5.7) to the observations indexed by the argument index. For instance, the following command tells R to estimate α using all 100 observations.

> alpha.fn(Portfolio ,1:100)

[1] 0.5758321

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing α based on the new data set.

> set.seed (1)

> alpha.fn(Portfolio ,sample (100 ,100 , replace =T)) [1] 0.5963833

This is one bootstrap. We can implement a bootstrap analysis by performing this command many times, recordi ng all of the corresponding estimates for α, and computing the resulting standard deviation.

However, the boot() function automates this approach. Below we produce R = 1,000 bootstrap estimates for α

> boot(Portfolio,alpha.fn,R=1000) ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = Portfolio, statistic = alpha.fn, R = 1000)

Bootstrap Statistics :

original bias std. error t1* 0.5758321 -7.315422e-05 0.08861826

Questions 7-8:

Implement the bootstrap 5000 times.

The SE of estimated α is

a. 0.0905 b. 0.0893 c. 0.0897

The bootstrap bias is

a. 0.0032

b. 0.0025

c. 0.0027

d. None of above

5. The Bootstrap: Estimating the Accuracy of a Linear Regression Model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimate s for β0 and β1, the intercept and slope terms for the linear regression model that uses

horsepower to predict mpg in the auto data set. We will compare the estimates obtained using the bootstrap to t hose obtained using the formulas for SE(β0) and SE(β1).

We first create a simple function, boot.fn(), which takes in the auto data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model.

We then apply this function to the full set of 392 observations in order to compute the estimates

of β0 and β1 on the entire data set using the usual linear regression coefficient estimate formulas from Chapter 3. Note that we do not need the { and } at the beginning and end of the function because it is only one line long:

> boot.fn=function (data ,index )

+ return (coef(lm(mpg∼horsepower ,data=data ,subset =index)))

> boot.fn(auto ,1:392) (Intercept) horsepower 39.9358610 -0.1578447

The boot.fn() function can also be used in order to create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement. Here we give two examples

> set.seed (1)

> boot.fn(auto ,sample (392 ,392 , replace =T)) (Intercept) horsepower

38.7387134 -0.1481952

> boot.fn(auto ,sample (392 ,392 , replace =T)) (Intercept) horsepower

40.0383086 -0.1596104

These are two one time bootstraps.

Next, we use the boot() function to compute the standard errors of 1,000 bootstrap estimates for the intercept a nd slope terms.

> boot(auto ,boot.fn ,1000) ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :

original bias std. error t1* 39.9358610 0.02972191 0.860007896

t2* -0.1578447 -0.00030823 0.007404467

This indicates that the bootstrap estimate for SE(β0) is 0.86, and that the bootstrap estimate for SE(β1) is 0.00

74. As discussed in Section 3.1.2, standard formulas can be used to compute the standard errors for the regres sion coefficients in a linear model. These can be obtained using the summary() function:

> summary (lm(mpg∼horsepower ,data=Auto))$coef

Estimate Std. Error t value Pr(>|t|) (Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187

horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81

Does this difference indicate a problem with the bootstrap? In fact, it suggests the opposite. Remember that ordinary least square regression (OLS) needs the assumption about common variance

σ2 for all residuals. If this assumption doesn’t work we have overestimated value of σ2

and, thus SE(β0) and SE( β1).

The bootstrap approach does not rely on any of the least square regression assumptions, and so it is li kely giving a more accurate estimate of the standard errors of β0 and β1 than is the

summary() function.

Then we compute the bootstrap standard error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data.

Note that the bootstrap procedure boot is sensitive to the number of random numbers generated before its implementation (set.seed() command).

> boot.fn=function (data ,index )

+ coefficients(lm(mpg∼horsepower +I( horsepower ^2) ,data=data ,

+ subset =index))

> set.seed (1)

> boot(auto ,boot.fn ,1000) ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = auto, statistic = boot.fn, R = 1000)

Bootstrap Statistics :

original bias std. error t1* 56.900099702 6.098115e-03 2.0944855842

t2* -0.466189630 -1.777108e-04 0.0334123802

t3* 0.001230536 1.324315e-06 0.0001208339

Since this model provides a good fit to the data (Figure 3.8), there is now a better correspondence

between the bootstrap estimates and the standard estimates of SE(β0) , SE(β1) and SE(β2) obtained directly by formulae:

> summary (lm(mpg∼horsepower +I(horsepower ^2) ,data=auto))$coef

Estimate Std. Error t value Pr(>|t|) (Intercept) 56.900099702 1.8004268063 31.60367 1.740911e-109

horsepower -0.466189630 0.0311246171 -14.97816 2.289429e-40

I(horsepower^2) 0.001230536 0.0001220759 10.08009 2.196340e-21

Questions 7-9:

Use boot.fn() function to estimate coefficients of cubic model on 200 observations from auto file. What is the estimated intercept?

a. 3.92 b. 4.01 c. 3.91 d. None of above What is the estimated β1?

a. -0.203 b. -0.0187 c. -1.87 d. None of above What is the estimated β3?

a. 0.0000025 b. 0.000025 c. 0.0000023 d. None of above

Questions 10-12:

Calculate the single bootstrap the obtained cubic model using set.seed(20) and number of records 250. What is the estimated intercept?

a. 0.56 b. 5.60 c. 4.68 d. None of above What is the estimated β2?

a. 0.0194 b. 0.0194 c. -0.00041 d. None of above

Questions 13-15:

Calculate 5000 bootstrap estimates for cubic model. Use set.seed(100). What is the estimate for intercept?

a. 6.07 b. 0.604 c. 2.02 d. None of above What is the estimated β3?

a. -0.000203 b. -0.0000212 c. -0.00000215 d. None of above

What is the estimated SE for β1?

a. 1.51 b. 2.63 c. 0.154 d. None of above

## Solution Preview

This material may consist of step-by-step explanations on how to solve a problem or examples of proper writing, including the use of citations, references, bibliographies, and formatting. This material is made available for the sole purpose of studying and learning - misuse is strictly forbidden.

# 1. Validation set approachinstall.packages("ISLR")

library(ISLR)

auto <- read.csv("Auto.csv")

auto <- Auto

attach(auto)

set.seed(1)

train <- sample(392, 196)

lm.fit <- lm(mpg~horsepower, data=auto, subset=train)

attach(auto)

mean((mpg-predict(lm.fit, auto))[-train]^2)

lm.fit2 <- lm(mpg~poly(horsepower, 2), auto, subset=train)

mean((mpg-predict(lm.fit2, auto))[-train]^2)

lm.fit3 <- lm(mpg~poly(horsepower, 3), auto, subset=train)

mean((mpg-predict(lm.fit3, auto))[-train]^2)

# Questions 1-3:

mse <- matrix(0, nrow=100, ncol=3)

i <- 1

while (TRUE) {

train <- sample(392, 196)

lm.fit <- lm(mpg~horsepower, data=auto, subset=train)

mse[i, 1] <- mean((mpg-predict(lm.fit, auto))[-train]^2)

lm.fit2 <- lm(mpg~poly(horsepower, 2), auto, subset=train)

mse[i, 2] <- mean((mpg-predict(lm.fit2, auto))[-train]^2)

lm.fit3 <- lm(mpg~poly(horsepower, 3), auto, subset=train)

mse[i, 3] <- mean((mpg-predict(lm.fit3, auto))[-train]^2)

if (i==100) {

break

} else {

i <- i + 1

}

}...