Question

Transcribed Text

Problem 1

The baseball dataset consists of the statistics of 263 players in Major League Baseball in the 1986 season. The dataset (hitters.csv) consists of 20 variables:

Variable    Description
AtBat       Number of times at bat in 1986
Hits        Number of hits in 1986
HmRun       Number of home runs in 1986
Runs        Number of runs in 1986
RBI         Number of runs batted in in 1986
Walks       Number of walks in 1986
Years       Number of years in the major leagues
CAtBat      Number of times at bat during his career
CHits       Number of hits during his career
CHmRun      Number of home runs during his career
CRuns       Number of runs during his career
CRBI        Number of runs batted in during his career
CWalks      Number of walks during his career
League      A factor with levels A (coded as 1) and N (coded as 2) indicating the player's league at the end of 1986
Division    A factor with levels E (coded as 1) and W (coded as 2) indicating the player's division at the end of 1986
PutOuts     Number of put outs in 1986
Assists     Number of assists in 1986
Errors      Number of errors in 1986
Salary      1987 annual salary on opening day, in thousands of dollars
NewLeague   A factor with levels A (coded as 1) and N (coded as 2) indicating the player's league at the beginning of 1987

In this problem, we use Salary as the response variable and the remaining 19 variables as predictors/covariates, which measure the performance of each player in the 1986 season and over his whole career. Write R functions to perform variable selection using best subset selection partnered with BIC (Bayesian Information Criterion):

1) Starting from the null model, apply the forward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the "BIC vs Number of Variables" curve. Present the selected model with the corresponding BIC.

2) Starting from the full model (that is, the one obtained by minimizing the MSE/RSS using all the predictors), apply the backward stepwise selection algorithm to produce a sequence of sub-models iteratively, and select a single best model using the BIC. Plot the "BIC vs Number of Variables" curve. Present the selected model with the corresponding BIC.

3) Are the selected models from 1) and 2) the same?

Problem 2

In this problem, we fit ridge regression on the same dataset as in Problem 1. First, standardize the variables so that they are on the same scale. Next, choose a grid of $\lambda$ values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$, essentially covering the full range of scenarios from the null model containing only the intercept to the least squares fit. For example:

> grid = 10^seq(10, -2, length=100)

1) Write an R function to do the following: for each value of $\lambda$, compute a vector of ridge regression coefficients (including the intercept), stored in a $20 \times 100$ matrix, with 20 rows (one for the intercept plus one for each of the 19 predictors) and 100 columns (one for each value of $\lambda$).

2) To find the "best" $\lambda$, use ten-fold cross-validation to choose the tuning parameter from the previous grid of values. Set a random seed first (e.g., set.seed(1)) so your results are reproducible, since the choice of the cross-validation folds is random. Plot the "Cross-Validation Error versus $\lambda$" curve, and report the selected $\lambda$.

3) Finally, refit the ridge regression model on the full dataset, using the value of $\lambda$ chosen by cross-validation, and report the coefficient estimates. Remark: you should expect that none of the coefficients are zero; ridge regression does not perform variable selection.
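The problem statement does not mandate a particular implementation for the ridge fit. One common route in R is the glmnet package, where alpha = 0 gives the ridge penalty. The following is a minimal sketch only, assuming hitters.csv is laid out as described above with a leading player-name column and numerically coded factors; it is not the purchased solution.

# Sketch of Problem 2 with glmnet (alpha = 0 => ridge penalty).
library(glmnet)

hitters <- read.csv("hitters.csv")[-1]   # drop the leading name column, if present
y <- hitters$Salary
X <- scale(as.matrix(hitters[names(hitters) != "Salary"]))  # standardize predictors

grid <- 10^seq(10, -2, length = 100)

# 1) 20 x 100 coefficient matrix: one column per lambda value.
#    standardize = FALSE because X was already scaled above.
ridge <- glmnet(X, y, alpha = 0, lambda = grid, standardize = FALSE)
coefs <- as.matrix(coef(ridge))          # 20 rows: intercept + 19 predictors
dim(coefs)                               # 20 100

# 2) ten-fold cross-validation to pick lambda
set.seed(1)
cv <- cv.glmnet(X, y, alpha = 0, lambda = grid, nfolds = 10,
                standardize = FALSE)
plot(cv)                                 # CV error versus log(lambda)
best_lambda <- cv$lambda.min

# 3) refit on the full data at the chosen lambda
coef(glmnet(X, y, alpha = 0, lambda = best_lambda, standardize = FALSE))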
Problem 3

In this problem, we revisit the best subset selection problem. Given a response vector $Y = (y_1, \dots, y_n)^T$ and an $n \times p$ design matrix $X = (x_1, \dots, x_n)^T$ with $x_i = (x_{i1}, \dots, x_{ip})^T$, for $1 \le k \le p$ let $\hat\beta_0, \hat\beta$ be the solution to the following sparsity-constrained least squares problem:

$$\min_{\beta_0,\, \beta:\, \|\beta\|_0 = k} \| Y - \beta_0 - X\beta \|_2^2 = \min_{\beta_0,\, \beta:\, \|\beta\|_0 = k} \sum_{i=1}^n \left( y_i - \beta_0 - x_i^T \beta \right)^2.$$

Based on the property $\hat\beta_0 = \bar{y} - \bar{x}^T \hat\beta$, we can center $Y$ and $X$ first to get rid of the intercept, and solve

$$\min_{\beta:\, \|\beta\|_0 = k} \| \tilde{Y} - \tilde{X} \beta \|_2^2,$$

where $\tilde{Y}$ and $\tilde{X}$ represent the centered $Y$ and $X$, respectively. To solve this, we introduce the Gradient Hard Thresholding Pursuit (GraHTP) algorithm. Let $f(\beta) = \| \tilde{Y} - \tilde{X} \beta \|_2^2 / (2n)$ be the objective function.

GraHTP Algorithm.
Input: $\tilde{Y}$, $\tilde{X}$, sparsity $k$, stepsize $\eta > 0$ (Hint: normalize the columns of $\tilde{X}$ to have variance 1).
Initialization: $\beta^0 = 0$, $t = 1$.
repeat
  1) Compute $\tilde\beta^t = \beta^{t-1} - \eta \nabla f(\beta^{t-1})$;
  2) Let $\mathcal{S}^t = \mathrm{supp}(\tilde\beta^t, k)$ be the indices of the entries of $\tilde\beta^t$ with the largest $k$ absolute values;
  3) Compute $\beta^t = \arg\min \{ f(\beta) : \mathrm{supp}(\beta) \subseteq \mathcal{S}^t \}$; set $t = t + 1$;
until convergence, i.e. $\| \beta^t - \beta^{t-1} \|_2 < 10^{-4}$.
Output: $\beta^t$.

1) Write an R function to implement the above GraHTP algorithm.

2) Consider again the baseball dataset in Problem 1 with $n = 263$, $p = 19$. For $k = 1, \dots, p$, use the above function to find the best $k$-sparse model, denoted by $\mathcal{M}_k$. Then use BIC to select a single best model among $\mathcal{M}_1, \dots, \mathcal{M}_p$.

3) Compare your results with those obtained in Problem 1.
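Since the GraHTP pseudocode above is fully specified, it translates almost line for line into R. The sketch below follows those steps directly; the function name grahtp and the default stepsize eta = 0.1 are our own choices, not part of the problem.

# GraHTP: gradient hard thresholding pursuit for k-sparse least squares.
# Y is the raw response, X the n x p design; both are centered (and X
# scaled to variance-1 columns, per the hint) inside the function.
grahtp <- function(Y, X, k, eta = 0.1, tol = 1e-4, max_iter = 1000) {
  Yt <- Y - mean(Y)                    # centered response
  Xt <- scale(X)                       # centered, variance-1 columns
  n <- nrow(Xt); p <- ncol(Xt)
  beta <- rep(0, p)
  for (t in 1:max_iter) {
    grad <- -crossprod(Xt, Yt - Xt %*% beta) / n         # gradient of f
    beta_tilde <- beta - eta * grad                      # 1) gradient step
    S <- order(abs(beta_tilde), decreasing = TRUE)[1:k]  # 2) top-k support
    beta_new <- rep(0, p)
    # 3) least squares restricted to the support S
    beta_new[S] <- solve(crossprod(Xt[, S, drop = FALSE]),
                         crossprod(Xt[, S, drop = FALSE], Yt))
    if (sqrt(sum((beta_new - beta)^2)) < tol) {          # convergence check
      beta <- beta_new
      break
    }
    beta <- beta_new
  }
  beta
}

For part 2), one would call grahtp(Y, X, k) for k = 1, ..., 19, record each fitted model's RSS on the centered data, and apply the same BIC formula as in Problem 1 to choose among $\mathcal{M}_1, \dots, \mathcal{M}_p$.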

Solution Preview


#####################
##### Problem 1 #####
#####################
hitters <- read.csv("hitters.csv")
hitters <- hitters[-1]                 # drop the leading player-name column

y <- hitters$Salary                    # response: 1987 salary
X <- hitters[-19]                      # predictors: all columns except Salary

var.names <- names(X)

n <- nrow(X)
y <- matrix(y, ncol = 1)
X <- as.matrix(X)                      # factors are already coded numerically

######
# 1) #
######
# Built-in alternative (note step() uses AIC unless k = log(n) is supplied):
# M0 <- lm(y ~ 1)
# step(M0, direction = "forward")

# Forward stepwise selection by hand: at step k, add the predictor whose
# inclusion minimizes BIC = n * log(RSS/n) + (k + 1) * log(n).
ones <- matrix(1, nrow = n, ncol = 1)
selected <- c()                        # indices of predictors chosen so far
BICs <- numeric(19)                    # best BIC at each model size
for (k in 1:19) {
  BIC <- Inf
  for (i in 1:19) {
    if (i %in% selected) next          # skip predictors already in the model
    # design matrix: intercept + previously selected predictors + candidate i
    if (length(selected) == 0) {
      X_star <- cbind(ones, X[, i])
    } else {
      X_star <- cbind(ones, X[, selected], X[, i])
    }
    beta <- solve(crossprod(X_star), crossprod(X_star, y))  # OLS coefficients
    e <- y - X_star %*% beta
    rss <- sum(e^2)
    bic_i <- n * log(rss / n) + (k + 1) * log(n)
    if (bic_i < BIC) {                 # keep the best candidate this round
      BIC <- bic_i
      i_star <- i
    }
  }
  BICs[k] <- BIC
  selected <- c(selected, i_star)
}
# ...
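The preview is cut off at this point. As a rough illustration only, assuming the BICs, selected, and var.names objects built above (this is not the contents of the purchased Solution.R), the selection and plotting step of part 1) could look like:

# Hypothetical continuation: pick the model size minimizing BIC,
# plot the "BIC vs Number of Variables" curve, and report the model.
plot(1:19, BICs, type = "b",
     xlab = "Number of Variables", ylab = "BIC",
     main = "Forward Stepwise: BIC vs Number of Variables")
k_best <- which.min(BICs)
points(k_best, BICs[k_best], col = "red", pch = 19)
cat("Selected", k_best, "variables:", var.names[selected[1:k_best]], "\n")
cat("BIC =", BICs[k_best], "\n")

Part 2) mirrors this loop in reverse: start from the full 19-predictor model and, at each step, drop the predictor whose removal yields the lowest BIC.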
