## Question

For this homework your report can be up to 8 pages (instead of the usual 5 pages). Consider the dataset contained in “BODY_FAT.TXT”.

Remember that the final aim of this analysis is to produce a satisfactory regression model for the percentage of body fat based on all or on a subsample of the available predictor variables. A regression model should be satisfactory in three aspects:

• variability explanation;

• diagnostics on the model assumptions;

• parsimony and interpretability.

Consider all 13 predictors in the dataset (12 numerical predictors and the categorical predictor Over45; remember that you DON’T want to include Density as a predictor, since you used it to compute the body fat percentage).

A. Select a model for predicting the body fat percentage, using backward elimination with alpha-to-remove 𝛼𝑅 = 0.10 (on the p-value of the t-test for single terms) starting from the multiple linear regression model that contains all 13 predictors.

Report:

• which predictor you remove at each step;

• the associated t-test p-value at each step;

• the names of the predictors included in the final regression model.

[Hint: use function update to update the model removing predictors.]
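The procedure in question A can also be written as a loop. The following is only a minimal sketch, and it runs on the built-in `mtcars` data as a stand-in (an assumption, since BODY_FAT.TXT is not reproduced here). The names `alpha_remove` and `removed` are illustrative. It assumes every predictor is numeric, so that coefficient names match term names.

```r
# Backward elimination on single-term t-test p-values.
# mtcars is a stand-in for the body fat data (assumption).
alpha_remove <- 0.10
fit <- lm(mpg ~ ., data = mtcars)
removed <- data.frame(predictor = character(), p_value = numeric())

repeat {
  coefs <- summary(fit)$coefficients
  keep <- rownames(coefs) != "(Intercept)"
  p_vals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])

  # Stop when no predictor is left or all p-values are at most alpha.
  if (length(p_vals) == 0 || max(p_vals) <= alpha_remove) break

  # Drop the predictor with the largest p-value and record the step.
  worst <- names(which.max(p_vals))
  removed <- rbind(removed,
                   data.frame(predictor = worst,
                              p_value = unname(p_vals[worst])))
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

removed        # predictor removed and its p-value at each step
formula(fit)   # final model
```

Only one predictor is removed per iteration, because p-values change every time the model is refitted.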

B. For each number of predictors (from 1 to 13 predictors, that means from 𝑝 = 2 to 𝑝 = 14), use best subset selection to select the best model according to the 𝑅𝑆𝑆 (i.e. the model with smallest 𝑅𝑆𝑆).

• Report the matrix with the selected models for each number of predictors.

• Consider the model selected using backward elimination in question A. Is it the best model according to 𝑅𝑆𝑆, among the ones with exactly the same number of predictors?

[Hint: use function regsubsets from the package leaps to select the best model for each 𝑝. Use the argument nvmax=13 in order to compute the best model that contains up to 13 predictors. Use the function summary on its output to produce the matrix with the selected models.]
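As a rough illustration of this hint, the sketch below runs best subset selection on the built-in `mtcars` data (a stand-in, so `nvmax = 10` rather than 13). The vector `backward_vars` is a hypothetical placeholder for whatever backward elimination selected.

```r
library(leaps)

# Best subset selection by RSS; mtcars is a stand-in dataset (assumption).
best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
best_summary <- summary(best)

best_summary$outmat   # "*" marks the predictors in the best model of each size
best_summary$rss      # RSS of the best model for each number of predictors

# Compare a backward-elimination model with the best subset of the same size.
backward_vars <- c("wt", "qsec", "am")   # hypothetical selection
k <- length(backward_vars)
best_k_vars <- names(which(best_summary$which[k, -1]))  # drop intercept column
setequal(backward_vars, best_k_vars)
```

If `setequal` returns `TRUE`, the backward-elimination model is also the smallest-RSS model among those with the same number of predictors.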

C. For each of the 13 models selected by best subset selection based on 𝑅𝑆𝑆, compute:

• adjusted R-squared, $R^2_{adj}$;

• Mallows’ 𝐶𝑝;

• 𝐴𝐼𝐶;

• 𝐵𝐼𝐶.

For each of these four measures of fit:

• produce a plot of its values against 𝑝;

• report which model is the best (explain how you choose it, and report the names of the predictors included).

[Hint: use the function summary on the regsubsets output]
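A possible way to collect and plot the four criteria is sketched below, again on `mtcars` as a stand-in dataset (an assumption). The `summary` output of `regsubsets` provides `adjr2`, `cp` and `bic` but no AIC, so AIC is computed here from the RSS as n·log(RSS/n) + 2p. This is one common Gaussian form, defined up to an additive constant, and it may differ from the definition used in class.

```r
library(leaps)

# mtcars is a stand-in for the body fat data (assumption).
best_summary <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10))

n <- nrow(mtcars)
p <- seq_along(best_summary$rss) + 1          # coefficients incl. intercept
aic <- n * log(best_summary$rss / n) + 2 * p  # AIC up to a constant (assumption)

criteria <- data.frame(p = p,
                       adjr2 = best_summary$adjr2,
                       cp = best_summary$cp,
                       aic = aic,
                       bic = best_summary$bic)

# One panel per criterion, plotted against p.
op <- par(mfrow = c(2, 2))
for (m in c("adjr2", "cp", "aic", "bic")) {
  plot(criteria$p, criteria[[m]], type = "b", xlab = "p", ylab = m)
}
par(op)

# Adjusted R-squared is maximised; the other three are minimised.
best_p <- c(adjr2 = p[which.max(criteria$adjr2)],
            cp    = p[which.min(criteria$cp)],
            aic   = p[which.min(criteria$aic)],
            bic   = p[which.min(criteria$bic)])
best_p
```

BIC penalises each extra coefficient by log(n) rather than 2, so it tends to choose a model no larger than the one chosen by AIC.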

D. Perform model selection using lasso with 10-fold cross-validation. Produce the plot of cross-validation error versus the tuning parameter 𝜆 in order to choose the largest 𝜆 with cross-validation error within one standard error of the minimum.

Report:

• the value chosen for the tuning parameter 𝜆;

• the names of the predictors selected by lasso (i.e. the predictors whose coefficient 𝛽̂𝑙𝑎𝑠𝑠𝑜 ≠ 0).

[Hint: use the function glmnet from the package glmnet with argument alpha=1 to fit lasso, and the function cv.glmnet with argument nfolds=10 to select 𝜆 according to 10-fold cross-validation. The function plot on its output produces the required plot, and the element lambda.1se in its output is the required 𝜆. Use the function coef on the cv.glmnet output with argument s="lambda.1se" to obtain the coefficients 𝛽̂𝑙𝑎𝑠𝑠𝑜.]
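The workflow in this hint might look like the sketch below, shown on `mtcars` as a stand-in dataset (an assumption). The seed value is arbitrary and only makes the random fold assignment reproducible.

```r
library(glmnet)

# glmnet needs a numeric predictor matrix and a response vector.
# mtcars is a stand-in for the body fat data (assumption).
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # drop intercept column
y <- mtcars$mpg

set.seed(1)                                       # arbitrary seed
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10) # lasso, 10-fold CV

plot(cv_fit)          # CV error versus log(lambda), with one-SE bars
cv_fit$lambda.1se     # largest lambda within one SE of the minimum

# Coefficients at lambda.1se; the non-zero ones are the selected predictors.
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
selected
```

Different seeds give different folds, so the chosen 𝜆 and the selected predictors can change slightly between runs.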

E. Evaluate the models selected in points A, C and D using diagnostic plots and checking for multicollinearity (i.e. computing the 𝑉𝐼𝐹). Refine the models as needed, select a satisfactory model for the body fat percentage, and interpret your final model.

If you encounter more than one model that you believe to be satisfactory, and if you see an interest in terms of interpretation in presenting more than one final model, you are allowed and encouraged to do so (but do not present more than two final models).

[Hint: use function plot on lm output to produce diagnostic plots automatically. Use function vif from the package car to compute the 𝑉𝐼𝐹]
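A minimal sketch of these checks follows, using an arbitrary three-predictor model on `mtcars` as a stand-in (an assumption; the candidate models from A, C and D would take its place).

```r
library(car)

# Arbitrary illustrative model on a stand-in dataset (assumption).
fit <- lm(mpg ~ wt + qsec + hp, data = mtcars)

# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Variance inflation factors, one per predictor.
vifs <- vif(fit)
vifs
```

A VIF is never below 1. Values above roughly 5 or 10 are commonly read as a sign of problematic multicollinearity, although the cutoff is a convention rather than a fixed rule.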

Dataset guide

The file “BODY_FAT.TXT” contains data adapted from a dataset posted by Roger W. Johnson (Dept. of Mathematics & Computer Science, South Dakota School of Mines & Technology). Johnson also provided the following references, which might be useful to you:

• Bailey, Covert (1994). Smart Exercise: Burning Fat, Getting Fit, Houghton-Mifflin Co., Boston, pp. 179-186.

• Behnke, A.R. and Wilmore, J.H. (1974). Evaluation and Regulation of Body Build and Composition, Prentice-Hall, Englewood Cliffs, N.J.

• Siri, W.E. (1956), "Gross composition of the body", in Advances in Biological and Medical Physics, vol. IV, edited by J.H. Lawrence and C.A. Tobias, Academic Press, Inc., New York.

• Katch, Frank and McArdle, William (1977). Nutrition, Weight Control, and Exercise, Houghton Mifflin Co., Boston.

• Wilmore, Jack (1976). Athletic Training and Physical Fitness: Physiological Principles of the Conditioning Process, Allyn and Bacon, Inc., Boston.

The dataset is about a sample of 252 men and contains the following variables:

• Density: density of the body, determined from underwater weighing

• SiriBFperc: percentage of body fat, calculated as a function of the Density according to Siri’s equation: (495/Density) – 450.

• Over45: indicator for age group (0: up to 45 years, 1: over 45)

• Weight: weight (lbs)

• Height: height (inches)

• NeckC: neck circumference (cm)

• ChestC: chest circumference (cm)

• AbdomenC: abdomen circumference (cm)

• HipC: hip circumference (cm)

• ThighC: thigh circumference (cm)

• KneeC: knee circumference (cm)

• AnkleC: ankle circumference (cm)

• BicepsC: biceps circumference (cm)

• ForearmC: forearm circumference (cm)

• WristC: wrist circumference (cm)

The most accurate way of calculating the percentage of body fat is the one provided by Siri’s equation (495/Density) – 450, which requires a measurement of Density via weighing under water. This measurement is expensive and impractical. On the other hand, age and all the body measurements listed above are easy to obtain. Thus, we want to understand if we can reliably describe and predict the percentage of body fat using these other variables, through a regression model.

In the dataset, for age we only have a binary indicator separating men below and above 45 years. On the other hand, all the body measurements are continuous variables.

Notice also that body fat percentage is a quantity bound to vary between 0 and 100. This could cause problems when using linear regression models, because we could estimate mean levels or predict values smaller than 0 or larger than 100 over certain ranges of the predictors. However, we can safely use linear regression as long as we stay in ranges of the predictors where the fitted values are well above 0 and well below 100.

Throughout the semester, the “Application” component in each of the homework sets will consist of employing various modeling, inference and diagnostics techniques learned in class on these data – with the final aim of producing a satisfactory regression model for the percentage of body fat based on all or on a subsample of the available predictor variables.

NOTE: DENSITY (THE VARIABLE ON WHICH SIRI’S EQUATION IS BASED) WILL NOT BE USED AS A PREDICTOR.

When preparing the “Application” part of each homework, make sure that:

• It does not exceed 5 pages, including figures and tables.

• The answer to each question is divided in two parts, one devoted to technical details and outputs, and one devoted to interpretation of the results. The former can contain R output (only the relevant part of it!) and technical answers to the question. The latter should resemble a short report you would write for a client, i.e. be concise and informative but not contain technical terms, and not assume the reader has statistical knowledge.

Keep in mind that some erroneous values were detected in this data set:

• Density is given to you so that you can verify whether there were mistakes in the calculation of Percentage of body fat through Siri’s formula.

• There may be some obvious measurement errors in Height, and in other predictor variables.

When performing the analyses required for each homework set, you can remove, or present results with and without, a few units (men) that appear to carry erroneous measurements. Never remove more than 10 units, and always provide an appropriate justification for removing units if you do (report which units you removed).
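One way to look for calculation mistakes is to recompute the percentage from Density with Siri’s equation and flag rows that disagree. The sketch below uses a small toy data frame with made-up values (an assumption, not the real data). The third row is deliberately wrong, and the tolerance of 0.5 percentage points is an arbitrary choice to absorb rounding.

```r
# Toy data with made-up values (assumption); row 3 is deliberately inconsistent.
toy <- data.frame(
  Density    = c(1.0708, 1.0853, 1.0414, 1.0751),
  SiriBFperc = c(12.3, 6.1, 35.3, 10.4)
)

# Siri's equation: body fat percentage = 495 / Density - 450.
toy$recomputed <- 495 / toy$Density - 450
toy$diff <- toy$SiriBFperc - toy$recomputed

# Flag rows where reported and recomputed values differ by more than
# the (arbitrary) tolerance.
suspect <- which(abs(toy$diff) > 0.5)
toy[suspect, ]
```

A flagged row only shows that Density and the reported percentage disagree. It does not tell you which of the two is wrong, so any removal still needs a justification.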

Remember that a regression model should be satisfactory in three aspects:

• variability explanation;

• diagnostics on the model assumptions;

• parsimony and interpretability.

## Solution Preview


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(caret)
library(leaps)
library(modelr)
library(broom)
library(locfit)
library(ggplot2)
library(glmnet)
library(car)
```

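```{r, echo=FALSE}
# Read the data; the file name matches the one given in the question.
df <- read.table("BODY_FAT.TXT", header = TRUE)
# Density must not be used as a predictor, so drop it.
df$Density <- NULL
# Bookkeeping for backward elimination: removed predictors and their p-values.
predictors_removed <- c()
predictors_p_value <- c()
```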

## A

```{r, results='hide'}
# Full model with all 13 predictors (Density was dropped above).
fit <- lm(SiriBFperc ~ ., data = df)
summary(fit)

# Step 1: remove Over45 (p-value 0.9051 > 0.10).
predictors_removed <- c(predictors_removed, "Over45")
predictors_p_value <- c(predictors_p_value, 0.9051)
fit <- update(fit, . ~ . - Over45)
summary(fit)

# Step 2: remove ChestC (p-value 0.8625).
predictors_removed <- c(predictors_removed, "ChestC")
predictors_p_value <- c(predictors_p_value, 0.8625)
fit <- update(fit, . ~ . - ChestC)
summary(fit)

# Step 3: remove KneeC (p-value 0.5643).
predictors_removed <- c(predictors_removed, "KneeC")
predictors_p_value <- c(predictors_p_value, 0.5643)
fit <- update(fit, . ~ . - KneeC)
summary(fit)

# Step 4: remove AnkleC (p-value 0.4926).
predictors_removed <- c(predictors_removed, "AnkleC")
predictors_p_value <- c(predictors_p_value, 0.4926)
fit <- update(fit, . ~ . - AnkleC)
summary(fit)

# Step 5: remove Height (p-value 0.3403).
predictors_removed <- c(predictors_removed, "Height")
predictors_p_value <- c(predictors_p_value, 0.3403)
fit <- update(fit, . ~ . - Height)
summary(fit)

# Step 6: remove BicepsC (p-value 0.23043).
predictors_removed <- c(predictors_removed, "BicepsC")
predictors_p_value <- c(predictors_p_value, 0.23043)
fit <- update(fit, . ~ . - BicepsC)
summary(fit)
```

...
