## Question

Question 1

If you have a dataframe d, then the expression d[1] will return:

A dataframe with one row (which is going to be the first row of d)

A dataframe with a single column (and that column will contain the values from the first column of d)

The values from the first row of d (returned as a vector)

The values from the first column of d (returned as a vector)

Question 2

Suppose you run the following piece of code:

z=4

f=function(x=sqrt(z)) {

print(x)

}

Then somewhere later in your code you are going to use that function (assume z was not touched in between):

f()

z=16

f()

rm(z)

f()

What will be the results of the three function calls shown above?

2,4,2

2,4,4

2,4, Error

2,2,2

Question 3

Suppose you have a vector x of some values (observations). Which of the following commands will calculate the 83% quantile of the data (approximately; do not worry about ties, smoothing between discrete data points, or rounding up vs. rounding down)?

sort(x)[ round(0.83*length(x)) ]

x[ round(0.83*length(x)) ]

sort(x)[x<0.83]

x[ round(0.83*sum(x)) ]

Question 4

Bias of a statistical model is

- The error introduced by training the model on a specific training set and thus fitting the particular realization of the noise in that dataset

- The error introduced by approximating the real-life (usually unknown) dependence of the outcome Y on the explanatory variables X with a simpler model

- The error due to unmeasured/unmeasurable variation in the data

- The average sum of squared differences between observed and predicted values of the outcome Y

Question 5

In the prediction problem, we are trying to learn from existing data with the primary goal of predicting the outcome in new cases. The simplicity/interpretability of the model is valuable, but may be of less importance than the accuracy of the predictions.

True

False

Question 6

In the inference problem we are more interested in understanding the data at hand: finding out which predictors are important, discovering the relationships between the outcome and the predictors, etc. Simplicity/interpretability of the model has higher importance in this setting.

True

False

Question 7

In order to perform initial data exploration/summarization, it is often useful to:

Look at the empirical distributions (histograms) of the variables in the dataset

All of the suggested options

Look at pairwise scatterplots between continuous variables

Look at summary statistics/boxplots of continuous variables, stratified by categorical variables (if any)

Question 8

We say that the model overfits the data when it achieves very small MSE on the training set, but fails to predict well (results in a large MSE) on the test set.

True

False

## Solution Preview


Q1) A dataframe with a single column (second choice)

Q2) 2,4, Error...
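The two previewed answers can be checked with a short R sketch. The data frame `d` below is made-up example data, and the names `first_col` and `result` are hypothetical helpers introduced only for illustration. The sketch also assumes the variable in Question 2 is lowercase `z` throughout, since R is case-sensitive and the default argument refers to `sqrt(z)`.

```r
# Q1: single-bracket indexing with one index selects columns
# and keeps the result as a data frame.
d <- data.frame(a = 1:3, b = c("x", "y", "z"))  # hypothetical example data
first_col <- d[1]
print(class(first_col))  # "data.frame"
print(ncol(first_col))   # 1

# Q2: a default argument is evaluated lazily, when the function is called,
# so f() sees whatever value z has at that moment.
z <- 4
f <- function(x = sqrt(z)) {
  print(x)
}
f()       # prints 2
z <- 16
f()       # prints 4
rm(z)
# With z removed, evaluating sqrt(z) fails; tryCatch records that as "Error".
result <- tryCatch(f(), error = function(e) "Error")
print(result)
```

By contrast, `d[[1]]` or `d[, 1]` would return the first column as a vector, which is the distinction Question 1 tests.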
