Multiple Choice Questions:
Question 1
If you have a dataframe d, then the expression d[1] will return:
A dataframe with one row (which is going to be the first row of d)
A dataframe with a single column (and that column will contain the values from the first column of d)
The values from the first row of d (returned as a vector)
The values from the first column of d (returned as a vector)
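
One way to check your answer is to try the indexing on a small data frame. The sketch below is only an illustration: the data frame and its column names are made up, and double-bracket indexing is included purely for comparison.

```r
# Hypothetical toy data frame; the columns a and b are illustrative only
d <- data.frame(a = 1:3, b = c("x", "y", "z"))

single <- d[1]     # single-bracket indexing with one index
double <- d[[1]]   # double-bracket indexing, shown for comparison

# Inspect what kind of object each expression returns
print(class(single))
print(dim(single))
print(class(double))
```

Comparing the class and dimensions of the two results should make the difference between the answer options concrete.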

Question 2
Suppose you run the following piece of code:
z=4
f=function(x=sqrt(z)) {
  print(x)
}
Then somewhere later in your code you are going to use that function (assume z was not touched in between):
f()
z=16
f()
rm(z)
f()
What will be the results of the three function calls shown above?
2,4,2
2,4,4
2,4, Error
2,2,2
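
The question turns on when R evaluates the expression given as a default argument. A minimal sketch for experimenting with this is shown below; the names w and g are hypothetical and chosen only to keep it separate from the code in the question, and tryCatch is used so the script keeps running whatever the last call does.

```r
# Hypothetical names (w, g), used for illustration only
w <- 9
g <- function(x = sqrt(w)) {
  x  # the default expression is evaluated only when x is first used
}

first <- g()    # call made while w holds its initial value
w <- 25
second <- g()   # call made after w was reassigned
rm(w)
# tryCatch stores the error message instead of stopping the script
third <- tryCatch(g(), error = function(e) conditionMessage(e))

print(list(first, second, third))
```

Running it and comparing the three stored results shows whether the default is fixed when the function is defined or looked up at each call.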

Question 3
Suppose you have a vector x of some values (observations). Which of the following commands will calculate the 83% quantile of the data (approximately; do not worry about ties, smoothing between discrete data points, or rounding up vs. rounding down)?
sort(x)[ round(0.83*length(x)) ]
x[ round(0.83*length(x)) ]
sort(x)[x<0.83]
x[ round(0.83*sum(x)) ]
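
A candidate expression can be sanity-checked by comparing it with R's built-in quantile() function on simulated data. The sketch below does this for the options that return a single value. The sample data, the seed, and the names reference, candidates, and diffs are assumptions made for illustration.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
x <- runif(1000)     # hypothetical sample data

# Reference value from the built-in function (default interpolation type)
reference <- quantile(x, probs = 0.83, names = FALSE)

# Candidate expressions taken from the answer options
candidates <- c(
  sorted_rank   = sort(x)[round(0.83 * length(x))],
  unsorted_rank = x[round(0.83 * length(x))],
  sum_based     = x[round(0.83 * sum(x))]
)

# Difference of each candidate from the reference quantile
diffs <- candidates - reference
print(diffs)
```

A candidate that really approximates the quantile should stay close to the reference if you change the seed or the sample size.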

Question 4
Bias of a statistical model is
- An error introduced by training the model on a specific training set and thus fitting the particular realization of the noise in that dataset
- The error introduced by approximating the real-life (usually unknown) dependence of the outcome Y on the explanatory variables X with a simpler model
- The error due to unmeasured/unmeasurable variation in the data
- The average sum of squared differences between observed and predicted values of the outcome Y

Question 5
In the prediction problem, we are trying to learn from existing data with the primary goal of predicting the outcome in new cases. The simplicity/interpretability of the model is valuable, but may be of less importance than the accuracy of the predictions.
True
False

Question 6
In the inference problem we are more interested in understanding the data at hand: finding out which predictors are important, discovering the relationships between the outcome and the predictors, etc. Simplicity/interpretability of the model has higher importance in this setting.
True
False

Question 7
In order to perform initial data exploration/summarization, it is often useful to:
Look at the empirical distributions (histograms) of the variables in the dataset
All of the suggested options
Look at pairwise scatterplots between continuous variables
Look at summary statistics/boxplots of continuous variables, stratified by categorical variables (if any)
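
The techniques listed in the options can each be tried with a single base R call. The sketch below uses the built-in iris dataset, which is my choice for illustration and is not part of the question.

```r
# iris ships with base R; it stands in for "the dataset" here

# Empirical distribution (histogram) of one continuous variable
h <- hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")

# Pairwise scatterplots between the continuous variables
pairs(iris[, 1:4])

# Boxplots of a continuous variable, stratified by a categorical one
b <- boxplot(Sepal.Length ~ Species, data = iris)

# Numeric summary of every column
print(summary(iris))
```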

Question 8
We say that the model overfits the data when it achieves very small MSE on the training set, but fails to predict well (results in a large MSE) on the test set.
True
False
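
The situation described in the statement can be explored on simulated data by comparing training and test MSE for a simple and a very flexible model. This is a rough sketch: the sine-shaped true relationship, the noise level, the sample sizes, and the polynomial degrees 3 and 15 are all arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulate noisy observations from an assumed true relationship
simulate <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- simulate(30)
test  <- simulate(200)

# Mean squared error of a fitted model on a given dataset
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_simple   <- lm(y ~ poly(x, 3),  data = train)   # low flexibility
fit_flexible <- lm(y ~ poly(x, 15), data = train)   # high flexibility

results <- data.frame(
  model     = c("degree 3", "degree 15"),
  train_mse = c(mse(fit_simple, train), mse(fit_flexible, train)),
  test_mse  = c(mse(fit_simple, test),  mse(fit_flexible, test))
)
print(results)
```

The exact numbers depend on the seed; the point of interest is the gap between the training and test columns for each model.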
