If you have a dataframe d, then the expression d will return:
A dataframe with one row (which is going to be the first row of d)
A dataframe with a single column (and that column will contain the values from the first column of d)
The values from the first row of d (returned as a vector)
The values from the first column of d (returned as a vector)
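The four outcomes listed above correspond to different ways of indexing a data frame in R. The sketch below is only an illustration on a made-up data frame (the name `d` is reused from the question; the columns `a` and `b` are assumptions), and it does not claim to reproduce the exact expression the question asked about.

```r
# Hypothetical data frame with two numeric columns
d <- data.frame(a = 1:3, b = c(10, 20, 30))

one_row    <- d[1, ]          # data frame with one row (the first row of d)
one_column <- d[1]            # data frame with a single column (the first column of d)
row_values <- unlist(d[1, ])  # values of the first row, as a (named) vector
col_values <- d[, 1]          # values of the first column, as a vector
col_again  <- d[[1]]          # same values, extracted with list-style indexing

str(one_row)
str(one_column)
str(row_values)
str(col_values)
```

Calling `str()` on each result is a quick way to check whether you got a data frame or a plain vector back.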
Suppose you run the following piece of code:
Then somewhere later in your code you are going to use that function (assume Z was not touched in between):
What will be the results of the three function calls shown above?
Suppose you have a vector x of some values (observations). Which of the following commands will calculate the 83% quantile of the data (approximately; do not worry about ties, smoothing between discrete data points, or rounding up vs. rounding down)?
x[ round(0.83*length(x)) ]
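As a rough sketch of the idea behind this question: an empirical quantile is a value taken from the ordered data, so an index-based approach only works if the vector is sorted first. Indexing an unsorted vector simply returns whatever element happens to sit at that position. The data below are made up for illustration, and the names `p`, `manual` and `builtin` are assumptions.

```r
set.seed(1)
x <- sample(1:100)   # hypothetical data: the integers 1..100 in random order

p <- 0.83

# Sort first, then take the value about 83% of the way through the ordered data
manual <- sort(x)[round(p * length(x))]

# Built-in estimate; by default it interpolates between neighbouring order statistics
builtin <- quantile(x, probs = p, names = FALSE)

print(manual)
print(builtin)
```

The two results need not be identical, because `quantile()` interpolates, but they should be close.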
Bias of a statistical model is
- The error introduced by training the model on a specific training set and thus fitting the particular realization of the noise in that dataset
- The error introduced by approximating real-life (usually unknown) dependence of outcome Y on the explanatory variables X with a simpler model
- The error due to unmeasured/unmeasurable variation in the data
- The average sum of squared differences between observed and predicted values of the outcome Y
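The distinction between these kinds of error can be made concrete with a small simulation. The sketch below assumes a made-up "true" dependence `f(x) = x^2`, fits a straight line to many independently drawn training sets, and looks at the predictions at one point `x0`. All names and settings here (`f`, `x0`, `n_sets`, the noise level) are illustrative assumptions.

```r
set.seed(1)
f  <- function(x) x^2   # assumed true dependence of Y on X
x0 <- 0.9               # point at which the predictions are compared
n_sets <- 500           # number of simulated training sets

pred <- numeric(n_sets)
for (i in seq_len(n_sets)) {
  x <- runif(50, -1, 1)
  y <- f(x) + rnorm(50, sd = 0.1)   # noisy observations
  fit <- lm(y ~ x)                  # deliberately too simple: a straight line
  pred[i] <- predict(fit, newdata = data.frame(x = x0))
}

# Systematic error of the simple model at x0, averaged over training sets
bias <- mean(pred) - f(x0)

# Spread of the prediction caused by the particular training set drawn
variance <- var(pred)

print(c(bias = bias, variance = variance))
```

In this setup the straight line cannot follow the curve, so the average prediction stays far from `f(x0)` no matter how many training sets are drawn, while the spread across training sets is comparatively small.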
In the prediction problem, we are trying to learn from existing data with the primary goal of predicting the outcome in new cases. The simplicity/interpretability of the model is valuable, but may be of less importance than the accuracy of the predictions.
In the inference problem we are more interested in understanding the data at hand: finding out which predictors are important, discovering the relationships between the outcome and the predictors, etc. Simplicity/interpretability of the model has higher importance in this setting.
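The same fitted model can serve either goal; what differs is which output you look at. The sketch below uses R's built-in `mtcars` dataset purely as an example (the choice of dataset, the predictors `wt` and `hp`, and the new car's values are assumptions).

```r
# Fit one linear model on a built-in example dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Inference: inspect coefficient estimates, standard errors and p-values
coef_table <- coef(summary(fit))
print(coef_table)

# Prediction: apply the model to a new, hypothetical case
new_car <- data.frame(wt = 3.0, hp = 110)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```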
In order to perform initial data exploration/summarization, it is often useful to:
Look at the empirical distributions (histograms) of the variables in the dataset
All of the suggested options
Look at pairwise scatterplots between continuous variables
Look at summary statistics/boxplots of continuous variables, stratified by categorical variables (if any)
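Each of the exploration steps listed above maps onto a one-line base R call. The sketch below uses the built-in `iris` dataset only as a stand-in for your own data.

```r
# Empirical distribution of a single variable
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")

# Pairwise scatterplots between the continuous variables
pairs(iris[, 1:4])

# Boxplots of a continuous variable, stratified by a categorical one
boxplot(Sepal.Length ~ Species, data = iris)

# Summary statistics per category
by_species <- tapply(iris$Sepal.Length, iris$Species, summary)
print(by_species)
```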
We say that the model overfits the data when it achieves very small MSE on the training set, but fails to predict well (results in a large MSE) on the test set.
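A minimal sketch of how one might check for this, assuming simulated data from a made-up curve and two polynomial fits of different flexibility (the names `f`, `train`, `test`, `mse` and the degrees 3 and 12 are illustrative choices):

```r
set.seed(42)
f <- function(x) sin(2 * pi * x)   # assumed true dependence

# Small training set and a larger, independent test set
train <- data.frame(x = runif(20))
train$y <- f(train$x) + rnorm(20, sd = 0.3)
test <- data.frame(x = runif(200))
test$y <- f(test$x) + rnorm(200, sd = 0.3)

# Mean squared error of a fitted model on a given dataset
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_simple   <- lm(y ~ poly(x, 3),  data = train)   # moderately flexible
fit_flexible <- lm(y ~ poly(x, 12), data = train)   # very flexible for 20 points

results <- rbind(
  simple   = c(train = mse(fit_simple, train),   test = mse(fit_simple, test)),
  flexible = c(train = mse(fit_flexible, train), test = mse(fit_flexible, test))
)
print(results)
```

The more flexible fit can never have a larger training MSE than the simpler one here, since the degree-3 polynomials are a special case of the degree-12 ones. Whether its test MSE is worse depends on the data drawn, but with this few training points it typically is.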
Q2) 2,4, Error...