## Transcribed Text

1. [Analytical question] Consider two Normally distributed random variables Y1 and Y2 with expected values μ1 and μ2, variances σ1² and σ2², and correlation ρ.
(a) State the joint probability distribution of these random variables. State it twice: once in a non-matrix and the second time in a matrix form. Explain the meaning of each term.
(b) Use Bayes' theorem to derive the conditional probability distributions of Y1|Y2 and of Y2|Y1.
(c) Does the correlation or the parameter of linear regression depend on whether we want to predict Y1 as function of Y2, or Y2 as function of Y1?
(d) Use the derivations above to explain the difference between the coefficient of correlation and the slope of linear regression.
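For reference, the standard results that parts (a) and (b) lead to can be sketched as follows. The sketch assumes Y1 and Y2 are jointly (bivariate) Normal, which the question implies but does not state outright; the lowercase y1, y2 for realized values and the bold vector notation are conventions introduced here, not taken from the question.

```latex
% Joint density, non-matrix form
f(y_1, y_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}}
  \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[
      \frac{(y_1-\mu_1)^2}{\sigma_1^2}
    - \frac{2\rho\,(y_1-\mu_1)(y_2-\mu_2)}{\sigma_1 \sigma_2}
    + \frac{(y_2-\mu_2)^2}{\sigma_2^2}
  \right] \right\}

% Joint density, matrix form
f(\mathbf{y}) = \frac{1}{2\pi\,|\Sigma|^{1/2}}
  \exp\left\{ -\tfrac{1}{2} (\mathbf{y}-\boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{y}-\boldsymbol{\mu}) \right\},
\qquad
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\quad
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}

% Conditional distributions
Y_1 \mid Y_2 = y_2 \sim \mathcal{N}\!\left( \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_2-\mu_2),\; (1-\rho^2)\,\sigma_1^2 \right)
\qquad
Y_2 \mid Y_1 = y_1 \sim \mathcal{N}\!\left( \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(y_1-\mu_1),\; (1-\rho^2)\,\sigma_2^2 \right)
```

Under this assumption the conditional means are linear in the conditioning variable, with slopes ρσ1/σ2 and ρσ2/σ1. The slope therefore changes with the direction of prediction, while ρ is symmetric in Y1 and Y2.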
2. [Analytical question] Consider the following loss functions for error terms ei, i = 1, ..., N in linear regression. For each loss function, (i) state whether it is convex, (ii) provide a mathematical proof, and (iii) explain how it can be useful in the context of linear regression.
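(a) Quadratic loss (related to mean squared error, L2 norm): $L = \sum_{i=1}^{N} e_i^2$
(b) Mean absolute error (L1 norm): $L = \sum_{i=1}^{N} |e_i|$
(c) Huber loss (smooth mean absolute error) with parameter δ: $L = \sum_{i=1}^{N} l(e_i)$, where
$$l(e) = \begin{cases} \frac{1}{2} e^2, & \text{if } |e| \le \delta \\ \delta |e| - \frac{1}{2}\delta^2, & \text{if } |e| > \delta \end{cases}$$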
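The three losses in this question can be written as short functions, which makes their behavior easy to check numerically. This is a minimal sketch in plain Python; the function names are illustrative choices, not part of the assignment.

```python
def quadratic_loss(errors):
    """Sum of squared errors: L = sum_i e_i^2."""
    return sum(e * e for e in errors)


def absolute_loss(errors):
    """Sum of absolute errors: L = sum_i |e_i|."""
    return sum(abs(e) for e in errors)


def huber_term(e, delta):
    """Per-observation Huber loss l(e) with threshold delta."""
    if abs(e) <= delta:
        return 0.5 * e * e                       # quadratic near zero
    return delta * abs(e) - 0.5 * delta * delta  # linear in the tails


def huber_loss(errors, delta):
    """Huber loss: L = sum_i l(e_i)."""
    return sum(huber_term(e, delta) for e in errors)
```

At |e| = δ both branches of `huber_term` give δ²/2, so the function is continuous where the quadratic and linear pieces meet.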
3. [Analytical question] For linear regression Yi = θ0 + θ1Xi + ei, i = 1, ..., N, minimizing squared loss:
(a) Write down the likelihood on the training data, and analytically derive the maximum likelihood solution for parameter estimates.
(b) Calculate the gradient with respect to the parameter vector.
(c) Write down the steps of the (batch) gradient descent rule.
(d) Write down the steps of the stochastic gradient descent rule.
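Parts (b) to (d) can be sketched in code for the squared loss L(θ) = Σ (Yi − θ0 − θ1Xi)². The function names, zero initialization, iteration counts, and division of the batch gradient by N are assumptions made for illustration; the question does not prescribe them.

```python
import random


def gradient(theta0, theta1, xs, ys):
    """Gradient of L = sum_i (y_i - theta0 - theta1 * x_i)^2 w.r.t. (theta0, theta1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - theta0 - theta1 * x  # residual e_i
        g0 += -2.0 * r
        g1 += -2.0 * r * x
    return g0, g1


def batch_gd(xs, ys, alpha=0.01, n_iter=2000):
    """Batch rule: every update uses the gradient over all N observations."""
    theta0 = theta1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0, g1 = gradient(theta0, theta1, xs, ys)
        # Assumption: average the gradient so the step size does not scale with N.
        theta0 -= alpha * g0 / n
        theta1 -= alpha * g1 / n
    return theta0, theta1


def sgd(xs, ys, alpha=0.01, n_epochs=200, seed=0):
    """Stochastic rule: every update uses the gradient of a single observation."""
    rng = random.Random(seed)
    theta0 = theta1 = 0.0
    order = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit observations in a fresh random order each epoch
        for i in order:
            g0, g1 = gradient(theta0, theta1, [xs[i]], [ys[i]])
            theta0 -= alpha * g0
            theta1 -= alpha * g1
    return theta0, theta1
```

On noise-free data such as `xs = [-2, -1, 0, 1, 2]` with `ys = [3 + 2 * x for x in xs]`, both routines should approach θ0 = 3 and θ1 = 2. On noisy data, SGD with a fixed learning rate keeps fluctuating around the least-squares solution rather than settling on it.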
4. [Implementation question]
(a) Overlay graphs of the loss functions in question 2 for a range of e (consider two different values of δ for Huber loss). Use the graph to discuss the relative advantages and disadvantages of these loss functions for linear regression.
(b) Implement gradient descent for the loss functions above.
(c) Implement stochastic gradient descent for the loss functions above.
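One way to approach part (b) is to write gradient descent once and pass in the derivative of the per-observation loss with respect to the error e. The sketch below does this for batch gradient descent only. The names `d_quadratic`, `d_absolute`, `d_huber`, and `fit_batch_gd` are hypothetical, and using the subgradient 0 at e = 0 for the absolute loss is an assumption, since that loss is not differentiable there.

```python
from functools import partial


def sign(e):
    """Return -1, 0, or 1 according to the sign of e."""
    return (e > 0) - (e < 0)


def d_quadratic(e):
    return 2.0 * e                 # d/de of e^2


def d_absolute(e):
    return float(sign(e))          # subgradient of |e|; 0 chosen at e = 0


def d_huber(e, delta):
    # d/de of the Huber term: e inside [-delta, delta], +/- delta outside
    return e if abs(e) <= delta else delta * sign(e)


def fit_batch_gd(xs, ys, d_loss, alpha=0.01, n_iter=5000):
    """Batch gradient descent for y = theta0 + theta1 * x under a chosen loss."""
    theta0 = theta1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            slope_e = d_loss(y - theta0 - theta1 * x)
            # Chain rule: de/dtheta0 = -1 and de/dtheta1 = -x
            g0 -= slope_e
            g1 -= slope_e * x
        theta0 -= alpha * g0 / n   # assumption: averaged gradient
        theta1 -= alpha * g1 / n
    return theta0, theta1
```

A Huber fit with δ = 1 would be `fit_batch_gd(xs, ys, partial(d_huber, delta=1.0))`. With a fixed learning rate, the absolute-loss fit does not converge exactly; it ends up oscillating within a distance of roughly α of the minimizer.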
5. [Implementation question] In this question we will revisit JW Figure 3.3, and empirically evaluate various approaches to fitting linear regression.
(a) Simulate N=50 values of Xi, distributed Uniformly on interval (-2,2). Simulate the values of Yi = 3 + 2Xi + ei, where ei is drawn from N(0, 4). Fit linear regression with squared loss to the simulated data using (i) analytical solution, (ii) batch gradient descent, and (iii) stochastic gradient descent implemented in Question 4. Set learning rate α to a small value (say, α = 0.01).
(b) Repeat (a) 1,000 times, overlay the histograms of the estimates of the slopes, and overlay the true value. Comment on how the choice of the algorithm affects the estimates of the slope parameter.
(c) Simulate N=50 values of Xi, distributed Uniformly on interval (-2,2). Simulate the values of Yi = 3 + 2Xi + ei, where ei is drawn from N(0, 4). Fit linear regression with (i) squared loss with the analytical solution, (ii) mean absolute error with batch gradient descent, and (iii) Huber loss with batch gradient descent implemented in Question 4. Set learning rate α to a small value (say, α = 0.01).
(d) Repeat (c) 1,000 times, overlay the histograms of the estimates of the slopes, and overlay the true value. Comment on how the choice of the loss function in the case of Normal distribution affects the estimates of the slope parameter.
(e) Simulate N=50 values of Xi, distributed Uniformly on interval (-2,2). Simulate the values of Yi = 3 + 2Xi + ei, where ei is drawn from N(0,4). Modify the simulated values of Y to introduce outliers, as follows. With probability 0.1, select an observation for modification. If it is selected, increase its value by 200% with probability 0.5, and decrease its value by 200% with probability 0.5. Fit linear regression to the modified data, with (i) squared loss with the analytical solution, (ii) mean absolute error with batch gradient descent, and (iii) Huber loss with batch gradient descent implemented in Question 4. Set learning rate α to a small value (say, α = 0.01).
(f) Repeat (e) 1,000 times, overlay the histograms of the estimates of the slopes, and overlay the true value. Comment on how the choice of the loss function in presence of outliers affects the estimates of the slope parameter.
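The simulation setup shared by parts (a) to (f) can be sketched as below, using only the analytical least-squares fit; the gradient-descent fits and the histograms are left out. Two readings of the text are assumptions: N(0, 4) is taken to mean variance 4 (standard deviation 2), and changing a value by ±200% is taken to mean y becomes y + 2y = 3y or y − 2y = −y. All function names are illustrative.

```python
import random

TRUE_INTERCEPT, TRUE_SLOPE = 3.0, 2.0


def simulate(n=50, rng=None, noise_sd=2.0):
    """Draw X ~ Uniform(-2, 2) and Y = 3 + 2 X + e with e ~ Normal(0, noise_sd^2).

    Assumption: N(0, 4) in the question denotes variance 4, so noise_sd = 2.
    """
    if rng is None:
        rng = random.Random()
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def add_outliers(ys, rng, p_select=0.1):
    """Modify each value with probability p_select.

    Assumed reading of +/-200%: y -> 3y or y -> -y, each with probability 0.5.
    """
    out = []
    for y in ys:
        if rng.random() < p_select:
            y = 3.0 * y if rng.random() < 0.5 else -y
        out.append(y)
    return out


def ols_fit(xs, ys):
    """Closed-form least-squares estimates (intercept, slope)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def repeat_slopes(n_rep=1000, with_outliers=False, seed=0):
    """Collect the least-squares slope over n_rep independent simulated data sets."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_rep):
        xs, ys = simulate(rng=rng)
        if with_outliers:
            ys = add_outliers(ys, rng)
        slopes.append(ols_fit(xs, ys)[1])
    return slopes
```

The lists returned by `repeat_slopes(with_outliers=False)` and `repeat_slopes(with_outliers=True)` can then be passed to any plotting library to overlay the histograms against the true slope of 2.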