Question 1 (25 points)
The California school data set from class actually contains the results of two tests: a math test and a reading test. You regress the average reading score (readscr) in each school district on the percentage of students in the district who have English as a second language (elpct) and obtain:
$$\widehat{\text{readscr}} = \underset{(0.96)}{666.93} - \underset{(0.03)}{0.76}\,\text{elpct}, \qquad R^2 = 0.4765, \quad s_{\hat{u}} = 14.57$$
Heteroskedasticity-robust standard errors are displayed in parentheses beneath each coefficient. You can assume that OLS assumptions 2 and 3 are satisfied.
(a) Do you think that assumption 1 is satisfied for the proposed regression model? Argue.
(b) Interpret the estimate of the slope coefficient $\beta_1$. Does this estimate capture a causal effect?
(c) If you were given another sample, would you expect to find a negative point estimate of $\beta_1$?
(Hint: First, try to generalize from the sample to the population. Then, answer the question.)
(d) Suppose you want to test for a positive relationship between the two variables at the 5% significance level. Can you use a symmetric 95% confidence interval to reach a conclusion about this test? If yes, construct the confidence interval and make a decision. If not, state the null and alternative hypotheses you would use, compute the t-statistic, and draw a graph to show how you would calculate the p-value.
(e) A policy-maker comes to you and asks you to evaluate the effects of a policy change which would increase elpct by 10 percent in each school district. Assume that the estimated coefficient reflects a causal effect and provide the policy-maker with a range of predictions for the change in average reading score resulting from the policy change. Choose the range so that you have high confidence that you are giving the right answer.
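The arithmetic behind parts (d) and (e) can be sketched in R. This is only an illustration under stated assumptions: the one-sided test is taken to be $H_0: \beta_1 \le 0$ against $H_1: \beta_1 > 0$, the large-sample normal approximation is used, "high confidence" is read as 95%, and the policy change is read as a 10-unit increase in elpct. The variable names are hypothetical.

```r
# Reported estimates from the regression of readscr on elpct.
b1  <- -0.76   # estimated slope on elpct
se1 <- 0.03    # heteroskedasticity-robust standard error of the slope

# Part (d): one-sided test, ASSUMING H0: beta1 <= 0 against H1: beta1 > 0.
t_stat  <- (b1 - 0) / se1
# p-value = area to the right of t_stat under the standard normal density.
p_value <- pnorm(t_stat, lower.tail = FALSE)

# Part (e): 95% confidence interval for beta1 (normal approximation),
# then scaled by an ASSUMED 10-unit increase in elpct.
ci_beta1  <- b1 + c(-1, 1) * qnorm(0.975) * se1
ci_effect <- 10 * ci_beta1

print(c(t_stat = t_stat, p_value = p_value))
print(ci_beta1)
print(ci_effect)
```

A different confidence level only changes the quantile passed to qnorm.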
Question 2 (25 points)
Suppose you want to investigate the relationship between health and health care expenditure using the model:
$H_i = \beta_0 + \beta_1 E_i + u_i$
where $H_i$ is the health of individual $i$ (measured on a continuous scale from 1 to 10) and $E_i$ is the expenditure on health care by individual $i$ (measured in dollars). You have access to an observational dataset put together by the U.S. Health Department.
(a) Interpret the parameter of interest $\beta_1$. Is it a causal effect?
(b) Is OLS assumption 1 (i.e. $E[u_i \mid E_i] = 0$) likely to hold in this setting? Argue.
(c) Suppose assumption 1 does not hold (i.e. $E[u_i \mid E_i] \neq 0$). What would be the resulting problem? Is there anything you could do to mitigate it?
(d) Suppose now that the true relationship between health and health care expenditure is given by:
$H_i = \beta_0 + \beta_1 E_i + \beta_2 W_i + \varepsilon_i$
where $W_i$ is the wealth of individual $i$ and $E[\varepsilon_i \mid E_i, W_i] = 0$. Provide a convincing argument for what the sign of the omitted variable bias would be if you were to omit $W_i$ from the regression.
(e) The U.S. Health Department is very concerned about understanding the causal effect of health care provisions (that is, individual expenditure on health care) on health, and has decided to invest some funds to run an experiment. You have been hired as a consultant to design the experiment. How would you set it up? Please describe each step carefully.
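For part (d), the standard omitted variable bias formula may help organize the argument. As a sketch, assume a large sample and let $\hat{\beta}_1$ denote the OLS slope from the regression of $H_i$ on $E_i$ alone (this notation is introduced here for illustration):

```latex
\hat{\beta}_1 \;\xrightarrow{\,p\,}\; \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(E_i, W_i)}{\operatorname{Var}(E_i)}
```

Under this formula the sign of the bias is the sign of $\beta_2$ multiplied by the sign of $\operatorname{Cov}(E_i, W_i)$, so the argument reduces to signing those two quantities.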
Question 3 (25 points)
You have data from 220 school districts in Massachusetts on $\text{score}_i$, the average test score for 4th graders in district $i$; $\text{expreg}_i$, district education expenditures per pupil for district $i$ (in 1000s); $\text{stratio}_i$, the average student-teacher ratio in district $i$; $\text{salary}_i$, the average teacher salary in district $i$ (in 1000s); $\text{english}_i$, the percent of English learners in district $i$; and $\text{bf}_i$, the percent of students qualifying for free breakfast in district $i$. First, you regress $\text{score}_i$ on $\text{expreg}_i$ and obtain:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   695.622       5.368  129.584    <2e-16 ***
expreg          3.084       1.145    2.694   0.00761 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Then, you regress $\text{score}_i$ on $\text{expreg}_i$, $\text{stratio}_i$, $\text{salary}_i$, $\text{english}_i$ and $\text{bf}_i$ and obtain:
             Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)  (missing)    10.21461   68.429    <2e-16 ***
expreg         0.93915     1.14041    0.817   0.41507
stratio       -0.50777     0.36471   -1.392   0.16548
salary         0.72634     0.26377    2.754   0.00647 **
english       -0.35965     0.30413   -1.183   0.23847
bf            -0.68748     0.06004  -11.449    <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(a) Interpret the estimated coefficient of $\text{expreg}_i$ in both models. Is it likely to be a causal effect?
(b) Give a brief and intuitive explanation of why $\text{expreg}_i$ has a significant positive effect on test scores in the first set of estimates but is insignificant in the second set of estimates.
(c) Do you think the first model suffers from OVB? If so, is it under-estimating or over-estimating the effect of expreg on score?
(d) Which model is better to describe the sample at hand? Why?
(e) A local politician sees the results from the second regression and points out that the coefficient on bf is negative and statistically significant. He interprets it as evidence that making the eligibility criteria for free breakfast stricter (i.e. reducing the percentage of students who qualify for free breakfast) would increase the average test score. What could go wrong with this reasoning?
Question 4 (25 points)
In class, we discussed the implications for estimation and inference when the error terms are heteroskedastic. In this question, you are going to investigate these implications further with a simulation using R. We are going to consider two models:
$Y_i = \beta_0 + \beta_1 X_i + u_i$, where $u_i \sim N(0, 2)$ (Model 1)
$Y_i = \beta_0 + \beta_1 X_i + \nu_i$, where $\nu_i \sim N\left(0, (x_i - 0.5)^2\right)$ (Model 2)
In both models we have that $\beta_0 = 2$, $\beta_1 = 5$, $X_i \sim N(1, 4)$, where $N$ stands for a normal distribution.
(a) Discuss whether the OLS estimators of $\beta_0$ and $\beta_1$ are consistent in model 1 and in model 2.
(b) [...] in model 1? And in model 2? Explain.
(c) Draw a sample of size n = 30 and produce a scatter plot of Y and X for each model.
(d) For each model, plot the sampling distribution of the OLS estimator for the slope coefficient $\beta_1$. Assume sample size n = 30 and run 1000 simulations for each model.
(Hint: For each model, draw a sample, compute the OLS estimator, store it, and repeat.)
(e) Based on your results from part (d), is the sampling distribution of the OLS estimator different across the two models? What changes? Why?
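The loop described in the hint for part (d), together with the scatter plots of part (c), could be sketched in base R as follows. This is a minimal sketch and not a model answer. It assumes that $N(a, b)$ denotes mean $a$ and variance $b$ (so $X_i$ has standard deviation 2), that the Model 1 error variance is 2, and that the Model 2 error variance is $(x_i - 0.5)^2$. The function names draw_sample and simulate_slopes and the seed are hypothetical choices.

```r
set.seed(1)  # arbitrary seed, only so that the draws are reproducible

# Draw one sample of size n from Model 1 or Model 2.
# ASSUMPTIONS: N(a, b) means mean a and VARIANCE b; Model 1 errors have
# variance 2; Model 2 errors have standard deviation |x - 0.5|.
draw_sample <- function(n, model) {
  x <- rnorm(n, mean = 1, sd = 2)
  err_sd <- if (model == 1) rep(sqrt(2), n) else abs(x - 0.5)
  y <- 2 + 5 * x + rnorm(n, mean = 0, sd = err_sd)
  data.frame(x = x, y = y)
}

# Hint for part (d): draw a sample, compute the OLS slope, store it, repeat.
simulate_slopes <- function(n_sims, n, model) {
  replicate(n_sims, coef(lm(y ~ x, data = draw_sample(n, model)))[["x"]])
}

slopes1 <- simulate_slopes(1000, 30, model = 1)
slopes2 <- simulate_slopes(1000, 30, model = 2)

par(mfrow = c(2, 2))
# Part (c): scatter plot of one sample of size 30 from each model.
plot(y ~ x, data = draw_sample(30, 1), main = "Model 1 sample")
plot(y ~ x, data = draw_sample(30, 2), main = "Model 2 sample")
# Part (d): simulated sampling distribution of the slope estimator.
hist(slopes1, main = "Model 1 slope estimates", xlab = "OLS slope")
hist(slopes2, main = "Model 2 slope estimates", xlab = "OLS slope")
```

Comparing mean(slopes1) and sd(slopes1) with their Model 2 counterparts gives a numerical summary to accompany the histograms.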
Bonus Question (25 points)
This bonus question is an extension of Question 4. Suppose you want to test the following hypotheses:
$$\begin{cases} H_0: \beta_1 = 0 \\ H_1: \beta_1 \neq 0 \end{cases}$$
(a) Describe the steps you would follow to come up with a conclusion to the test.
(b) Assume sample size n = 30. Run 2000 simulations of the test you described in point (a), using first homoskedastic standard errors and then heteroskedasticity-robust standard errors. Save the results for each test. Do it for both Model 1 and Model 2.
Hint: Use the R command lm to run each regression and access the p-value. If your regression output is saved in the variable reg, you can access the homoskedastic p-value for the slope with summary(reg)$coefficients[2,4] and the heteroskedasticity-robust p-value for the slope with coeftest(reg,df=Inf,vcov=vcovHC(reg,type="HC1"))[2,4] (coeftest is provided by the lmtest package and vcovHC by the sandwich package).
(c) For each model, produce a table reporting the percentage of times you reject the null hypothesis when you use homoskedastic standard errors and the percentage of times you reject the null hypothesis when you use heteroskedasticity-robust standard errors.
(d) Comment on the results obtained in point (c). Are they in line with what you expected? Why/why not?
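One way to organize the simulation and the table is sketched below. To keep the sketch free of package dependencies, the HC1 covariance matrix is computed by hand in base R instead of calling coeftest and vcovHC as in the hint. It is meant to mirror that call (HC1 with a normal reference distribution, df = Inf), but that equivalence is an assumption of this sketch. The data-generating assumptions are the same as stated for Question 4 (variance parametrization, Model 1 error variance 2, Model 2 error variance $(x_i - 0.5)^2$), the 5% level is assumed, and all function names are hypothetical.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# ASSUMPTIONS: N(a, b) means mean a and VARIANCE b; Model 1 errors have
# variance 2; Model 2 errors have standard deviation |x - 0.5|.
draw_sample <- function(n, model) {
  x <- rnorm(n, mean = 1, sd = 2)
  err_sd <- if (model == 1) rep(sqrt(2), n) else abs(x - 0.5)
  data.frame(x = x, y = 2 + 5 * x + rnorm(n, mean = 0, sd = err_sd))
}

# Two-sided p-values for H0: slope = 0 from a single sample.
slope_pvalues <- function(dat) {
  reg <- lm(y ~ x, data = dat)
  # Homoskedasticity-only p-value (t distribution), as in the hint.
  p_homo <- summary(reg)$coefficients[2, 4]
  # Hand-computed HC1 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1 * n/(n-k).
  X <- model.matrix(reg)
  e <- resid(reg)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))
  meat  <- crossprod(X * e)   # row i of X scaled by e_i, then X'X
  v_hc1 <- n / (n - k) * bread %*% meat %*% bread
  z <- coef(reg)[[2]] / sqrt(v_hc1[2, 2])
  p_robust <- 2 * pnorm(-abs(z))  # normal reference distribution (df = Inf)
  c(homoskedastic = p_homo, robust = p_robust)
}

# Percentage of simulations in which H0 is rejected at level alpha.
rejection_rates <- function(n_sims, n, model, alpha = 0.05) {
  p <- replicate(n_sims, slope_pvalues(draw_sample(n, model)))
  100 * rowMeans(p < alpha)
}

results <- rbind(
  "Model 1" = rejection_rates(2000, 30, model = 1),
  "Model 2" = rejection_rates(2000, 30, model = 2)
)
print(results)  # rows: models; columns: type of standard error
```

The matrix results has one row per model and one column per type of standard error, which matches the layout requested in point (c).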