## Question

One last thing: please don't use temporal data (where the observations are taken over different times), and please don't use a dataset where your dependent variable is categorical (including binary).

1. Describe your substantive interest and the general question(s) you would like to answer (e.g., "Does more education cause people to become more liberal?"). Be sure to frame it in such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis.

2. Describe the data set you have found, including its source, its contents, and why it was collected originally.

3. What is your dependent variable? Why are you interested in explaining it? What do you hypothesize are the major factors that influence or cause it?

4. What are your independent variables, and why have you chosen these? Prior to running your regression, what effects do you expect them to have on the dependent variable? Which of these variables do you think affect other independent variables, and how might that affect your final results?

5. Explain and show in detail how you rename and recode the variables you are examining, and what units each is measured in.

6. Before running a multiple regression, run a few bivariate regressions of Y on some of your X variables. What do you infer? Which of these do you think might change with the addition of multiple variables?

7. Run your full multiple regression using lm() and present your results using the output from the stargazer R package. Interpret the coefficients. What do they tell you substantively? Which variables seem to have the biggest substantive impact? Which ones could you actually change with some intervention, and how big a difference do you think that could make?

8. How have any of the coefficients changed from the bivariate regressions? What can you infer from that? How do you think your various independent variables interact and affect each other? Try to find an example where a variable appears significant in the bivariate regression, but not in the full regression. Is this an example of a spurious or a chained causal pathway?

9. How does what you see match, or not, your hypotheses from (4)? Why did/didn't it match what you expected?

10. What do the R² and adjusted R² tell you about your model?

11. How would you use one of the variable selection methods to choose a model with fewer variables? Select one of the methods (either one of the stepwise or criterion-based methods) and show which variables it would lead you to keep. Do you agree with its results?

12. What are your overall conclusions? What are the weaknesses of your results, and how could you improve them with better or different data?

13. Calculations (using R):

a. Derive the coefficients from your regression using the (X'X)⁻¹X'y formula. (If you run into problems using solve(), try using ginv() instead, which does the same thing but is a bit more robust.)

b. For one of the coefficients, confirm its p-value as shown in the regression output using the coefficient, its standard error, and pt() in R.

c. Calculate the R² and adjusted R² using R, and confirm that your results match the regression output.

d. Calculate the F statistic using R and confirm it against the regression output.

14. Add at least one quadratic term into your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in that variable at its mean value?

15. Add at least one interaction term to your model and interpret the results. Is it significant? What is the effect of a 1-unit increase in one of those interacted variables holding the other at its mean value?

16. Test either the model in 14 or the model in 15 using the F test for nested models. That is, estimate the full model with the variable and quadratic term or the variable and interaction, and then estimate the reduced model without either, and run the F test to establish whether those variables significantly improve your model.
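
The calculations in items 13, 14, and 16 can be sketched in R as follows. This is a minimal illustration on simulated data, not the assignment's dataset: the variable names (`x1`, `x2`, `y`) and the data-generating values are assumptions made up for the example. The marginal effect shown for the quadratic term is the instantaneous slope at the mean, which is one common reading of "the effect of a 1-unit increase".

```r
# Simulated data, purely illustrative; not the assignment's dataset.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0.3 * x1^2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)

# 13a. Coefficients from (X'X)^{-1} X'y
X    <- model.matrix(fit)               # design matrix, including the intercept column
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# 13b. Two-sided p-value for x1 from its estimate and standard error
t_x1 <- s$coefficients["x1", "Estimate"] / s$coefficients["x1", "Std. Error"]
p_x1 <- 2 * pt(abs(t_x1), df = df.residual(fit), lower.tail = FALSE)

# 13c. R-squared and adjusted R-squared
k      <- ncol(X) - 1                   # number of predictors, excluding the intercept
rss    <- sum(residuals(fit)^2)         # residual sum of squares
tss    <- sum((y - mean(y))^2)          # total sum of squares
r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# 13d. Overall F statistic
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# 14. Quadratic term and the instantaneous marginal effect of x1 at its mean
full       <- lm(y ~ x1 + I(x1^2) + x2)
b          <- coef(full)
me_at_mean <- b["x1"] + 2 * b["I(x1^2)"] * mean(x1)

# 16. Nested F test: reduced model drops both x1 and its square
reduced  <- lm(y ~ x2)
nested   <- anova(reduced, full)
rss_r    <- sum(residuals(reduced)^2)
rss_f    <- sum(residuals(full)^2)
q        <- 2                           # number of restrictions being tested
f_nested <- ((rss_r - rss_f) / q) / (rss_f / df.residual(full))

# Compare the hand calculations with the built-in output
print(all.equal(as.vector(beta), unname(coef(fit))))
print(all.equal(p_x1, s$coefficients["x1", "Pr(>|t|)"]))
print(all.equal(c(r2, adj_r2), c(s$r.squared, s$adj.r.squared)))
print(all.equal(f_stat, unname(s$fstatistic["value"])))
print(all.equal(f_nested, nested$F[2]))
```

If `solve()` fails because X'X is singular or nearly so, `MASS::ginv()` can be substituted for it in step 13a, as the assignment suggests.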

## Solution Preview


1. Describe your substantive interest and the general question(s) you would like to answer (e.g., “Does more education cause people to become more liberal?”). Be sure to frame it in such a way that you are proposing a hypothesis (or multiple hypotheses) that might be either confirmed or disproven by the results of your analysis.

Using city-level crime data, I would like to examine whether annual police funding per resident shows diminishing returns in keeping the total reported crime rate (per 1 million residents) down, even though its effect is expected to remain positive throughout. This hypothesis implies that the quadratic term for annual police funding has a negative sign, and that term must be statistically significant to support the claim.
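
One way to make this hypothesis concrete is a quadratic specification. The symbols below are illustrative assumptions, not taken from the original solution: $y_i$ is the outcome for city $i$, $x_i$ is annual police funding per resident, and $\mathbf{z}_i$ collects the remaining controls.

$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \mathbf{z}_i^\top \boldsymbol{\gamma} + \varepsilon_i,
\qquad
\frac{\partial\, \mathrm{E}[y_i \mid x_i, \mathbf{z}_i]}{\partial x_i} = \beta_1 + 2\beta_2 x_i
$$

Under this sketch, the marginal effect of funding varies with the level of funding only if $\beta_2 \neq 0$, and a negative $\beta_2$ means that the marginal effect declines as funding rises.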

2. Describe the data set you have found, including its source, its contents, and why it was collected originally.

An editorial review describes the source as follows: “An invaluable compendium of lifestyle factors in 219 ‘micropolitan’ areas: cities with 15,000 to 50,000 residents and their surrounding regions. Each community is graded in terms of its performance in such categories as climate/environment, public safety, health care, economics, recreation, and housing.”

Other available variables are: the reported violent crime rate per 100,000 residents, the percentage of people aged 25 or older with 4 years of high school, the percentage of 16 to 19 year-olds who are neither in high school nor high school graduates, the percentage of 18 to 24 year-olds in college, and the percentage of people aged 25 or older with at least 4 years of college.

The data is obtained from Life In America’s Small Cities by G. S. Thomas....
