SET 2

## Question 1: Application

The data in "dat_40_data.RData" are sampled from "pew_data.RData" using the code shown at the end of the problem set. The file "pew_data.RData" is constructed from the 2017 Pew Research Center Science and NewsSurvey, downloaded from https://www.journalism.org/datasets/2018/ on 4/16/2019. The site https://www.pewresearch.org/download-datasets/ has lots of interesting information. Note that some analyses of the surveys may require use of the weights provided with the data set.

The variable "LIFE" holds the responses to the question

LIFE
In general, would you say life in America today is better, worse or about the same as it was 50 years ago for people like you?

1 Better today

2 Worse today

3 About the same as it was 50 years ago

The variable "PPEDUCAT" holds the responses to the question

PPEDUCAT Education (Categorical)

-2 Not asked

-1 Refused

1 Less than high school

2 High school

3 Some college

4 Bachelors degree or higher

The variable PPINCIMP holds the responses to the question

PPINCIMP Household income

-2 Not asked

-1 Refused

1 Less than \$5,000

2 \$5,000 to \$7,499

3 \$7,500 to \$9,999

4 \$10,000 to \$12,499

5 \$12,500 to \$14,999

6 \$15,000 to \$19,999

7 \$20,000 to \$24,999

8 \$25,000 to \$29,999

9 \$30,000 to \$34,999

10 \$35,000 to \$39,999

11 \$40,000 to \$49,999

12 \$50,000 to \$59,999

13 \$60,000 to \$74,999

14 \$75,000 to \$84,999

15 \$85,000 to \$99,999

16 \$100,000 to \$124,999

17 \$125,000 to \$149,999

18 \$150,000 to \$174,999

19 \$175,000 to \$199,999

20 \$200,000 to \$249,999

21 \$250,000 or more

The variable "cell.id" gives the "PPEDUCAT" response followed by the "LIFE" response.

Please run a two way ANOVA of the income variable, PPINCIMP on the the LIFE variable, the PPEDUCAT variable and their interaction. Provide an interpretation of the results, including an assessment of the extent to which the assumptions of ANOVA as a test of equality of means are satisfied.

Visual data exploration:

```{r}
load("dat_40_data.RData")
dat.40$LIFE<-factor(dat.40$LIFE)
dat.40$PPEDUCAT<-factor(dat.40$PPEDUCAT)

```

Optionally, quantitative assessments can be done.

```{r}

```

```{r}

```

## Question 2: Type 2 Error

## Numerical Experiments

### Data output

Please create a .RData file with a data frame with the columns given below, and one row of data, generated in the questions, replacing the NAs. Please name the data frame and the data file with your last name, adjusted as necessary to be a valid variable name. For example, mine is sim.durso, eventually saved in durso.csv. We will bind all these data frames together to get a clearer picture of the range of type 2 error rates for each of the scenarios.

```{r}
sim.durso<-data.frame("normal"=NA,"unif"=NA,"gamma12"=NA,"gamma22"=NA)
```

The code below estimates the type 2 error rate (rejecting the null hypothesis when it's true) of ANOVA when there are 4 groups, each of size 10, with two groups consisting of iid Normal(1,1) values, and two consisting of iid Normal(0,1) with a cutoff of p=.05. Please run this code, uncommenting as necessary, with your own seed, and save the value in sim.your_name\$normal. Repeat for

* two groups with mean 1 and error terms drawn from $\sqrt {12}\cdot Uniform(-0.5,0.5)$ and two groups with mean 0 and error terms drawn from $\sqrt {12}\cdot Uniform(-0.5,0.5)$ (10 points)

* two groups with mean 1 and error terms drawn from $2\left( Gamma(shape=1,rate=2)-\frac{1}{2}\right)$ and two groups with mean 0 and error terms drawn from $2\left( Gamma(shape=1,rate=2)-\frac{1}{2}\right)$ (10 points)

* two groups with mean 1 and error terms drawn from $\sqrt{2}\left( Gamma(shape=2,rate=2)-1\right)$ and two groups with mean 0 and error terms drawn from $\sqrt{2}\left( Gamma(shape=2,rate=2)-1\right)$ (10 points)

```{r, cache=TRUE}
# n<-10
# K<-4
# a<-.05
# N<-10000
# class<-rep(letters[1:K],each=n)
# accept<-function(n,K,class,a){
# y<-rnorm(n*K)+rep(0:1,c(2*n,2*n))
# m<-aov(y~class)
# return(summary(m)[[1]]$Pr[1]>a)
# }
#
# set.seed(661997) #Please use a seed that represents a date that's specific to you, NOT this one.
#
# sim.durso$normal<-mean(replicate(N,accept(n,K,class,.05))) #Use your data frame.
#
# accept<-function(n,K,class,a){
# # your code here
# }
#
# sim.durso$unif<-mean(replicate(N,accept(n,K,class,.05)))
#
# accept<-function(n,K,class,a){
# # your code here
# }
#
# sim.durso$gamma12<-mean(replicate(N,accept(n,K,class,.05)))
#
# accept<-function(n,K,class,a){
# # your code here
# }

# sim.durso$gamma22<-mean(replicate(N,accept(n,K,class,.05)))
# write.csv(sim.durso,file="durso.csv")
#
# sim.durso<-read.csv("durso.csv")

```

## Question 3: Comparing ANOVA and regression results (10 points)

As with t tests and ANOVA, equivalently-specified regression and ANOVA models using the same data will results in the same conclusions being drawn. Using the same data from Question 1, we can run equivalently-specified ANOVA and linear regression models. For this question, the only predictor used to predict PPINCIMP will be PPEDUCAT.

One-way ANOVA results:
```{r}
summary(aov(PPINCIMP~PPEDUCAT,data=dat.40))
```

Equivalent linear regression results:
```{r}
regmodel<-lm(PPINCIMP~PPEDUCAT,data=dat.40)
summary(regmodel)
```

For this question, compare the output of these two models. What aspects of the output demonstrate the equivalence of these two models (hint: there are at least three)? What is the relationship between the PPEDUCAT estimate from the ANOVA model and the (Intercept), PPEDUCAT2, PPEDUCAT3, and PPEDUCAT4 estimates from the regression model?

## Construction of application data

This is just in case you're curious about the process.

```{r eval=FALSE}
load("pew_data.RData")

```

Sample 40 from each cell

```{r eval=FALSE}
dat.trim$cell.id<-factor(dat.trim$PPEDUCAT):factor(dat.trim$LIFE)
cells<-split(dat.trim,dat.trim$cell.id)
sampler<-function(x,n){
p<-x$weight/sum(x$weight)
ind<-1:nrow(x)
ind<-sample(ind,n,prob = p,replace = TRUE)
return(x[ind,])
}
set.seed(56789)
samp<-lapply(cells,sampler,n=40)
dim(samp[[1]])

sampl<-suppressWarnings(bind_rows(samp))
dat.40<-select(sampl,PPINCIMP,PPEDUCAT,LIFE,cell.id)
save(dat.40,file="dat_40_data.RData")

```
SET 3

## Question 1

Is type of fish consumption ("fishpart") independent of "fisherman" classification?

```{r}

```

## Question 2

Are the data consistent with the null hypothesis that the mean of total mercury is equal in the fisherman and non-fisherman population?

```{r}

```

## Question 3

* Please fit a regression model "TotHg"" on the numeric variables "fishmlwk" and "weight" and the categorical variable "fishpart".

* Please display a pairs-type plot for these variables.

* Also assess the diagnostic plots for the model.

* Would you consider removing any data points?

```{r}

```

## Question 4

* Please fit a regression model "log(TotHg)" on the numeric variables "fishmlwk" and "weight" and the categorical variable "fishpart".

* Please display a pairs-type plot for these variables.

* Assess the diagnostic plots for the model.

* Would you consider removing any data points?

* How does this model compare the the model in Question 4?

```{r}

```

## Question 5 - Box-Cox transformations

The Box-Cox transformations are a parametrized family of power transformations designed to be applied to the outcome variable to improve the Normality of residuals of a linear model. For $\lambda\neq0$, the transformation maps $y$ to $\frac{y^\lambda-1}{\lambda}$ while for $\lambda=0$, the tranformation maps $y$ to $\ln y$.

For each value of $\lambda$ in the range of the argument "lambda", the "boxcox" function in the "MASS" package fits the linear model it is given as an argument but with the Box-Cox transformation applied to the outcome variable, assumed to be positive. The function "boxcox" computes the log likelihood of the residuals under the assumption of Normality. This is plotted against the $\lambda$'s and the $\lambda$'s and the corresponding log likelihoods are returned.

In typical use, a value of $\lambda$ close to maximizing the log likelihood is chosen and regression performed with this transformation applied to the outcome variable. This process is partially carried out below. The regression is for "TotHg" regressed on all the explanatory variables in the data set restricted to cases in which "fisherman" equals 1. Please complete the process as requested.

#### Fit the base model.

```{r}
dat<-read.csv("../ps4/fishermen_mercury.csv")
dat<-dat[dat$fisherman==1,]
dat$fishpart<-factor(dat$fishpart)

fmla<-str_c("TotHg~",str_c(names(dat)[2:7],collapse ="+"))
m<-lm(fmla,data=dat)
```

#### Run "boxcox" on the base model with default values for the remaining arguments.

```{r}
lambda<-boxcox(m)
```

#### Extract the $\lambda$ corresponding to the maximum profile log likelihood.

```{r}
ll.best<-which(lambda[[2]]==max(lambda[[2]]))
lambda.best<-lambda[[1]][ll.best]
```

#### Refit the model using the transformation of the outcome variable corresponding to this $\lambda$.

Please carry out this model fitting and do the usual plot-based model diagnostics - 8 points

```{r}

```

Please compare the Normality of the residual for the model with the transformed outcome variable and the model using the original outcome variable - 2 points

```{r}

```

SET 4

### Part 1

How do the predictions of m1 and m2 relate. Why?

### Part 2

According to an "anova" comparison, is m1 significantly better than m0? Experimentally, if you increase "n" in the code above to 100, 150, and 200, does this relationship change? Why or why not?

# Application - Fishmermen data

The following questions are based on data regarding mercury in hair samples of fishermen and non-fishermen from Kuwait.

Dataset: fishermen_mercury.csv

Source: N.B. Al-Majed and M.R. Preston (2000). "Factors Influencing the Total
Mercury and Methyl Mercury in the Hair of Fishermen in Kuwait,"
Environmental Pollution, Vol. 109, pp. 239-250

Description: Factors related to mercury levels among fishermen and a control
group of non-fishermen.

Variables/names

Fisherman indicator (fisherman)

Age in years (age)

Residence Time in years (restime)

Height in cm (height)

Weight in kg (weight)

Fish meals per week ((fishmlwk)

Parts of fish consumed: 0=none, 1=muscle tissue only, 2=mt and sometimes
         whole fish, 3=whole fish (fishpart)

Methyl Mercury in mg/g (MeHg)

Total Mercury in mg/g (TotHg)

downloaded from http://users.stat.ufl.edu/~winner/datasets.html 4/23/2019

## Question 2: Regression models with interaction terms

### Part 1

Please specify a regression model where TotHg is predicted by age, weight, and their interaction.

```{r}

```

Please interpret the coefficient estimates for the intercept, age, and weight in the language of the problem. Is the interaction between these variables significant?

### Part 2

Please specify a regression model where TotHg is predicted by age, fishmerman, and their interaction. Be sure that fisherman is the proper data type for the analysis.

```{r}

```

Please interpret the coefficient estimates for the intercept, age, and fisherman in the language of the problem. Is the interaction between these variables significant?

## Question 3: Backward selection

Please perform backward selection on the model below of the log of total mercury on the demographic variables. Please treat "fishpart" and "fisherman" as categorical variables. Assess the diagnostic plots for the final model. For the final model, would you consider removing any outliers?

```{r}

```

## Question 4: Forward selection

Please perform forward selection on the model below of the log of total mercury on the demographic variables. Please treat "fishpart" and "fisherman" as categorical variables Assess the diagnostic plots for the final model. For the final model, would you consider removing any outliers?

```{r}

```

## Question 5: Best subsets selection

Please perform best subsets selection on the model below of the log of total mercury on the demographic variables. Please treat "fishpart" and "fisherman" as categorical variables Assess the diagnostic plots for the final model. For the final model, would you consider removing any outliers?

```{r}

```

SET 2 ## Question 1: Application The data in &quot;dat_40_da...

SET 2 ## Question 1: Application The data in "dat_40_da...