In a randomized study (Storey et al. 2005), clinicians administered endotoxin to four patients and placebo to four patients. For each patient, gene expression was recorded at 6 time points between 0 and 24 hours. The files endotoxin_gene_exp.txt and endotoxin_info.txt contain these data for one particular gene. The goal of this experiment is to determine whether the gene expression profiles of patients over time differ between control and endotoxin-treated patients. The null hypothesis is that these profiles do not differ. We will use regression to analyze this problem.

Problem 2, part a
We have two different covariates, class and time. class is clearly a factor. What are the advantages and disadvantages of treating time as a continuous variable versus as a factor variable?

Problem 2, part b
Use linear regression and ANOVA to test the null hypothesis, and report a p-value.

Problem 2, part c
Continuing from part b, determine from your linear regression whether the trajectories over time differ between the classes, or whether only the mean expression levels differ.

Problem 2, part d
Use a natural cubic spline with a B-spline basis to test the null hypothesis, and report a p-value. Use the ns() function from the package splines in conjunction with lm().
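One possible way to write the model behind such a fit, offered as a sketch rather than the required parameterization: let $y_{ij}$ be the expression of patient $i$ at time $t_{ij}$, $c_i \in \{0, 1\}$ the class indicator, and $b_1, \dots, b_K$ the basis functions returned by ns() for a chosen number of degrees of freedom $K$. All of this notation is assumed here and does not come from the assignment.

$$
y_{ij} = \beta_0 + \beta_1 c_i + \sum_{k=1}^{K} \gamma_k\, b_k(t_{ij}) + \sum_{k=1}^{K} \delta_k\, c_i\, b_k(t_{ij}) + \varepsilon_{ij}
$$

In this notation, $\beta_1$ shifts the whole curve for one class, while the $\delta_k$ let the shape of the curve differ between classes. With only 6 time points per patient, $K$ has to stay small.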

Problem 2, part e
Continuing from part d, and analogous to part c, determine from your natural cubic spline fit if the trajectories over time are different between classes, or if only the mean expression levels are different.

Problem 3: testing for genetic association

Recall from previous homeworks that single nucleotide polymorphisms (SNPs) in humans take the values 0, 1, and 2 and represent genetic variation. Consider the following genotype data for a single SNP in a case/control study for some disease. There are 600 patients (300 patients in each class):
Genotype       0     1     2
Has disease  111   143    46
No disease   161   117    22
We want to determine if this particular genetic variant is associated with this disease.

Problem 3, part a
Suppose that Hardy-Weinberg Equilibrium (HWE) holds. In statistical terms, this means that the SNPs can be modeled as the sum of two independent Bernoulli trials. Thus, the table of genotypes could be turned into a table of alleles:
Allele         0     1
Has disease  365   235
No disease   439   161
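The allele counts follow from the genotype counts by simple bookkeeping: a person with genotype $g$ carries $g$ copies of allele 1 and $2-g$ copies of allele 0. Writing $n_0, n_1, n_2$ for the genotype counts in one disease class and $a_0, a_1$ for the resulting allele counts (notation introduced here for illustration only):

$$
a_0 = 2n_0 + n_1, \qquad a_1 = n_1 + 2n_2
$$

$$
\text{Has disease: } a_0 = 2(111) + 143 = 365, \quad a_1 = 143 + 2(46) = 235
$$

$$
\text{No disease: } a_0 = 2(161) + 117 = 439, \quad a_1 = 117 + 2(22) = 161
$$

Each row sums to 600 alleles, two per patient.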
Obtain a measure of the effect of association, i.e., the log odds ratio of disease status as a function of the alleles, and compute a 95% confidence interval. Is the association between disease status and the alleles significant at an $\alpha=0.05$ threshold?
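For reference, for a generic $2 \times 2$ table with cell counts $a, b$ in the first row and $c, d$ in the second (generic symbols, not the assignment's notation), a standard Wald-type estimate and 95% interval for the log odds ratio are:

$$
\widehat{\log \mathrm{OR}} = \log\frac{a\,d}{b\,c}, \qquad \widehat{\mathrm{SE}} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}, \qquad \widehat{\log \mathrm{OR}} \pm 1.96\,\widehat{\mathrm{SE}}
$$

A logistic regression of disease status on allele should reproduce the same estimate and standard error. The sign depends on which row and column are taken as the reference, and an interval that excludes 0 corresponds to a Wald test that rejects at the 0.05 level.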

Problem 3, part b
Suppose that HWE does not hold. We could go back to the original table and model the three genotypes separately as a categorical factor variable. Obtain the log odds ratios of disease status as a function of the genotypes, and compute 95% confidence intervals. Is the association between disease status and the genotypes significant at an $\alpha=0.05$ threshold?
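One way to write such a factor model, assuming genotype 0 is the reference level (the default in R when 0 is the first factor level) and using symbols introduced here for illustration, with $D$ the disease indicator, $g$ the genotype, and $I(\cdot)$ an indicator function:

$$
\operatorname{logit} P(D = 1 \mid g) = \beta_0 + \beta_1\, I(g = 1) + \beta_2\, I(g = 2)
$$

In this parameterization, $\beta_1$ and $\beta_2$ are the log odds ratios for genotypes 1 and 2 relative to genotype 0, so each has its own confidence interval.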

Problem 3, part c
Instead of treating the genotype as a factor as in part b, we could try treating it as a numerical variable. Model the genotype as a continuous variable taking the integer values 0, 1, and 2. Obtain the log odds ratio of disease status as a function of the genotype, and compute a 95% confidence interval. Is the association between disease status and the genotype significant at an $\alpha=0.05$ threshold?
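Treating the genotype as numeric corresponds to a model that is linear in $g$ on the logit scale. In notation assumed here for illustration, with $D$ the disease indicator:

$$
\operatorname{logit} P(D = 1 \mid g) = \gamma_0 + \gamma_1 g, \qquad g \in \{0, 1, 2\}
$$

Under this model, $\gamma_1$ is the log odds ratio per additional copy of allele 1, and the log odds ratio for genotype 2 versus genotype 0 is constrained to equal $2\gamma_1$.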

Problem 3, part d
Use the likelihood ratio test to compare the model of part b with the model of part c, using the anova() function with test set to "Chisq". What are the null and alternative hypotheses of this test? Interpret the results, especially in the context of a genetic model of dominance.
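As a reminder of what this comparison computes, assuming both models are logistic regressions fit with glm(): let $\ell_b$ and $\ell_c$ be the maximized log-likelihoods of the models from parts b and c, $\mathrm{Dev}_b$ and $\mathrm{Dev}_c$ their residual deviances, and $p_b$, $p_c$ their numbers of estimated coefficients (all symbols introduced here for illustration). Then

$$
\Lambda = 2\,(\ell_b - \ell_c) = \mathrm{Dev}_c - \mathrm{Dev}_b, \qquad \Lambda \;\overset{\text{approx.}}{\sim}\; \chi^2_{\,p_b - p_c} \ \text{ when the smaller model holds}
$$

The reported p-value is the upper-tail probability of this chi-square distribution at the observed $\Lambda$.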
