## Question

Problem 2, part a

We have two different covariates, class and time. class is clearly a factor. What are the advantages and disadvantages to treating time as a continuous variable or as a factor variable?
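As a rough illustration of the trade-off, the sketch below compares the design matrices that R builds when time is coded as a factor versus as a number. The vector `time_points` is an illustrative assumption that reuses the hours mentioned in the solution preview; it is not the assignment's dataset.

```r
# Illustrative time points (assumed; taken from the hours named in the solution preview)
time_points <- c(2, 4, 6, 9, 24)

# Factor coding: an intercept plus one indicator column per non-reference time point
X_factor <- model.matrix(~ factor(time_points))

# Numeric coding: an intercept plus a single slope column
X_numeric <- model.matrix(~ time_points)

n_par_factor <- ncol(X_factor)
n_par_numeric <- ncol(X_numeric)
print(c(factor = n_par_factor, numeric = n_par_numeric))
```

The factor coding spends one parameter per sampled hour and assumes nothing about the shape of the trajectory, while the numeric coding spends a single slope and assumes a straight line in time.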

Problem 2, part b

Use linear regression and ANOVA to test the null hypothesis, and report a p-value.

Problem 2, part c

Continuing from part b, determine from your linear regression if the trajectories over time are different between the classes, or if only the mean expression levels are different.
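One possible way to organize the linear-regression parts is a sequence of nested `lm()` fits compared with `anova()`. The sketch below runs on simulated data, because the endotoxin dataset is not reproduced here. The data frame `toy`, its time points, and the assumed null hypothesis (no difference between classes) are illustrative assumptions rather than the assignment's data or the author's solution.

```r
set.seed(42)

# Simulated stand-in for the long-format data (hypothetical values)
toy <- expand.grid(time = c(2, 4, 6, 9, 24),
                   replicate_id = 1:4,
                   class_org = c("Control", "Endotoxin"))
toy$Values <- rnorm(nrow(toy)) + ifelse(toy$class_org == "Endotoxin", 1, 0)

# Nested models: no class effect, shifted mean only, class-specific trajectories
fit_null        <- lm(Values ~ factor(time), data = toy)
fit_additive    <- lm(Values ~ factor(time) + class_org, data = toy)
fit_interaction <- lm(Values ~ factor(time) * class_org, data = toy)

# Row 2 tests a mean shift between classes; row 3 tests the time-by-class interaction
tab_lm <- anova(fit_null, fit_additive, fit_interaction)
print(tab_lm)
```

In this layout, a small p-value in the second row alone would suggest that only the mean expression levels differ, while a small p-value in the third row would suggest that the trajectories themselves differ between classes.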

Problem 2, part d

Use a natural cubic spline with a B-spline basis to test the null hypothesis, and report a p-value. Use the ns() function from the package splines in conjunction with lm().

Problem 2, part e

Continuing from part d, and analogous to part c, determine from your natural cubic spline fit if the trajectories over time are different between classes, or if only the mean expression levels are different.
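The spline parts can follow the same nested-model pattern, with `ns()` replacing the factor coding of time. The sketch below is a minimal illustration on simulated data. The data frame `toy_spline`, its time points, the choice `df = 3`, and the assumed null hypothesis (no difference between classes) are all assumptions made for the example.

```r
library(splines)
set.seed(42)

# Simulated stand-in for the long-format data (hypothetical values)
toy_spline <- expand.grid(time = c(0, 2, 4, 6, 9, 24),
                          replicate_id = 1:4,
                          class_org = c("Control", "Endotoxin"))
toy_spline$Values <- sin(toy_spline$time / 4) +
  ifelse(toy_spline$class_org == "Endotoxin", 1, 0) +
  rnorm(nrow(toy_spline), sd = 0.3)

# Natural cubic spline in time; df = 3 is an arbitrary illustrative choice
fit_s_null        <- lm(Values ~ ns(time, df = 3), data = toy_spline)
fit_s_additive    <- lm(Values ~ ns(time, df = 3) + class_org, data = toy_spline)
fit_s_interaction <- lm(Values ~ ns(time, df = 3) * class_org, data = toy_spline)

# Row 2 tests a mean shift between classes; row 3 tests class-specific spline curves
tab_spline <- anova(fit_s_null, fit_s_additive, fit_s_interaction)
print(tab_spline)
```

Compared with the factor coding, the spline keeps time continuous, so the interaction test uses one degree of freedom per basis column instead of one per sampled time point.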

Problem 3: testing for genetic association

Recall from previous homeworks that single nucleotide polymorphisms (SNPs) in humans take the values 0, 1, and 2 and represent genetic variation. Consider the following genotype data for a single SNP in a case/control study for some disease. There are 600 patients (300 patients in each class):

| Genotype    | 0   | 1   | 2  |
|-------------|-----|-----|----|
| Has disease | 111 | 143 | 46 |
| No disease  | 161 | 117 | 22 |

We want to determine if this particular genetic variant is associated with this disease.

Problem 3, part a

Suppose that Hardy-Weinberg Equilibrium (HWE) holds. In statistical terms, this means that the SNPs can be modeled as the sum of two independent Bernoulli trials. Thus, the table of genotypes could be turned into a table of alleles:

| Allele      | 0   | 1   |
|-------------|-----|-----|
| Has disease | 365 | 235 |
| No disease  | 439 | 161 |

Obtain a measure of the effect of association, i.e. the log odds ratio of disease status as a function of the alleles, and compute a 95% confidence interval. Is the association between the disease status and the alleles significant at a $\alpha=0.05$ threshold?
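A minimal sketch of one standard approach is shown below. It computes the log odds ratio directly from the allele table and forms a Wald (normal-approximation) interval. The choice of a Wald interval is an assumption; the problem does not say which interval is expected.

```r
# Allele counts from the table above
alleles <- matrix(c(365, 235,
                    439, 161),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(status = c("Has disease", "No disease"),
                                  allele = c("0", "1")))

# Log odds ratio of disease for allele 1 relative to allele 0
log_or <- log((alleles["Has disease", "1"] * alleles["No disease", "0"]) /
              (alleles["Has disease", "0"] * alleles["No disease", "1"]))

# Standard error of the log odds ratio: sqrt of the sum of reciprocal cell counts
se_log_or <- sqrt(sum(1 / alleles))

# 95% Wald confidence interval on the log-odds scale
ci_log_or <- log_or + c(-1, 1) * qnorm(0.975) * se_log_or

print(c(log_or = log_or, lower = ci_log_or[1], upper = ci_log_or[2]))
```

If the interval excludes 0, the association is significant at the 0.05 level under this approximation.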

Problem 3, part b

Suppose that HWE does not hold. We could go back to the original table and model the three genotypes separately as a categorical factor variable. Obtain the log odds ratios of disease status as a function of the genotypes, and compute 95% confidence intervals. Is the association between disease status and the genotypes significant at a $\alpha=0.05$ threshold?
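One way to fit this model is a logistic regression on the grouped counts with `glm()`. In the sketch below, the data frame `geno` and the use of Wald intervals from `confint.default()` are choices made for the example, not something the problem prescribes.

```r
# Genotype counts from the original table
geno <- data.frame(genotype = c(0, 1, 2),
                   case     = c(111, 143, 46),
                   control  = c(161, 117, 22))

# Genotype as a categorical factor; genotype 0 is the reference level
fit_factor <- glm(cbind(case, control) ~ factor(genotype),
                  family = binomial, data = geno)

# Coefficients 2 and 3 are the log odds ratios for genotypes 1 and 2 versus 0
log_or_factor <- coef(fit_factor)

# 95% Wald confidence intervals on the log-odds scale
ci_factor <- confint.default(fit_factor, level = 0.95)

print(cbind(estimate = log_or_factor, ci_factor))
```

Because this model has one parameter per genotype, the fitted log odds ratios equal the ones computed by hand from the table.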

Problem 3, part c

Instead of treating the genotype as a factor like in part b, we could try treating it as a numerical variable. Model the genotype as a continuous variable taking the integer values of 0, 1, and 2. Obtain the log odds ratio of disease status as a function of the genotypes, and compute a 95% confidence interval. Is the association between disease status and the genotypes significant at a $\alpha=0.05$ threshold?
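The numeric coding can be sketched in the same way. Here the single slope is the log odds ratio per additional copy of allele 1. As before, the data frame `geno_num` and the Wald interval are assumptions made for illustration.

```r
# Genotype counts from the original table
geno_num <- data.frame(genotype = c(0, 1, 2),
                       case     = c(111, 143, 46),
                       control  = c(161, 117, 22))

# Genotype as a numeric covariate: one slope shared by the 0 -> 1 and 1 -> 2 steps
fit_numeric <- glm(cbind(case, control) ~ genotype,
                   family = binomial, data = geno_num)

# Log odds ratio per allele copy and its 95% Wald confidence interval
slope    <- unname(coef(fit_numeric)["genotype"])
ci_slope <- confint.default(fit_numeric, parm = "genotype", level = 0.95)

print(c(slope = slope, lower = ci_slope[1], upper = ci_slope[2]))
```

This coding assumes the effect is additive on the log-odds scale, so genotype 2 carries exactly twice the log odds ratio of genotype 1.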

Problem 3, part d

Use the likelihood ratio test to compare the model of part b with the model of part c, by using the anova() function specifying test as "Chisq". What are the null and alternative hypotheses of this test? Interpret the results, especially in the context of a genetic model of dominance.
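A minimal sketch of the comparison is given below. It refits both models so the block runs on its own; the object names are hypothetical. The null hypothesis is that the additive (numeric) model is adequate, meaning the log odds ratio for genotype 2 is twice that for genotype 1. The alternative is that the genotypes need separate effects, which would indicate a departure from additivity such as dominance.

```r
# Genotype counts from the original table
geno_lrt <- data.frame(genotype = c(0, 1, 2),
                       case     = c(111, 143, 46),
                       control  = c(161, 117, 22))

# Smaller (additive) model and larger (one effect per genotype) model
fit_add <- glm(cbind(case, control) ~ genotype,
               family = binomial, data = geno_lrt)
fit_cat <- glm(cbind(case, control) ~ factor(genotype),
               family = binomial, data = geno_lrt)

# Likelihood ratio test: difference in deviance on 1 degree of freedom
lrt <- anova(fit_add, fit_cat, test = "Chisq")
p_value <- lrt[2, "Pr(>Chi)"]

print(lrt)
```

A large p-value would mean the data give no evidence against the additive model, so there is no sign of a dominance effect. A small p-value would favor the factor model.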

## Solution Preview


# Converting the data to long format

library(tidyr)

endotoxin_gene_exp_long<-gather(endotoxin_gene_exp,key = class,value = Values, 2:9)

endotoxin_gene_exp_long$class_org<-ifelse(endotoxin_gene_exp_long$class %in% c("sample_1","sample_2","sample_3","sample_4"),"Endotoxin","Control")

endotoxin_gene_exp_long<-na.omit(endotoxin_gene_exp_long)

# Problem 2

# We have two different covariates, class and time . class is clearly a factor. What are the advantages and disadvantages to treating time as a continuous variable or as a factor variable?

# The advantage of using time as a factor is that no shape is assumed for the expression trajectory: each sampled hour gets its own mean. The data are collected in hour-of-the-day format and take only a few distinct values, which need not be treated as continuous. The disadvantage is that with time as a factor we cannot assess the gene expression profiles of patients at times other than 2, 4, 6, 9 and 24, and the model needs one parameter per time point.

# If we treat time as a continuous variable, the main advantage is that we can assess the gene expression profiles of patients at times other than the predefined times in the dataset, using fewer parameters. The disadvantage is that a linear term assumes expression changes at a constant rate per hour. For each sample, time runs from 0 to 24 hours and cannot take any value outside that range, so we treat it as a factor variable here.

#clopper pearson()

# Part b

# Use linear regression and ANOVA to test the null hypothesis, and report a p-value.

# Let's fit a regression model to see the behaviour of the profiles

FitRegression<-lm(Values~as.factor(time)+class_org,data=endotoxin_gene_exp_long)

summary(FitRegression)

# The results above show that the overall model is significant (F-statistic 16.46, p-value 2.425e-09), which means that at least one of the predictors is associated with the variable under study. All the individual coefficient p-values are also significant except one: the 24-hour time point.

# Now let's proceed with the ANOVA

FitANOVA<-aov(Values~as.factor(time)+class_org,data=endotoxin_gene_exp_long)

summary(FitANOVA)...
