## Question

library(devtools) install_github("jdstorey/qvalue")

Problem 1: Modeling overdispersion in RNA-seq

We’ve already encountered RNA sequencing as a method for measuring gene expression. After processing, the data is often presented as nonnegative integer counts for each gene. Instinctively, we would hope to be able to use a Poisson distribution to model gene expression for each gene, but in practice, we often observe a phenomenon known as overdispersion. Overdispersion is where the mean-variance relationship for the Poisson distribution is broken, and the sample variance exceeds the sample mean. We would like to explore the sources of overdispersion using a carefully designed experiment.

We use a subset of data from an experiment published in Robinson, Wang, and Storey (2015). In this experiment, a common model organism S. cerevisiae (bakers yeast) was grown in a nested factorial fashion in order to capture variation introduced at each step. The steps are as follows:

1. Condition: a colony of yeast was split into two cultures, one grown in glucose limiting medium and the other in ethanol limiting medium. The two colonies are referred to by glucose and ethanol, respectively.

2. Biological replication: yeast growing in each culture was sampled twice, at diﬀerent times.

3. Preparation eﬀects: the four samples from the previous two steps were prepared twice for RNA sequencing using the sample protocol.

4. Lane eﬀects: the eight samples from the previous three steps were prepared and sequenced in two separate lanes, for a total of 16 samples.

See Figure 1 in the manuscript for a graphical depiction of the experiment. Since the samples were prepared in this nested fashion, we hope to model the contributions to the observed overdispersion in each step of the experiment.

The data is available in the file yeast_rnaseq.txt. There are 6,000 genes. There is a header containing sample names, which identify the experimental design variable values. The file experimental_design.txt contains a mapping between the sample names and vectors showing which part of the experimental design each sample was taken from.

Some hints:

1. You may find it helpful to tidy your data.

2. We’ve mentioned the package broom in the past, but we haven’t had an exercise that uses it. You may find it very useful here, since you will be fitting many models simultaneously. It will tidy the output of lm() and glm() and plays well with do() from dplyr.

Problem 1, part a: Observed overdispersion

Plot the method of moments variance estimates versus the maximum likelihood variance estimates for each gene in this dataset. You should observe overdispersion in that the methods of moments variance estimates are larger than the MLE variances, and this deviation increases as a function of observed mean expression of each gene. Consider carefully the scale of the axes when making this plot.

Problem 1, part b: Model matrix for the experimental design

One strategy to model the sources of overdispersion is to fit a Poisson GLM implementing the experimental design in terms of explanatory variables. Use the R function model.matrix to implement the experimental design as numerical explanatory variables, with particular attention to the nested nature of the experiment. A big hint: the answer is in Figure S9 of the manuscript. Your use of model.matrix should use standard R model formula syntax (e.g., ~ age + weight + age:weight) and the output should match Figure S11. Since we are giving you the answer, we won’t accept solutions that just hard code the values.

Problem 1, part c: Poisson GLM

For each gene, fit a Poisson GLM with the gene expression as the response with an explanatory variables model as determined in part b. Then, extract the fitted Poisson fitted parameter for each observation. Treating the fitted paramaters values as MLE estimate of the Poisson variance accounting for the experimental design, remake the figure from part a, replacing the old MLE estimate with the Poisson GLM estimate. You should observe that the overdispersion is greatly reduced.

Problem 1, part d: explanation

Provide a brief explanation as to why the Poisson GLM model fit reduces apparent overdispersion. You do not need to do a lot of technical work here. We are looking for a brief argument that uses the statistical principles.

Problem 2: GEUVADIS data set

For this problem, we consider a simplified version of expression quantitative trait loci (eQTL) analysis. First, there is genome-wide gene expression data. Second, there is also corresponding genetic variation data (in our case, SNP genotypes). The goal is to identify genetic variants (SNPs) that are associated with the gene expression values. The hope is that detecting this type of association provides evidence that these variants are involved in the regulation of gene expression.

The data for this problem is sourced from the GEUVADIS project (http://www.geuvadis.org/web/geuvadis). We have provided a tiny subset of it consisting of gene expression values for the gene SHANK2, as well as genotypes for 1,000 SNPs. There are 458 individuals in this study.

There are two data files.

1. geuvadis_genotypes.txt: This contains the genotypes. There are no row names, but we recommend simply numbering the SNPs from 1 to 1000. The column header identifies the samples.

2. geuvadis_gene_exp.txt: This contains gene expression data as well as information about the samples.

Some general hints:

1. For the many responses models, building a combined tidy data.frame of all the data could help since that’ll give you easy access to the do() function.

2. However, for the LASSO, tidy format might not help by default since glmnet() takes vectors/matrices as input. You might want to try the package glmnetUtils.

3. Be very careful about the organization of the data. If the genotypes and expression values don’t match up for the same samples, the analysis won’t work! If you find yourself needing to change the order of a vector to match another vector, the base function match() will help.

Problem 2, part a: Exploring systematic variation in the data

Use PCA to explore systematic variation among the genotype variables by making biplots. We give you two additional pieces of information about each sample: the sample’s population and also the lab where the data was generated. Further, make a scree plot showing the proportion of variation explained by each PC.

Problem 2, part b: Transforming the response

Let’s examine the distribution of the gene expression values for SHANK2 to motivate a log transformation. Plot the distribution of the raw gene expression. Then, plot the distribution of the log transformed data. Note that since there are zeroes, you will need to use a pseudocount, i.e., add something to the argument of the log. One option is doing log(gene expression + 1), since this maps 0 to 0. Explain if the plots of these distributions visually justify the log transformation. Use the transformed gene expression for the rest of the problem.

Problem 2, part c: Accounting for variation in gene expression

Determine if there are diﬀerences in mean (transformed) gene expression, when the expression is modeled in terms of population or lab. First, assess this visually. Then, assess this using statistical inference (i.e., compute p-values using appropriate hypothesis tests).

Problem 2, part d: Multiple testing I

For each SNP, perform a test of association by fitting a simple linear model of the (transformed) gene expression as a function of the SNP.

Then, aggregate the p-values across SNPs and plot a p-value histogram. Finally, compute πˆ0 (estimated proportion of null tests) and identify which SNPs are significant at threshold corresponding to a false discovery rate (FDR) estimate ≤ 0.05, using the qvalue package. We will refer to this analysis as the “unadjusted analysis” in future parts of this problem.

Problem 2, part e: Multiple testing II

Repeat part d, except include as covariates the variables population and lab to adjust for possible confounding factors. Make sure to report the same results: histogram of p-values, πˆ0, and which SNPs are significant at FDR ≤ 0.05. How do the results compare to those of part d?

Problem 2, part f: Multiple regressors with LASSO

Perform LASSO regression using all the SNPs as explanatory variables and the (transformed) gene expression as the response variable. We recommend using the glmnet package. Don’t forget that LASSO regression has a tuning parameter, usually called λ, which you should determine using cross-validation. Since cross-validation is a random process, please set a seed! Plot the cross validation results (i.e., prediction error as a function of λ). Report which SNPs have non-zero coeﬃcients at the chosen value of λ. Is this consistent with the unadjusted analysis in part d? Explain why or why not.

Hint: Use lambda.min instead of lambda.1se. The signal in this analysis is too weak to shrink away coeﬃcients aggressively.

Problem 2, part g: Modeling many responses

Now, use logistic regression to perform a test of association for each SNP, treating genotype as a response variable modeled in ters of the transformed gene expression as an explanatory variable. Do your results agree with the unadjusted test for association in part d? Provide a theoretical justification for what you observe. We are not asking for a lot of technical work or algebra, but rather a demonstration of an understanding of the concepts.

Always include the session information.

sessionInfo()

## Solution Preview

These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice. Unethical use is strictly forbidden.

By purchasing this solution you'll be able to access the following files:

Solution.docx and Solution.R.