Question

(1)
We analyze the NCI60 data set from the ISLR library.
library(ISLR)
ncidata <- NCI60$data
(a) Delete the cancer types with only one or two cases (“K562A-repro”, etc.). Keep only the cancer types with three or more cases.
(b) Analyze the expression values of the first gene in the data (first column). Does the first gene express differently in different types of cancers? If so, in which pairs of cancer types does the first gene express differently? (Use FDR adjustment.)
(c) Check the model assumptions for analysis in part (b). Is ANOVA analysis appropriate here?
(d) Apply ANOVA analysis to each of the 6830 genes. At FDR level of 0.05, how many genes express differently among different types of cancer patients?
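
As an illustration of the kind of analysis parts (b) through (d) ask for, here is a minimal sketch in base R. It runs on simulated data, not on NCI60: the matrix `expr`, the labels `type`, the group sizes, and the shift added to the first gene are all made up for the example. Only `aov`, `pairwise.t.test`, `shapiro.test`, `bartlett.test`, and `p.adjust` from the stats package are used, and the choice of Shapiro-Wilk and Bartlett tests for the assumption check is one option among several.

```r
set.seed(1)

# Hypothetical data: 30 samples (rows) x 50 genes (columns), 3 cancer types.
type <- factor(rep(c("A", "B", "C"), each = 10))
expr <- matrix(rnorm(30 * 50), nrow = 30, ncol = 50)
# Shift the first gene in group C so that it differs by construction.
expr[type == "C", 1] <- expr[type == "C", 1] + 3

# Part (b) style: one-way ANOVA on the first gene.
fit1 <- aov(expr[, 1] ~ type)
p.first <- summary(fit1)[[1]][["Pr(>F)"]][1]

# Pairwise t-tests with FDR (Benjamini-Hochberg) adjusted p-values.
pw <- pairwise.t.test(expr[, 1], type, p.adjust.method = "fdr")
pw$p.value

# Part (c) style: normality of residuals and equality of group variances.
p.normal <- shapiro.test(residuals(fit1))$p.value
p.equalvar <- bartlett.test(expr[, 1] ~ type)$p.value

# Part (d) style: ANOVA p-value for every gene, then FDR adjustment.
anova.p <- apply(expr, 2, function(y) anova(lm(y ~ type))[["Pr(>F)"]][1])
n.sig <- sum(p.adjust(anova.p, method = "fdr") < 0.05)
n.sig
```

On the real data the same calls would take the filtered `ncidata` and `labs` from part (a) in place of `expr` and `type`.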

(2)
We analyze data for the B-cell patients in the ALL data set in the textbook.
(a) Select gene expression data for only the B-cell patients. The analysis in the following parts will use only these gene expression data on the B-cell patients.
(b) Select only those genes whose coefficient of variation (i.e., standard deviation divided by the mean) is greater than 0.2. How many genes are selected?
(c) We wish to conduct a clustering analysis to study natural groupings of the patients predicted by the gene expression profiles. For this analysis, we first need to reduce the number of genes studied. The filter in (b) is one such choice. Please comment on what filtering methods you would use to choose genes, other than the filter in (b). What would you consider the best gene filter in this case?
(d) Conduct a hierarchical clustering analysis with the genes filtered in (b). (For uniformity in grading, we ask everyone to use the filter in (b). It may not be your best filter in (c).) How do the clusters compare to the B-stages? How do the clusters compare to the molecular biology types (in variable ALL$mol.biol)? Provide the confusion matrices of the comparisons, with 4 clusters.
(e) Draw two heatmaps for the expression data in (d), one for each comparison. Use color bars to show the comparison types (B-stages or molecular biology types). Which types do the clusters reflect better: B-stages or molecular biology types?
(f) We focus on predicting the B-cell differentiation in the following analysis. We merge the last two categories “B3” and “B4”, so that we are studying 3 classes: “B1”, “B2” and “B34”. (Ignore the unknown type “B” in the analysis.) Use a linear model (limma library) to select genes that are expressed differently among these three classes at an FDR of 0.05. How many genes are selected?
(g) Fit an SVM and a classification tree on the genes selected in part (f), and evaluate their performance with the leave-one-out cross-validated misclassification rate.
(h) We select the genes passing both filters in (b) and (f). How many genes are selected? Redo part (g) on the genes passing both filters.
(i) Which classifier would you consider best among the classifiers studied in part (g) and part (h)? Why?
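
The filtering and clustering steps in parts (b) and (d) can be sketched in base R as follows. This is an illustration on simulated data, not on the ALL data set: the matrix `expr` (genes in rows, patients in columns), the factor `stage`, and the shifts added to the first 40 genes are hypothetical, and average linkage on Euclidean distance is an arbitrary choice made for the example.

```r
set.seed(2)

# Hypothetical expression matrix: 200 genes (rows) x 24 patients (columns).
expr <- matrix(rnorm(200 * 24, mean = 6, sd = 0.5), nrow = 200)
stage <- factor(rep(c("B1", "B2", "B3", "B4"), each = 6))
# Make the first 40 genes highly variable and stage dependent by adding
# a stage-specific offset to every patient column.
expr[1:40, ] <- sweep(expr[1:40, ], 2, rep(c(0, 3, 6, 9), each = 6), "+")

# Part (b) style: coefficient of variation = standard deviation / mean.
cv <- apply(expr, 1, sd) / rowMeans(expr)
keep <- cv > 0.2
sum(keep)  # number of genes passing the filter

# Part (d) style: cluster the patients (columns) on the filtered genes.
hc <- hclust(dist(t(expr[keep, ])), method = "average")
cl <- cutree(hc, k = 4)

# Confusion matrix of the 4 clusters against the known stages.
tab <- table(cluster = cl, stage = stage)
tab
```

The same `table` call with a molecular biology factor in place of `stage` would give the second confusion matrix.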

Solution Preview

rm(list=ls())
# (1)
library(ISLR)
# NCI60$data is a 64 x 6830 matrix of expression values (64 cell lines by 6830 genes);
# NCI60$labs is a vector giving the cancer type of each of the 64 cell lines.
ncidata <- NCI60$data
dim(ncidata)
# 64 x 6830
str(ncidata)
labs <- NCI60$labs
# length(labs) = 64.
labs
# [1] "CNS"         "CNS"         "CNS"         "RENAL"       "BREAST"      "CNS"         "CNS"         "BREAST"      "NSCLC"      
# [10] "NSCLC"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "BREAST"   
# [19] "NSCLC"       "RENAL"       "UNKNOWN"    "OVARIAN"    "MELANOMA"    "PROSTATE"    "OVARIAN"    "OVARIAN"    "OVARIAN"   
# [28] "OVARIAN"    "OVARIAN"    "PROSTATE"    "NSCLC"       "NSCLC"       "NSCLC"       "LEUKEMIA"    "K562B-repro" "K562A-repro"
# [37] "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "COLON"       "COLON"       "COLON"       "COLON"      
# [46] "COLON"       "COLON"       "COLON"       "MCF7A-repro" "BREAST"      "MCF7D-repro" "BREAST"      "NSCLC"       "NSCLC"      
# [55] "NSCLC"       "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"   
# [64] "MELANOMA"
table(labs)>2
cancer.types <- names(table(labs))[ table(labs)>2]
cancer.types
# [1] "BREAST"   "CNS"      "COLON"    "LEUKEMIA" "MELANOMA" "NSCLC"    "OVARIAN" "RENAL"
idx <- which( labs %in% cancer.types)
labs <- labs[idx]
ncidata <- ncidata[idx,]...
