 # R Programming Problems


## Question

(1)
We analyze the NCI60 data set from the ISLR library.
`library(ISLR); ncidata <- NCI60$data`
(a) Delete the cancer types with only one or two cases (“K562A-repro”, etc.). Keep only the cancer types with three or more cases.
(b) Analyze the expression values of the first gene in the data (first column). Does the first gene express differently in different types of cancers? If so, in which pairs of cancer types does the first gene express differently? (Use FDR adjustment.)
(c) Check the model assumptions for analysis in part (b). Is ANOVA analysis appropriate here?
(d) Apply ANOVA analysis to each of the 6830 genes. At FDR level of 0.05, how many genes express differently among different types of cancer patients?
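
For parts (b) and (c), a minimal sketch of the single-gene workflow might look like the following. It runs on simulated values rather than the NCI60 data, so the group names, the group sizes, and the variables `type` and `expr` are assumptions made for illustration. With the real data, `expr` would be the first column of the filtered `ncidata` and `type` the filtered labels.

```r
# Simulated stand-in for one gene: 8 hypothetical cancer types, 6 cell lines each.
set.seed(1)
type <- factor(rep(paste0("type", 1:8), each = 6))
# Shift the mean of the first type so that a true group difference exists.
expr <- rnorm(length(type), mean = ifelse(type == "type1", 2, 0))

# (b) One-way ANOVA: does mean expression differ across types?
fit <- lm(expr ~ type)
anova.p <- anova(fit)[["Pr(>F)"]][1]

# Pairwise t-tests with FDR (Benjamini-Hochberg) adjusted p-values.
pw <- pairwise.t.test(expr, type, p.adjust.method = "fdr")
# pw$p.value is a lower-triangular matrix; list the pairs with adjusted p < 0.05.
sig.pairs <- which(pw$p.value < 0.05, arr.ind = TRUE)

# (c) Assumption checks: normality of residuals and equality of variances.
shapiro.p  <- shapiro.test(residuals(fit))$p.value
bartlett.p <- bartlett.test(expr ~ type)$p.value

# Rank-based alternative that does not rely on normality.
kruskal.p <- kruskal.test(expr ~ type)$p.value
```

By default `pairwise.t.test` pools the standard deviation across groups, which matches the equal-variance assumption of ANOVA. If the Bartlett test suggests unequal variances, `pool.sd = FALSE` or the Kruskal-Wallis result may be easier to defend.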

(2)
We analyze data for the B-cell patients in the ALL data set in the textbook.
(a) Select gene expression data for only the B-cell patients. The analysis in the following parts will use only these gene expression data on the B-cell patients.
(b) Select only those genes whose coefficient of variation (i.e., standard deviation divided by the mean) is greater than 0.2. How many genes are selected?
(c) We wish to conduct a clustering analysis to study natural groupings of the patients predicted by the gene expression profiles. For this analysis, we first need to reduce the number of genes studied. The filter in (b) is one such choice. Please comment on what filtering methods you would use to choose genes, other than the filter in (b). What would you consider the best gene filter in this case?
(d) Conduct a hierarchical clustering analysis with the filtered genes from (b). (For uniformity in grading, we ask everyone to use the filter in (b). It may not be your best filter in (c).) How do the clusters compare to the B-stages? How do the clusters compare to the molecular biology types (in the variable `ALL$mol.biol`)? Provide the confusion matrices of the comparisons, with 4 clusters.
(e) Draw two heatmaps for the expression data in (d), one for each comparison. Use color bars to show the comparison types (B-stages or molecular biology types). Which types do the clusters reflect better: B-stages or molecular biology types?
(f) We focus on predicting the B-cell differentiation in the following analysis. We merge the last two categories “B3” and “B4”, so that we are studying 3 classes: “B1”, “B2” and “B34”. (Ignore the unknown type “B” in the analysis.) Use a linear model (limma library) to select genes that are expressed differently among these three classes at an FDR of 0.05. How many genes are selected?
(g) Fit an SVM and a classification tree on the genes selected in part (f), and evaluate their performance with the leave-one-out cross-validated misclassification rate.
(h) We select the genes passing both filters in (b) and (f). How many genes are selected? Redo part (g) on the genes passing both filters.
(i) Which classifier would you consider best among the classifiers studied in part (g) and part (h)? Why?
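
The filtering and clustering steps in parts (b) and (d) can be illustrated with a short sketch. It uses a simulated matrix instead of the ALL data, so the dimensions, the stage labels, and the names `x`, `stage`, and `x.f` are assumptions. The layout (genes in rows, patients in columns) mirrors what an expression matrix from an ExpressionSet usually looks like.

```r
set.seed(2)
n.genes <- 200
n.samples <- 30

# Simulated expression matrix: genes in rows, patients in columns.
x <- matrix(rnorm(n.genes * n.samples, mean = 6, sd = 0.5), nrow = n.genes)
# Add extra noise to the first 40 genes so that they pass the filter.
x[1:40, ] <- x[1:40, ] + rnorm(40 * n.samples, sd = 2)

# Hypothetical stage labels, one per patient.
stage <- factor(sample(c("B1", "B2", "B3", "B4"), n.samples, replace = TRUE),
                levels = c("B1", "B2", "B3", "B4"))

# (b) Coefficient of variation per gene: standard deviation divided by mean.
cv <- apply(x, 1, sd) / apply(x, 1, mean)
keep <- cv > 0.2
x.f <- x[keep, , drop = FALSE]

# (d) Hierarchical clustering of patients. dist() computes distances between
# rows, so the matrix is transposed to put patients in rows.
hc <- hclust(dist(t(x.f)), method = "complete")
cl <- cutree(hc, k = 4)

# Confusion matrix of cluster membership against the stage labels.
conf <- table(cluster = cl, stage = stage)
```

Complete linkage with Euclidean distance is only one reasonable choice here; other linkage methods or a correlation-based distance can give different clusters. For part (e), the base `heatmap()` function accepts a `ColSideColors` vector that can carry the stage or molecular biology labels as a color bar.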

## Solution Preview


rm(list=ls())
# (1)
library(ISLR)
# NCI60$data is a 64 x 6830 matrix of expression values: 64 cell lines (rows) by 6830 genes (columns).
# NCI60$labs is a vector giving the cancer type of each of the 64 cell lines.
ncidata <- NCI60$data
dim(ncidata)
# 64 x 6830
str(ncidata)
labs <- NCI60$labs
# length(labs) = 64.
labs
#  "CNS"         "CNS"         "CNS"         "RENAL"       "BREAST"      "CNS"         "CNS"         "BREAST"      "NSCLC"
#  "NSCLC"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "BREAST"
#  "NSCLC"       "RENAL"       "UNKNOWN"    "OVARIAN"    "MELANOMA"    "PROSTATE"    "OVARIAN"    "OVARIAN"    "OVARIAN"
#  "OVARIAN"    "OVARIAN"    "PROSTATE"    "NSCLC"       "NSCLC"       "NSCLC"       "LEUKEMIA"    "K562B-repro" "K562A-repro"
#  "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "COLON"       "COLON"       "COLON"       "COLON"
#  "COLON"       "COLON"       "COLON"       "MCF7A-repro" "BREAST"      "MCF7D-repro" "BREAST"      "NSCLC"       "NSCLC"
#  "NSCLC"       "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"
#  "MELANOMA"
table(labs)>2
cancer.types <- names(table(labs))[ table(labs)>2]
cancer.types
#  "BREAST"   "CNS"      "COLON"    "LEUKEMIA" "MELANOMA" "NSCLC"    "OVARIAN" "RENAL"
idx <- which( labs %in% cancer.types)
labs <- labs[idx]
ncidata <- ncidata[idx,]...
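
The preview stops after the filtering step. As a hedged sketch of how part (d) might continue, and not the original solution's code, the following applies a one-way ANOVA to every gene and counts the genes with a Benjamini-Hochberg adjusted p-value below 0.05. It uses a simulated matrix with the same orientation as `ncidata` (cell lines in rows, genes in columns); the names `dat`, `labs.sim`, and `grp` and all sizes are assumptions for illustration.

```r
set.seed(3)
# Hypothetical labels: 3 groups of 8 cell lines each.
labs.sim <- rep(c("A", "B", "C"), each = 8)
n.genes <- 100

# Simulated expression matrix: cell lines in rows, genes in columns.
dat <- matrix(rnorm(length(labs.sim) * n.genes), nrow = length(labs.sim))
# Give the first 10 genes a true shift in group A.
dat[labs.sim == "A", 1:10] <- dat[labs.sim == "A", 1:10] + 3

grp <- factor(labs.sim)

# One-way ANOVA p-value for each gene (each column).
p.raw <- apply(dat, 2, function(g) anova(lm(g ~ grp))[["Pr(>F)"]][1])

# Benjamini-Hochberg adjustment controls the false discovery rate.
p.fdr <- p.adjust(p.raw, method = "BH")

# Number of genes declared differentially expressed at FDR 0.05.
n.sig <- sum(p.fdr < 0.05)
```

With the real data, `dat` and `labs.sim` would be replaced by the filtered `ncidata` and `labs`. Looping over 6830 genes with `lm` is slow but simple, which seems an acceptable trade-off for a one-off analysis.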

