 # R Programming Problems


## Question

(1)
We analyze the NCI60 data set from the ISLR library.
library(ISLR); ncidata <- NCI60$data
(a) Delete the cancer types with only one or two cases (“K562A-repro”, etc.). Keep only the cancer types with at least 3 cases.
(b) Analyze the expression values of the first gene in the data (first column). Does the first gene express differently in different types of cancers? If so, in which pairs of cancer types does the first gene express differently? (Use FDR adjustment.)
(c) Check the model assumptions for analysis in part (b). Is ANOVA analysis appropriate here?
(d) Apply ANOVA to each of the 6830 genes. At an FDR level of 0.05, how many genes are expressed differently among the different cancer types?
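
Parts (b) through (d) can be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the graded solution: it runs on simulated data so that it works without the ISLR package, and the names `expr`, `groups`, `fit1`, and `p_all` are illustrative. With the real data, `expr` would be the filtered `ncidata` matrix (cell lines in rows, genes in columns) and `groups` the filtered labels as a factor.

```r
set.seed(1)

# Simulated stand-in for the filtered NCI60 data (hypothetical values):
# 18 samples (rows) in 3 groups, 50 genes (columns).
groups  <- factor(rep(c("A", "B", "C"), each = 6))
n_genes <- 50
expr <- matrix(rnorm(length(groups) * n_genes),
               nrow = length(groups), ncol = n_genes)
# Shift the first 5 genes in group C so that some genes truly differ.
expr[groups == "C", 1:5] <- expr[groups == "C", 1:5] + 3

# (b) One-way ANOVA for the first gene, then pairwise t-tests
#     with FDR (Benjamini-Hochberg) adjustment.
gene1     <- expr[, 1]
fit1      <- aov(gene1 ~ groups)
p_gene1   <- summary(fit1)[[1]][["Pr(>F)"]][1]
pairs_fdr <- pairwise.t.test(gene1, groups, p.adjust.method = "fdr")

# (c) Assumption checks: normality of residuals and equal group variances.
shapiro_p  <- shapiro.test(residuals(fit1))$p.value
bartlett_p <- bartlett.test(gene1 ~ groups)$p.value

# (d) One ANOVA p-value per gene, then FDR adjustment across all genes.
p_all <- apply(expr, 2, function(y) anova(lm(y ~ groups))[["Pr(>F)"]][1])
n_sig <- sum(p.adjust(p_all, method = "fdr") < 0.05)
```

Here `pairs_fdr$p.value` holds the adjusted p-value for each pair of groups, and `n_sig` counts the genes that pass the 0.05 FDR threshold. The Shapiro-Wilk and Bartlett tests are only one way to check the assumptions; residual plots are a reasonable alternative.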

(2)
We analyze data for the B-cell patients in the ALL data set in the textbook.
(a) Select gene expression data for only the B-cell patients. The analysis in the following parts will use only these gene expression data on the B-cell patients.
(b) Select only those genes whose coefficient of variation (i.e., standard deviation divided by the mean) is greater than 0.2. How many genes are selected?
(c) We wish to conduct a clustering analysis to study natural groupings of the patients predicted by the gene expression profiles. For this analysis, we first need to reduce the number of genes studied. The filter in (b) is one such choice. Please comment on what filtering methods you would use to choose genes, other than the filter in (b). What would you consider the best gene filter in this case?
(d) Conduct a hierarchical clustering analysis with the genes filtered in (b). (For uniformity in grading, everyone should use the filter in (b), even if it is not your best filter from (c).) How do the clusters compare to the B-stages? How do the clusters compare to the molecular biology types (in the variable ALL$mol.biol)? Provide the confusion matrices of both comparisons, using 4 clusters.
(e) Draw two heatmaps for the expression data in (d), one for each comparison. Use color bars to show the comparison types (B-stages or molecular biology types). Which types do the clusters reflect better: B-stages or molecular biology types?
(f) We focus on predicting the B-cell differentiation in the following analysis. We merge the last two categories “B3” and “B4”, so that we are studying 3 classes: “B1”, “B2” and “B34”. (Ignore the unknown type “B” in the analysis.) Use a linear model (limma library) to select genes that are expressed differently among these three classes at an FDR of 0.05. How many genes are selected?
(g) Fit an SVM and a classification tree on the genes selected in part (f), and evaluate their performance with the leave-one-out (delete-one) cross-validated misclassification rate.
(h) We select the genes passing both filters in (b) and (f). How many genes are selected? Redo part (g) on these genes passing both filters.
(i) Which classifier would you consider best among the classifiers studied in parts (g) and (h)? Why?
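
The coefficient-of-variation filter in (b), the clustering comparison in (d), and the leave-one-out error in (g) can be sketched in base R as below. Everything here is an assumption made for illustration: the data are simulated (genes in rows, patients in columns, as in an expression matrix), the helper names `loocv_error`, `fit_centroid`, and `predict_centroid` are hypothetical, and a nearest-centroid rule stands in for the SVM and classification tree so that the sketch needs no extra packages.

```r
set.seed(2)

# Hypothetical stand-in data: 40 genes (rows) x 24 patients (columns).
stage <- factor(rep(c("B1", "B2", "B34"), each = 8))
expr  <- matrix(rnorm(40 * 24, mean = 5, sd = 0.5), nrow = 40, ncol = 24)
# Add a stage-dependent shift to the first 10 genes.
shift <- rep(c(0, 3, 6), each = 8)
expr[1:10, ] <- sweep(expr[1:10, ], 2, shift, "+")

# (b) Keep genes whose coefficient of variation (sd / mean) exceeds 0.2.
cv       <- apply(expr, 1, sd) / rowMeans(expr)
keep     <- cv > 0.2
filtered <- expr[keep, , drop = FALSE]

# (d) Cluster the patients (columns), cut into 4 clusters, and
#     cross-tabulate clusters against the known classes.
hc        <- hclust(dist(t(filtered)), method = "complete")
clusters  <- cutree(hc, k = 4)
confusion <- table(cluster = clusters, stage = stage)

# (g) Leave-one-out cross-validation for any fit/predict pair.
loocv_error <- function(x, y, fit_fun, predict_fun) {
  wrong <- logical(nrow(x))
  for (i in seq_len(nrow(x))) {
    model    <- fit_fun(x[-i, , drop = FALSE], y[-i])
    pred     <- predict_fun(model, x[i, , drop = FALSE])
    wrong[i] <- as.character(pred) != as.character(y[i])
  }
  mean(wrong)  # misclassification rate
}

# Stand-in classifier (not SVM or a tree): nearest class centroid.
fit_centroid <- function(x, y) {
  y <- droplevels(factor(y))
  m <- sapply(levels(y), function(k) colMeans(x[y == k, , drop = FALSE]))
  matrix(m, nrow = ncol(x), dimnames = list(NULL, levels(y)))
}
predict_centroid <- function(centroids, newx) {
  d <- colSums((centroids - as.vector(newx))^2)  # squared distance per class
  names(d)[which.min(d)]
}

loo_rate <- loocv_error(t(filtered), stage, fit_centroid, predict_centroid)
```

To evaluate the classifiers the question asks for, the same `loocv_error` loop could be given fit and predict wrappers around `e1071::svm` and `rpart::rpart`, assuming those packages are installed.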

## Solution Preview


rm(list=ls())
# (1)
library(ISLR)
# NCI60$data is a 64 x 6830 matrix of expression values: 64 cell lines (rows)
# by 6830 genes (columns). NCI60$labs lists the cancer type of each cell line.
ncidata <- NCI60$data
dim(ncidata)
# 64 x 6830
str(ncidata)
labs <- NCI60$labs
# length(labs) = 64.
labs
#  "CNS"         "CNS"         "CNS"         "RENAL"       "BREAST"      "CNS"         "CNS"         "BREAST"      "NSCLC"
#  "NSCLC"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"       "BREAST"
#  "NSCLC"       "RENAL"       "UNKNOWN"    "OVARIAN"    "MELANOMA"    "PROSTATE"    "OVARIAN"    "OVARIAN"    "OVARIAN"
#  "OVARIAN"    "OVARIAN"    "PROSTATE"    "NSCLC"       "NSCLC"       "NSCLC"       "LEUKEMIA"    "K562B-repro" "K562A-repro"
#  "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "COLON"       "COLON"       "COLON"       "COLON"
#  "COLON"       "COLON"       "COLON"       "MCF7A-repro" "BREAST"      "MCF7D-repro" "BREAST"      "NSCLC"       "NSCLC"
#  "NSCLC"       "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"
#  "MELANOMA"
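# Count cell lines per cancer type and flag the types with more than 2 cases
table(labs) > 2
cancer.types <- names(table(labs))[table(labs) > 2]
cancer.types
#  "BREAST"   "CNS"      "COLON"    "LEUKEMIA" "MELANOMA" "NSCLC"    "OVARIAN" "RENAL"
# Keep only the rows (cell lines) whose label is one of the retained types
idx <- which(labs %in% cancer.types)
labs <- labs[idx]
ncidata <- ncidata[idx,]...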
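
The filtering step above can be checked on a small example. The sketch below uses made-up labels and a made-up matrix (`labs_toy` and `data_toy` are hypothetical names, not part of the solution) and applies the same `table()` and `%in%` pattern, so it runs without the ISLR package.

```r
# Toy labels: two types with 3+ cases, two types with fewer.
labs_toy <- c("CNS", "CNS", "CNS",
              "RENAL", "RENAL", "RENAL", "RENAL",
              "PROSTATE", "PROSTATE", "UNKNOWN")
# Toy expression matrix: one row per label, 2 columns.
data_toy <- matrix(seq_len(length(labs_toy) * 2), nrow = length(labs_toy))

counts     <- table(labs_toy)               # cases per type
keep_types <- names(counts)[counts > 2]     # types with more than 2 cases
keep_rows  <- labs_toy %in% keep_types      # logical row selector

labs_kept <- labs_toy[keep_rows]
data_kept <- data_toy[keep_rows, , drop = FALSE]
```

Using the logical vector directly, instead of `which()`, gives the same rows; `drop = FALSE` keeps the result a matrix even if only one row survives.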

