 Statistical Analysis Using R-Programming

Subject: Mathematics, Statistics, R Programming

Question

Problem 1:
Clustering analysis on the "CCND3 Cyclin D3" gene expression values of the Golub et al. (1999) data.
(a) Conduct hierarchical clustering using single linkage and Ward linkage. Plot the cluster dendrogram for both fits. Get two clusters from each of the methods. Use the function table() to compare the clusters with the two patient groups ALL/AML. Which linkage function seems to work better here?
(b) Use k-means cluster analysis to get two clusters. Use table() to compare the two clusters with the two patient groups ALL/AML.
(c) Which clustering approach (hierarchical versus k-means) produces the best match to the two diagnosis groups ALL/AML?
(d) Find the two cluster means from the k-means cluster analysis. Perform a bootstrap on the cluster means. Do the confidence intervals for the cluster means overlap? Which of these two cluster means is estimated more accurately?
(e) Produce a plot of K versus SSE, for K=1, …, 30. How many clusters does this plot suggest?
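
As a rough illustration of parts (d) and (e), the sketch below runs a two-cluster k-means, bootstraps the cluster means, and computes SSE for K = 1, ..., 30. It uses simulated values rather than the Golub data; the vector `expr`, the group sizes, and the number of bootstrap replicates `n_boot` are assumptions made only for this example.

```r
# Simulated stand-in for one gene measured on 38 patients
# (hypothetical values, NOT the Golub data).
set.seed(1)
expr <- c(rnorm(27, mean = 2.0, sd = 0.3),
          rnorm(11, mean = 0.5, sd = 0.4))

# Two-cluster k-means; sort the centers so the cluster order is stable.
km <- kmeans(expr, centers = 2, nstart = 10)
centers <- sort(km$centers[, 1])

# Nonparametric bootstrap: resample the patients, refit k-means starting
# from the original centers, and store the sorted cluster means.
n_boot <- 500
boot_centers <- matrix(NA_real_, nrow = n_boot, ncol = 2)
for (b in seq_len(n_boot)) {
  resampled <- sample(expr, replace = TRUE)
  fit <- kmeans(resampled, centers = centers)
  boot_centers[b, ] <- sort(fit$centers[, 1])
}

# 95% percentile intervals: one column per cluster mean.
ci <- apply(boot_centers, 2, quantile, probs = c(0.025, 0.975))
ci_width <- ci[2, ] - ci[1, ]   # the narrower interval is the more precise estimate

# SSE (total within-cluster sum of squares) for K = 1, ..., 30.
k_values <- 1:30
sse <- sapply(k_values, function(k) {
  kmeans(expr, centers = k, nstart = 10)$tot.withinss
})
plot(k_values, sse, type = "b", xlab = "K", ylab = "SSE")
```

The intervals overlap if the upper limit of the lower cluster mean exceeds the lower limit of the higher one. In the SSE plot, the suggested number of clusters is where the curve stops dropping sharply.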

Problem 2:
Cluster analysis on part of Golub data.
(a) Select the oncogenes and antigens from the Golub data. (Hint: Use grep() ).
(b) On the selected data, do clustering analysis for the genes (not for the patients). Use K-means and K-medoids with K=2 to cluster the genes. Use table() to compare the resulting two clusters with the two gene groups, oncogenes and antigens, for each of the two clustering analyses.
(c) Use appropriate tests (from previous modules) to test the marginal independence in the two by two tables in (b). Which clustering method provides clusters related to the two gene groups?
(d) Plot the cluster dendrograms for this part of the Golub data with single linkage and complete linkage, using Euclidean distance.
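
The workflow in parts (a) to (c) might look like the following minimal sketch. It does not use the real gene names: `gene_names` and the expression matrix `expr` are made up for illustration, and the two simulated gene groups are deliberately shifted apart so the clusters are easy to find. Fisher's exact test is used here as one possible test of independence for the two-by-two table.

```r
# Hypothetical gene descriptions and expression values (NOT the Golub data).
set.seed(2)
n_genes <- 40
gene_names <- paste(
  rep(c("hypothetical oncogene", "hypothetical antigen", "other gene"),
      times = c(15, 15, 10)),
  seq_len(n_genes)
)
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes)
expr[1:15, ] <- expr[1:15, ] + 2   # assumption: shift one group apart

# (a) Select rows whose description mentions "oncogene" or "antigen".
onco_idx <- grep("oncogene", gene_names, ignore.case = TRUE)
anti_idx <- grep("antigen", gene_names, ignore.case = TRUE)
selected <- expr[c(onco_idx, anti_idx), ]
gene_type <- factor(c(rep("oncogene", length(onco_idx)),
                      rep("antigen", length(anti_idx))))

# (b) Cluster the genes (rows) into two groups and cross-tabulate.
km <- kmeans(selected, centers = 2, nstart = 10)
tab <- table(cluster = km$cluster, gene_type)
print(tab)

# (c) Test whether cluster membership is independent of gene type.
ft <- fisher.test(tab)
print(ft$p.value)
```

For the K-medoids counterpart, `cluster::pam(selected, k = 2)` could replace the `kmeans()` call, assuming the cluster package is installed; its `clustering` component plays the role of `km$cluster`.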

Problem 3:
Clustering analysis on NCI60 cancer cell line microarray data (Ross et al. 2000). We use the data set in the ISLR package from r-project (not Bioconductor). You can use the following commands to load the data set.

install.packages('ISLR')
library(ISLR)
ncidata<-NCI60$data
ncilabs<-NCI60$labs

The ncidata (64 by 6830 matrix) contains 6830 gene expression measurements on 64 cancer cell lines. The cancer cell line labels are contained in ncilabs. We do clustering analysis on the 64 cell lines (the rows).
(a) Using k-means clustering, produce a plot of K versus SSE, for K=1,…, 30. How many clusters appear to be there?
(b) Do K-medoids clustering (K=7) with 1-correlation as the dissimilarity measure on the data. Compare the clusters with the cell lines. Which types of cancer are well identified in a cluster? Which types of cancer are not grouped into a cluster? According to the clustering results, which types of cancer are most similar to ovarian cancer?
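
One way to build the 1-correlation dissimilarity in part (b) is sketched below on a small simulated matrix instead of ncidata. The matrix `x`, the labels `labs`, and the choice of two clusters are assumptions for illustration only. The key point is that cor() works on columns, so the matrix is transposed to correlate the rows (cell lines).

```r
# Simulated stand-in: 20 hypothetical cell lines (rows) by 50 genes (columns).
set.seed(3)
x <- matrix(rnorm(20 * 50), nrow = 20)
x[1:10, 1:25] <- x[1:10, 1:25] + 3           # assumption: shared pattern in rows 1-10
labs <- rep(c("typeA", "typeB"), each = 10)  # hypothetical labels

# cor() correlates columns, so transpose to get row-by-row correlations,
# then convert 1 - correlation into a dist object.
d <- as.dist(1 - cor(t(x)))

# K-medoids on the dissimilarities, if the cluster package is available.
if (requireNamespace("cluster", quietly = TRUE)) {
  fit <- cluster::pam(d, k = 2)
  print(table(cluster = fit$clustering, label = labs))
}
```

With the real data, the same pattern would use ncidata in place of `x`, ncilabs in place of `labs`, and k = 7; rows of the resulting table that concentrate in one column indicate a cancer type that is well identified by a single cluster.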

Solution Preview


library("multtest")
data(golub)
dim(golub)
#  3051   38
golub <- data.frame(golub)
gol.fac <- factor( golub.cl, levels=0:1, labels=c("ALL","AML"))

# Problem 1
# Row 1042 holds the "CCND3 Cyclin D3" expression values
clusdata <- as.numeric(golub[1042,])
# dist() only accepts the spelling "euclidean"
hc.sing <- hclust( dist(clusdata,method="euclidean"),method="single")
hc.ward <- hclust( dist(clusdata,method="euclidean"),method="ward.D2")

plot(hc.sing, labels=gol.fac)
rect.hclust(hc.sing,k=2)
groups <- cutree(hc.sing,k=2)
table(groups, gol.fac)
#       gol.fac
# groups ALL AML
#      1  27  10
#      2   0   1
# Single linkage puts every patient but one into cluster 1.
# The lone patient in cluster 2 is an AML patient, which is
# actually a correct classification....
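
For comparing the two linkages side by side, a minimal sketch on simulated values (not the Golub data) could look like this. The vector `expr` and the factor `truth` are hypothetical stand-ins for the gene expression values and the ALL/AML labels.

```r
# Simulated stand-in for one gene on 38 patients (NOT the Golub data).
set.seed(4)
expr <- c(rnorm(27, mean = 2.0, sd = 0.3),
          rnorm(11, mean = 0.5, sd = 0.4))
truth <- factor(rep(c("ALL", "AML"), times = c(27, 11)))

# Same Euclidean distances, two different linkage rules.
d <- dist(expr, method = "euclidean")
hc_single <- hclust(d, method = "single")
hc_ward <- hclust(d, method = "ward.D2")

# Cut each tree into two clusters and cross-tabulate with the labels.
tab_single <- table(cluster = cutree(hc_single, k = 2), truth)
tab_ward <- table(cluster = cutree(hc_ward, k = 2), truth)
print(tab_single)
print(tab_ward)
```

The linkage whose table has most counts concentrated on one diagonal matches the known groups better.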

