 # Statistical Analysis Using R-Programming

Subject: Mathematics, Statistics, R Programming

## Question

Problem 1:
Clustering analysis on the "CCND3 Cyclin D3" gene expression values of the Golub et al. (1999) data.
(a) Conduct hierarchical clustering using single linkage and Ward linkage. Plot the cluster dendrogram for both fits. Get two clusters from each of the methods. Use the function table() to compare the clusters with the two patient groups ALL/AML. Which linkage function seems to work better here?
(b) Use k-means cluster analysis to get two clusters. Use table() to compare the two clusters with the two patient groups ALL/AML.
(c) Which clustering approach (hierarchical versus k-means) produces the best match to the two diagnosis groups ALL/AML?
(d) Find the two cluster means from the k-means cluster analysis. Perform a bootstrap on the cluster means. Do the confidence intervals for the cluster means overlap? Which of these two cluster means is estimated more accurately?
(e) Produce a plot of K versus SSE, for K=1, …, 30. How many clusters does this plot suggest?
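
One way to approach part (d) is a nonparametric bootstrap: resample the patients with replacement, rerun k-means on each resample, and take percentile intervals of the two cluster means. The following is a minimal sketch under stated assumptions, not the reference solution: the vector `x` is synthetic stand-in data (not the real CCND3 values), and the names `dat`, `boot.means`, and `ci` are illustrative.

```r
# Synthetic stand-in for the 38 CCND3 expression values (assumption, not Golub data)
set.seed(1)
x <- c(rnorm(27, mean = 1.9, sd = 0.5), rnorm(11, mean = 0.6, sd = 0.4))
dat <- data.frame(expr = x)

# k-means with two clusters; nstart reruns the algorithm from several random starts
km <- kmeans(dat, centers = 2, nstart = 10)
sort(km$centers[, 1])  # the two cluster means, smaller one first

# Nonparametric bootstrap of the cluster means
n <- nrow(dat)
nboot <- 1000
boot.means <- matrix(NA, nrow = nboot, ncol = 2)
for (i in 1:nboot) {
  dat.star <- dat[sample(n, replace = TRUE), , drop = FALSE]
  km.star <- kmeans(dat.star, centers = 2, nstart = 10)
  # Cluster labels are arbitrary, so sort the means to keep the columns aligned
  boot.means[i, ] <- sort(km.star$centers[, 1])
}

# 95% percentile intervals, one column per cluster mean
ci <- apply(boot.means, 2, quantile, probs = c(0.025, 0.975))
colnames(ci) <- c("lower.cluster", "upper.cluster")
ci

# TRUE here would mean the two intervals overlap
ci[2, "lower.cluster"] >= ci[1, "upper.cluster"]

# Interval widths; the narrower interval belongs to the more precisely estimated mean
apply(ci, 2, diff)
```

Sorting the centers inside the loop matters because k-means may swap the labels 1 and 2 between bootstrap replicates.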

Problem 2:
Cluster analysis on part of Golub data.
(a) Select the oncogenes and antigens from the Golub data. (Hint: Use grep() ).
(b) On the selected data, do clustering analysis for the genes (not for the patients). Use K-means and K-medoids with K=2 to cluster the genes. Use table() to compare the resulting two clusters with the two gene groups (oncogenes and antigens) for each of the two clustering analyses.
(c) Use appropriate tests (from previous modules) to test the marginal independence in the two by two tables in (b). Which clustering method provides clusters related to the two gene groups?
(d) Plot the cluster dendrograms for this part of the Golub data with single linkage and complete linkage, using Euclidean distance.
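
Parts (a) to (c) can be sketched as follows: select rows by matching gene descriptions with grep(), cluster the selected rows with kmeans() and pam(), and test each 2×2 table with Fisher's exact test. This is only an illustration under assumptions: `gene.names` and `expr` are made-up stand-ins for the Golub gene descriptions and expression matrix, and Fisher's exact test is one reasonable choice of independence test (a chi-square test would be another).

```r
library(cluster)  # provides pam(); a recommended package in standard R installs

set.seed(2)
# Hypothetical gene descriptions and expression matrix (rows = genes, columns = patients)
gene.names <- c(paste("oncogene", 1:20),
                paste("surface antigen", 1:30),
                paste("other gene", 1:10))
expr <- matrix(rnorm(60 * 38), nrow = 60)
expr[1:20, ] <- expr[1:20, ] + 1.5  # shift the oncogene rows so there is structure to find

# (a) select rows whose description matches each pattern
onco.idx <- grep("oncogene", gene.names)
anti.idx <- grep("antigen", gene.names)
sel <- expr[c(onco.idx, anti.idx), ]
gene.fac <- factor(c(rep("oncogene", length(onco.idx)),
                     rep("antigen", length(anti.idx))))

# (b) cluster the genes (rows) with K = 2
cl.kmeans <- kmeans(sel, centers = 2, nstart = 10)$cluster
cl.pam <- pam(sel, k = 2)$clustering
tab.kmeans <- table(cl.kmeans, gene.fac)
tab.pam <- table(cl.pam, gene.fac)
tab.kmeans
tab.pam

# (c) test independence between cluster membership and gene group
p.kmeans <- fisher.test(tab.kmeans)$p.value
p.pam <- fisher.test(tab.pam)$p.value
c(kmeans = p.kmeans, pam = p.pam)
```

With real gene descriptions, `ignore.case = TRUE` in grep() may be worth considering, since capitalization can vary between entries.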

Problem 3:
Clustering analysis on NCI60 cancer cell line microarray data (Ross et al. 2000). We use the data set in the package ISLR from r-project (not Bioconductor). You can use the following commands to load the data set.

install.packages('ISLR')
library(ISLR)
ncidata <- NCI60$data
ncilabs <- NCI60$labs

The ncidata matrix (64 by 6830) contains 6830 gene expression measurements on 64 cancer cell lines. The cancer cell line labels are contained in ncilabs. We do clustering analysis on the 64 cell lines (the rows).
(a) Using k-means clustering, produce a plot of K versus SSE, for K=1,…, 30. How many clusters appear to be present?
(b) Do K-medoids clustering (K=7) with 1-correlation as the dissimilarity measure on the data. Compare the clusters with the cell lines. Which types of cancer are well identified in a cluster? Which types of cancer are not grouped into a cluster? According to the clustering results, which types of cancer are most similar to ovarian cancer?
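
Both parts can be sketched on a small synthetic matrix. In the example below, `x` and `labs` are hypothetical stand-ins for ncidata and ncilabs (30 rows in three groups rather than 64 cell lines), so K runs only to 10 and pam() uses k = 3; on the real data the same calls would use K = 1:30 and k = 7. SSE is taken to be the total within-cluster sum of squares reported by kmeans().

```r
library(cluster)  # provides pam()

set.seed(3)
# Hypothetical stand-in for ncidata: 30 "cell lines" x 50 "genes", three true groups
labs <- rep(c("A", "B", "C"), each = 10)
centers <- matrix(rnorm(3 * 50, sd = 3), nrow = 3)
x <- centers[rep(1:3, each = 10), ] + matrix(rnorm(30 * 50), nrow = 30)

# (a) SSE (total within-cluster sum of squares) for each K
K <- 1:10
sse <- sapply(K, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(K, sse, type = "b", xlab = "K", ylab = "SSE")  # look for the "elbow"

# (b) K-medoids with 1 - correlation between rows as the dissimilarity
d.cor <- as.dist(1 - cor(t(x)))  # cor() works on columns, hence the transpose
fit <- pam(d.cor, k = 3)
table(fit$clustering, labs)
```

In the cross-table, a label whose row counts fall almost entirely in one cluster is well identified, while a label spread across several clusters is not.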

## Solution Preview


library("multtest")
data(golub)
dim(golub)
#  3051   38
golub <- data.frame(golub)
gol.fac <- factor( golub.cl, levels=0:1, labels=c("ALL","AML"))

# Problem 1
# Row 1042 holds the CCND3 Cyclin D3 expression values
clusdata <- as.numeric(golub[1042, ])
# dist() only accepts the spelling "euclidean"
hc.sing <- hclust(dist(clusdata, method = "euclidean"), method = "single")
hc.ward <- hclust(dist(clusdata, method = "euclidean"), method = "ward.D2")

plot(hc.sing, labels=gol.fac)
rect.hclust(hc.sing,k=2)
groups <- cutree(hc.sing,k=2)
table(groups, gol.fac)
#       gol.fac
# groups ALL AML
#      1  27  10
#      2   0   1
# Single linkage places 37 of the 38 patients in cluster 1 and isolates
# one patient in cluster 2; that patient is indeed AML, so this single
# assignment is correct....
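
To see how Ward linkage might compare with single linkage on one-dimensional data of this kind, here is a small self-contained sketch. It assumes synthetic values in `x` with a made-up `truth` factor, not the real CCND3 row, so the resulting tables will not match the Golub output.

```r
set.seed(4)
# Hypothetical 1-D expression values: 27 "ALL" and 11 "AML" patients
truth <- factor(rep(c("ALL", "AML"), times = c(27, 11)))
x <- c(rnorm(27, mean = 2, sd = 0.4), rnorm(11, mean = 0.8, sd = 0.4))

d <- dist(x, method = "euclidean")

# Cut each dendrogram into two clusters
cl.single <- cutree(hclust(d, method = "single"), k = 2)
cl.ward <- cutree(hclust(d, method = "ward.D2"), k = 2)

# Cross-tabulate each clustering against the known groups
table(cl.single, truth)
table(cl.ward, truth)
```

Single linkage merges on the nearest pair of points, so it tends to split off isolated values; Ward linkage minimizes within-cluster variance and usually yields more balanced clusters.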
