1 - Cluster analysis on the "Zyxin" expression values of the Golub et al. (1999) data.

(a) Produce a scatter plot of the gene expression values using different symbols for the two groups.

(b) Use single linkage cluster analysis to see whether the tree indicates two different groups.

(c) Use k-means cluster analysis. Do the two clusters agree with the diagnosis of the patient groups?

(d) Perform a bootstrap on the cluster means. You will have to modify the code here and there. Do the confidence intervals for the cluster means overlap?
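The bootstrap in (d) can be sketched independently of the course's R code. The following is a minimal illustration in plain Python, assuming one-dimensional data, two clusters, and percentile intervals; the function names and the synthetic values are hypothetical stand-ins, not the Golub data or the code the exercise refers to.

```python
import random
import statistics

def kmeans_two_means(values, n_iter=100):
    """Two-cluster k-means on 1-D data; returns (lower mean, upper mean)."""
    lo, hi = min(values), max(values)  # start the centres at the extremes
    for _ in range(n_iter):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not a or not b:  # degenerate sample, keep the current centres
            break
        new_lo, new_hi = statistics.fmean(a), statistics.fmean(b)
        if (new_lo, new_hi) == (lo, hi):  # assignments no longer change
            break
        lo, hi = new_lo, new_hi
    return lo, hi

def percentile_interval(draws, level=0.95):
    """Percentile interval taken from the sorted bootstrap draws."""
    s = sorted(draws)
    alpha = (1 - level) / 2
    return s[int(alpha * (len(s) - 1))], s[int((1 - alpha) * (len(s) - 1))]

def bootstrap_cluster_means(values, n_boot=1000, seed=1):
    """Resample with replacement, re-run k-means, collect both cluster means."""
    rng = random.Random(seed)
    lows, highs = [], []
    for _ in range(n_boot):
        sample = rng.choices(values, k=len(values))
        lo, hi = kmeans_two_means(sample)
        lows.append(lo)
        highs.append(hi)
    return percentile_interval(lows), percentile_interval(highs)

# Synthetic stand-in data (NOT the Golub values): 27 low and 11 high values.
rng = random.Random(0)
values = ([rng.gauss(-1.0, 0.3) for _ in range(27)]
          + [rng.gauss(1.5, 0.3) for _ in range(11)])
ci_low, ci_high = bootstrap_cluster_means(values)
print("lower cluster mean, 95% interval:", ci_low)
print("upper cluster mean, 95% interval:", ci_high)
print("intervals overlap:", ci_low[1] >= ci_high[0])
```

Sorting the two means within each resample keeps "lower" and "upper" cluster labels comparable across bootstrap replicates, which is one way to handle the arbitrary cluster numbering that k-means produces.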

2 - Close to CCND3 Cyclin D3. Recall that we did various analyses on the expression data of the CCND3 Cyclin D3 gene of the Golub et al. (1999) data.

(a) Use genefilter to find the ten genes closest to the expression values of CCND3 Cyclin D3. Give their probe names as well as their biological names.

(b) Produce a combined boxplot separately for the ALL and the AML expression values. Compare it with that on the basis of CCND3 Cyclin D3 and comment on the similarities.

(c) Compare the smallest distances with those among the Cyclin genes computed above. What is your conclusion?
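The idea behind "closest genes" in this exercise can be illustrated with a small sketch. It assumes Euclidean distance between expression rows as the closeness measure, which may differ from the default of the R tool the exercise names; the gene labels and numbers below are made up for illustration.

```python
import math

def closest_rows(rows, target, k=10):
    """Return the k rows nearest to rows[target] as (name, distance) pairs."""
    ref = rows[target]
    dists = [(math.dist(ref, vec), name)
             for name, vec in rows.items() if name != target]
    dists.sort()  # smallest distance first
    return [(name, d) for d, name in dists[:k]]

# Hypothetical expression rows (gene name -> values over three patients).
rows = {
    "target": [1.0, 2.0, 3.0],
    "g1": [1.0, 2.0, 3.5],
    "g2": [4.0, 0.0, 1.0],
    "g3": [1.1, 2.0, 3.0],
}
nearest = closest_rows(rows, "target", k=2)
for name, d in nearest:
    print(name, round(d, 3))
```

The returned distances are what part (c) asks to compare: small values relative to the distances among a reference group of genes suggest similar expression profiles.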

3 - MCM3. In the example on MCM3 a plot shows that there is an outlier.

(a) Plot the data and devise a way to find the row number of the outlier.

(b) Remove the outlier and test the correlation coefficient. Compare the results to those above.

(c) Perform the bootstrap to construct a confidence interval.
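One possible approach to parts (a) to (c) is sketched below in plain Python on synthetic data. It assumes the outlier is the observation whose removal changes the Pearson correlation the most (a leave-one-out rule chosen here for illustration, not necessarily the intended answer) and uses a percentile bootstrap over resampled pairs; all names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def most_influential_index(x, y):
    """Index whose removal changes the correlation the most (0-based)."""
    full = pearson(x, y)
    def change(i):
        return abs(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) - full)
    return max(range(len(x)), key=change)

def bootstrap_correlation_interval(x, y, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the correlation, resampling pairs."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    while len(draws) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # degenerate resample, correlation undefined
        draws.append(pearson(xs, ys))
    draws.sort()
    alpha = (1 - level) / 2
    return (draws[int(alpha * (n_boot - 1))],
            draws[int((1 - alpha) * (n_boot - 1))])

# Synthetic stand-in data (NOT the MCM3 values) with one planted outlier.
rng = random.Random(0)
x = [0.5 * i for i in range(20)]
y = [xi + rng.gauss(0, 0.3) for xi in x]
y[7] = -8.0  # planted outlier at position 7
outlier = most_influential_index(x, y)
x_clean = x[:outlier] + x[outlier + 1:]
y_clean = y[:outlier] + y[outlier + 1:]
lo, hi = bootstrap_correlation_interval(x_clean, y_clean)
print("outlier index:", outlier)
print("correlation with / without outlier:",
      round(pearson(x, y), 3), round(pearson(x_clean, y_clean), 3))
print("95% bootstrap interval:", (round(lo, 3), round(hi, 3)))
```

Note that Python indices start at 0, whereas row numbers in R start at 1, so the reported position would be shifted by one in an R solution.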

4 - Cluster analysis on part of Golub data.

(a) Select the oncogenes from the Golub data and plot the tree from a single linkage cluster analysis.

(b) Do you observe meaningful clusters?

(c) Select the antigen genes and answer the same questions.

(d) Select the receptor genes and answer the same questions.
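For readers who want to see what single linkage does before plotting a tree, here is a minimal sketch in plain Python. It assumes Euclidean distance and merges clusters until a requested number remains; it is a naive illustration on made-up points, not the routine used in the course.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering where the distance between two clusters is
    the smallest distance between any pair of their members."""
    clusters = [[i] for i in range(len(points))]
    merge_heights = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_heights.append(d)  # height at which the two clusters join
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merge_heights

# Made-up one-dimensional points, written as 1-tuples.
points = [(0.0,), (0.2,), (0.5,), (5.0,), (5.1,), (9.0,)]
clusters, heights = single_linkage(points, n_clusters=2)
print("clusters (point indices):", clusters)
print("merge heights:", [round(h, 2) for h in heights])
```

The merge heights correspond to the vertical positions of the joins in a dendrogram. A large jump between consecutive heights is the kind of evidence for "meaningful clusters" that part (b) asks about.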

5 - Principal Components Analysis on part of the ALL data.

(a) Construct an expression set with the patients with B-cell in stage B1, B2, and B3. Compute the corresponding ANOVA p-values of all gene expressions. Construct the expression set of genes with p-values smaller than 0.001. Report the dimensionality of the data matrix with gene expressions.

(b) Are the correlations between the patients positive?

(c) Compute the eigenvalues of the correlation matrix. Report the largest five. Are the first three larger than one?

(d) Program a bootstrap of the largest five eigenvalues. Report the bootstrap 95% confidence intervals and draw relevant conclusions.

(e) Plot the genes in a plot of the first two principal components.
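The eigenvalue computation in part (c) can be illustrated without any numerical library. The sketch below builds a Pearson correlation matrix from columns and extracts the largest eigenvalues by power iteration with deflation, a simple technique swapped in here for illustration (a statistics package would normally use a dedicated eigen-solver). The data and function names are hypothetical.

```python
import math
import random

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    def standardize(col):
        m = sum(col) / len(col)
        centered = [v - m for v in col]
        norm = math.sqrt(sum(c * c for c in centered))
        return [c / norm for c in centered]
    z = [standardize(c) for c in columns]
    return [[sum(p * q for p, q in zip(a, b)) for b in z] for a in z]

def largest_eigenvalues(sym, k, n_iter=1000, seed=0):
    """Largest k eigenvalues of a symmetric positive semi-definite matrix,
    found one at a time by power iteration followed by deflation."""
    rng = random.Random(seed)
    a = [row[:] for row in sym]  # work on a copy
    n = len(a)
    found = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n)]  # random start vector
        for _ in range(n_iter):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the (unit) vector gives the eigenvalue.
        lam = sum(v[i] * sum(a[i][j] * v[j] for j in range(n))
                  for i in range(n))
        found.append(lam)
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return found

# Made-up data: three variables observed five times (NOT the ALL data).
columns = [[1, 2, 3, 4, 5], [2, 4, 5, 4, 5], [5, 3, 4, 2, 1]]
corr = correlation_matrix(columns)
eig = largest_eigenvalues(corr, k=3)
print("eigenvalues:", [round(e, 3) for e in eig])
print("larger than one:", [e > 1 for e in eig])
```

Because the eigenvalues of a correlation matrix sum to the number of variables, comparing each eigenvalue with one shows whether a component explains more variance than a single standardized variable does.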
