Question

Problem 1)
(a) Define an indicator variable ALL.fac such that ALL.fac=1 for T-cell patients and ALL.fac=2 for B-cell patients.
(b) Plot the histograms for the first three genes’ expression values in one row.
(c) Plot the pairwise scatterplots for the first five genes.
(d) Do a 3D scatterplot for the genes “39317_at”, “32649_at” and “481_at”, and color according to ALL.fac (give different colors for B-cell versus T-cell patients). Can the two patient groups be distinguished using these three genes?
(e) Do K-means clustering for K=2 and K=3 using the three genes in (d), and compare the resulting clusters with the two patient groups. Are the two groups discovered by the clustering analysis?
(f) Carry out the PCA on the ALL data set with scaled variables. What proportion of variance is explained by the first principal component? By the second principal component?
(g) Do a biplot of the first two principal components. Observe the pattern for the loadings. What information is the first principal component summarizing?
(h) For the second principal component PC2, print out the three genes with biggest PC2 values and the three genes with smallest PC2 values.
(i) Find the gene names and chromosomes for the gene with biggest PC2 value and the gene with smallest PC2 value. (Hint: review Module 10 on searching the annotation.)

Problem 2)
Variable scaling and PCA in the iris data set
In this module and the last module, we mentioned that the variables are often scaled before doing the PCA or the clustering analysis. By “scaling a variable”, we mean applying a linear transformation that centers and rescales the observations to have mean zero and standard deviation one. In the last module, we also mentioned using the correlation-based dissimilarity measure versus using the Euclidean distance in clustering analysis. It turns out that the correlation-based dissimilarity measure is proportional to the squared Euclidean distance on the scaled variables. We check this on the iris data set, and we compare the PCA on scaled versus unscaled variables for the iris data set.
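The proportionality claim can be checked numerically. If two columns x and y are each scaled to mean zero and standard deviation one (using the n-1 denominator, as scale() does), then sum(x^2) = sum(y^2) = n-1 and sum(x*y) = (n-1)*r, so the squared Euclidean distance is 2(n-1)(1-r). The following is a minimal sketch of that check on iris; the variable names (iris.num, iris.sc, ratio) are illustrative choices and not part of the original solution.

```r
# Sketch: compare squared distances between scaled columns with 1 - correlation.
iris.num <- as.matrix(iris[, 1:4])   # drop the categorical Species column
iris.sc  <- scale(iris.num)          # each column: mean 0, sd 1

# Scaling is a linear transformation, so the correlations should not change.
r.raw <- cor(iris.num)
r.sc  <- cor(iris.sc)
same.cor <- isTRUE(all.equal(r.raw, r.sc, check.attributes = FALSE))

# dist() measures distances between rows, so transpose to compare columns.
d2 <- as.matrix(dist(t(iris.sc)))^2

# Use off-diagonal entries only (the diagonal would give 0/0).
off   <- upper.tri(d2)
ratio <- d2[off] / (1 - r.sc[off])
n     <- nrow(iris.sc)

print(same.cor)
print(ratio)        # each entry should equal 2 * (n - 1)
print(2 * (n - 1))
```

Every ratio should agree with 2(n-1) up to floating-point error, which gives the proportionality factor for n = 150 observations.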
(a) Create a data set consisting of the first four numerical variables in the iris data set (that is, drop the last variable, Species, which is categorical). Then make a scaled data set that centers and scales each of the four variables (columns) to have mean zero and variance one.
(b) Calculate the correlations between the columns of the data sets using the cor() function. Show that these correlations are the same for the scaled and the unscaled data sets.
(c) Calculate the Euclidean distances between the columns of the scaled data set using the dist() function. Show that the squares of these Euclidean distances are proportional to the values of (1-correlation). What is the value of the proportionality factor here?
(d) Show the outputs for doing PCA on the scaled data set and on the unscaled data set. (Apply PCA on the two data sets with option “scale=FALSE”. Do NOT use option “scale=TRUE”, which will scale data no matter which data set you are using.) Are they the same?
(e) What proportions of variance are explained by the first two principal components in the scaled PCA and in the unscaled PCA?
(f) Find a 90% confidence interval for the proportion of variance explained by the second principal component in the scaled PCA.
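Parts (d) to (f) could be approached along the following lines. This is a sketch under stated assumptions, not the purchased solution: it uses prcomp() with scale. = FALSE on both data sets as the problem requires, and for (f) it assumes a percentile bootstrap over the rows, with an arbitrary seed and 1000 resamples. The helper name pve and the other variable names are illustrative.

```r
# Sketch: PCA on unscaled vs scaled iris, plus a bootstrap interval for PC2.
iris.num <- as.matrix(iris[, 1:4])
iris.sc  <- scale(iris.num)

pca.raw <- prcomp(iris.num, scale. = FALSE)
pca.sc  <- prcomp(iris.sc,  scale. = FALSE)

# Proportion of variance explained by each component (hypothetical helper).
pve <- function(p) p$sdev^2 / sum(p$sdev^2)
print(round(pve(pca.raw), 4))
print(round(pve(pca.sc), 4))

# Percentile bootstrap for PC2's share of variance in the scaled PCA.
# The eigenvalues of the correlation matrix are the component variances of
# a PCA on scaled data, so each resample is implicitly rescaled.
set.seed(1)        # arbitrary seed, for reproducibility only
B <- 1000          # number of resamples (an assumption)
n <- nrow(iris.num)
boot.pve2 <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  ev  <- eigen(cor(iris.num[idx, ]))$values
  ev[2] / sum(ev)
})
ci90 <- unname(quantile(boot.pve2, c(0.05, 0.95)))
print(ci90)
```

The 5% and 95% quantiles of the bootstrap distribution give one reasonable 90% interval; other constructions (for example a normal-approximation interval) would also be defensible.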

Solution Preview


# Problem 1
# (a)
library(multtest)
data(golub)
golub.gnames
library(genefilter)
library("ALL")
data(ALL) # 12625 gene expression values for 128 different persons

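# ALL$BT holds the subtype codes B, B1-B4 and T, T1-T4.
patientB <- factor(ALL$BT %in% c("B","B1","B2","B3","B4"))
patientT <- factor(ALL$BT %in% c("T","T1","T2","T3","T4"))
# The levels of patientB are FALSE < TRUE, so as.numeric() gives
# ALL.fac=1 for T-cell patients and ALL.fac=2 for B-cell patients.
ALL.fac <- as.numeric(patientB)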
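# (b)
par(mfrow=c(1,3)) # three panels in one row
# Loop instead of apply() so that each panel is titled with its probe ID
# and the histogram objects are not printed to the console.
for (g in featureNames(ALL)[1:3]) hist(exprs(ALL)[g,], main=g, xlab="Expression")
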
# (c)
pairs( t(exprs(ALL[1:5,])) ,col=ALL.fac)...
