1. Analysis of the ALL data set.
(a) Define an indicator variable IsB such that IsB=TRUE for B-cell patients and IsB=FALSE for T-cell patients.
(b) Use two genes "39317_at" and "38018_g_at" to fit a classification tree for IsB. Print out the confusion matrix. Plot ROC curve for the tree.
(c) Find its empirical misclassification rate (mcr), false negative rate (fnr) and specificity. Find the area under curve (AUC) for the ROC curve.
(d) Use 10-fold cross-validation to estimate its real false negative rate (fnr). What is your estimated fnr?
(e) Do a logistic regression, using genes "39317_at" and "38018_g_at" to predict IsB. Find an 80% confidence interval for the coefficient of gene "39317_at".
(f) Use n-fold cross-validation to estimate misclassification rate (mcr) of the logistic regression classifier. What is your estimated mcr?
(g) Conduct a PCA on the scaled variables of the whole ALL data set (NOT just the two genes used above). We do this to reduce the dimension in terms of genes (so this PCA should be done on the transpose of the matrix of expression values). To simplify our future analysis, we use only the first K principal components (PCs) to represent the data. How many PCs should be used? Explain how you arrived at your conclusion. Provide graphs or other R outputs to support your choice.
(h) Fit an SVM classifier of IsB using only the first five PCs. (The number K=5 is fixed so that we all use the same classifier. You do not need to choose this number in the previous part (g).) What is the sensitivity of this classifier?
(i) Use leave-one-out cross-validation to estimate misclassification rate (mcr) of the SVM classifier. Report your estimate.
(j) If you had to choose between classifiers in part (e) and in part (h), which one would you choose? Why?
2. Choosing Classifiers and Number of Principal Components for the PCA-reduced iris data set. In the last example of this module, we compared three classifiers on the iris data by working on the first three principal components. We chose the best classifier based on the cross-validated misclassification rate. We can also choose the number of principal components by cross-validation, instead of fixing it at K=3. Use leave-one-out cross-validation to choose the number of principal components together with the classifier. Please report the empirical misclassification rates (on the whole data set) and the leave-one-out cross-validation misclassification rates for each value of K=1, 2, 3, 4 principal components and for each of the three classifiers: logistic regression, support vector machine and classification tree. Based on those rates, what is your choice? Note: when you fit the logistic regression with K=1 principal component, PC1 becomes a vector instead of a matrix. You will need to handle K=1 in the logistic regression code differently from the other values K=2, 3, 4.
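The K=1 note in problem 2 comes from how R subsets matrices: selecting a single column drops the result to a plain vector. One possible way around it, shown here as a minimal sketch rather than the module's own code, is to subset with `drop = FALSE` so the scores stay a matrix for every K. The names `pca`, `scores` and `dat` are illustrative, and scaling the four iris measurements is an assumption.

```r
# Minimal sketch: keep the PC scores as a matrix even when K = 1.
pca <- prcomp(iris[, 1:4], scale. = TRUE)   # scaling is an assumption here

for (K in 1:4) {
  scores <- pca$x[, 1:K, drop = FALSE]      # drop = FALSE keeps a 150 x K matrix
  dat <- data.frame(Species = iris$Species, scores)
  # dat has columns Species, PC1, ..., PCK for every K, so one model
  # formula (Species ~ .) could be reused across the values of K
  cat("K =", K, ": columns =", paste(names(dat), collapse = ", "), "\n")
}

# Without drop = FALSE the K = 1 case collapses to a plain numeric vector
is.matrix(pca$x[, 1])                 # FALSE
is.matrix(pca$x[, 1, drop = FALSE])   # TRUE
```

With the data frame built this way, the same fitting call can be used inside the loop for each classifier, instead of writing a special case for K=1.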
# (a) Define an indicator variable IsB such that IsB=TRUE for B-cell patients and IsB=FALSE for T-cell patients.
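library(ALL); data(ALL)  # Bioconductor ALL data set: 12625 probes x 128 patients
# every patient is labelled B- or T-cell, so this subset keeps all 128 columns
allBT <- ALL[, which(ALL$BT %in% levels(ALL$BT))]
IsB <- (ALL$BT %in% c("B","B1","B2","B3","B4"))  # TRUE for B-cell, FALSE for T-cell
length(IsB) # 128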
# (b) Use two genes "39317_at" and "38018_g_at" to fit a classification tree for IsB. Print out the confusion matrix.
# Plot ROC curve for the tree.
library(rpart)  # classification trees
prob.names <- c("39317_at", "38018_g_at")
expr.data <- exprs(allBT)[prob.names,]
# 2 x 128 (genes in rows), so transpose to get one row per patient
c.tr <- rpart(as.factor(IsB)~., data=data.frame(t(expr.data)))
rpartpred <- predict(c.tr, type="class")...
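The rates asked for in part (c) can all be read off a 2 x 2 confusion matrix. The following is a minimal sketch on made-up labels, not the continuation of the solution above: `truth` and `pred` are hypothetical stand-ins for `IsB` and the tree's predictions, and treating TRUE (B-cell) as the positive class is an assumption.

```r
# Hypothetical labels and predictions, for illustration only (not ALL data)
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
pred  <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

# Rows are predictions, columns are true classes; fixing the factor levels
# guarantees a 2 x 2 table even if one class is never predicted
cm <- table(pred  = factor(pred,  levels = c(TRUE, FALSE)),
            truth = factor(truth, levels = c(TRUE, FALSE)))

tp <- cm["TRUE", "TRUE"];  fn <- cm["FALSE", "TRUE"]
fp <- cm["TRUE", "FALSE"]; tn <- cm["FALSE", "FALSE"]

mcr         <- (fp + fn) / sum(cm)   # misclassification rate
fnr         <- fn / (tp + fn)        # false negative rate = 1 - sensitivity
specificity <- tn / (tn + fp)        # true negative rate

print(cm)
print(c(mcr = mcr, fnr = fnr, specificity = specificity))
```

Indexing the table by level name rather than by position avoids mixing up the cells when the row and column order changes.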