## Question

1. Let's explore the differences between student populations.

(a) Compare the distributions of each individual variable across populations using sample mean vectors, covariance matrices, and boxplots.

(b) Do you identify any outliers using boxplots? If so, how many standard deviations above or below each mean is each outlier?

(c) Compare the total variability across data sets using two different measures of total variability. Which class is more variable: UG or GR?

(d) Get a correlation matrix, biplot, and scatter plot matrix for each data set and use each to identify the pair of variables that are the most strongly correlated for UG and for GR students.

(e) Get the scree plots for each data set. Use them to determine whether analyses based on the biplots, such as the correlation analysis above, are reasonable.

(f) Compare the overall correlation across data sets using a single statistic for each data set.

(g) Get the first principal component for each data set.

i. Explain why you should perform the PCA on the scaled data and not the original data.

ii. Relate the difference in the loadings to the difference in the correlation matrices.

(h) Get boxplots and scatter plots for the first two PC scores for each data set. Circle any outliers on your plots. Center and scale any outlier and use this information to explain how it differs from the rest of the data.

(i) Suppose that a centered and scaled observation is (2, -2, -2, 2, 0). Based on the first two sets of loadings for the GR data set, should this observation be considered an outlier? Justify your answer.

2. For each data set, determine whether multivariate normality is reasonable using the following tools.

(a) Use Mardia’s tests to determine if the data are thin or fat tailed or skewed.

(b) Use biplots. Remember to assess this approach.

(c) Use QQ plots of the first and second sets of PC scores. Assess this approach.

(d) Use QQ plots based on the Mahalanobis distances.

(e) Formally identify outliers using a Bonferroni 0.05/n quantile.
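Parts 2(d) and 2(e) can be sketched as follows. This is a minimal illustration on simulated data, not the student data: the matrix `X` is a stand-in, and the chi-square reference distribution (with degrees of freedom equal to the number of variables) is an assumption, since the question does not say which distribution the 0.05/n quantile refers to.

```r
# Simulated stand-in data (NOT the student data): 60 rows, 5 numeric columns.
set.seed(3)
X <- matrix(rnorm(60 * 5), nrow = 60, ncol = 5)
n <- nrow(X)
p <- ncol(X)

# Squared Mahalanobis distance of each row from the sample mean vector.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Chi-square QQ plot: under multivariate normality the points should fall
# close to the reference line through the origin with slope 1.
theo <- qchisq(ppoints(n), df = p)
plot(theo, sort(d2),
     xlab = "Chi-square quantiles", ylab = "Ordered squared distances")
abline(0, 1)

# Bonferroni-style cutoff (assumed chi-square reference): upper 0.05 / n quantile.
cutoff <- qchisq(1 - 0.05 / n, df = p)
outlier_rows <- which(d2 > cutoff)
print(cutoff)
print(outlier_rows)
```

Dividing 0.05 by n keeps the chance of flagging at least one point from clean normal data at roughly 5% overall, which is why the cutoff is stricter than the ordinary 0.95 quantile.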

## Solution Preview


# Loading essential libraries

library(ggplot2)

#library(plyr)

library(psych)

library(dplyr)

library(corrplot)

library(MVN)

library(car)

# Reading data

dat <- read.csv("StudentData.csv")

# Selecting numeric columns

df <- dat[, c('HSClass', 'TxtSent', 'TxtRec', 'Fbtime', 'Introvert')]

# Calculating mean of each variable

mean_all <- colMeans(df)

# Calculating covariance matrix

cov_mat <- cov(df)

# Plot box plot of the data

boxplot(df, main = "Box plot of each variable of the data")

# HSClass has the highest mean, while Introvert has the lowest

print(mean_all)

# HSClass has a large positive covariance with TxtSent and TxtRec

print(cov_mat)
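Part 1(c) asks for two measures of total variability. Two common choices are the total variance (the trace of the sample covariance matrix) and the generalized variance (its determinant). The sketch below computes both on a small simulated data frame; `toy` and its columns are illustrative stand-ins, not the student data, and the same two calculations could be applied to a covariance matrix such as `cov_mat` computed separately for UG and GR.

```r
# Simulated stand-in data (NOT the student data), used only so the sketch runs.
set.seed(1)
toy <- data.frame(x1 = rnorm(50, sd = 1),
                  x2 = rnorm(50, sd = 2),
                  x3 = rnorm(50, sd = 3))

S <- cov(toy)  # sample covariance matrix

total_variance       <- sum(diag(S))  # trace: sum of the individual variances
generalized_variance <- det(S)        # determinant: also reflects correlation

# The trace equals the sum of the eigenvalues, the determinant their product.
ev <- eigen(S, symmetric = TRUE)$values
print(c(trace = total_variance, sum_eigen = sum(ev)))
print(c(det = generalized_variance, prod_eigen = prod(ev)))
```

The two measures can disagree: the trace ignores covariances entirely, while the determinant shrinks toward zero when variables are strongly correlated. Both depend on the units of the variables, so a comparison across populations is only meaningful when the variables are measured on the same scales.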

# Yes, the boxplot shows two outliers: one in HSClass at around 1200 and

# another in TxtRec at 200.
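Part 1(b) also asks how many standard deviations each outlier lies from its mean. A minimal sketch of that calculation is shown here on a simulated vector with one planted extreme value; `x` is a stand-in, not a column of the student data, and the same steps could be applied to a column such as `df$HSClass`.

```r
# Simulated stand-in column (NOT the student data) with one planted extreme value.
set.seed(2)
x <- c(rnorm(40, mean = 100, sd = 15), 400)

# boxplot.stats() returns the points beyond the whiskers (1.5 * IQR rule by default),
# which are the same points boxplot() draws individually.
flagged <- boxplot.stats(x)$out

# Distance of each flagged point from the sample mean, in sample standard deviations.
z <- (flagged - mean(x)) / sd(x)
print(data.frame(value = flagged, z_score = z))
```

Note that the mean and standard deviation here include the outlier itself, which pulls the mean toward it and inflates the standard deviation, so the reported z-score understates how unusual the point is relative to the rest of the data.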

# Plotting histograms to see the outliers in the far right tails of the skewed distributions.

hist(df$HSClass, xlab = "HSClass", main = "Distribution of HSClass")

hist(df$TxtRec, xlab = "TxtRec", main = "Distribution of TxtRec")...

By purchasing this solution you'll be able to access the following files:

Solution.R.