Consider the social media data set, which records introversion level and several social media usage variables for undergraduate (UG) and graduate (GR) students.

1. Let's explore the difference between student populations.

(a) Compare the distributions of each individual variable across populations using sample mean vectors, covariance matrices, and boxplots.

(b) Do you identify any outliers using boxplots? If so, how many standard deviations above or below each mean is each outlier?

(c) Compare the total variability across data sets using two different measures of total variability. Which class is more variable: UG or GR?

(d) Get a correlation matrix, biplot, and scatter plot matrix for each data set and use each to identify the pair of variables that are the most strongly correlated for UG and for GR students.

(e) Get the scree plots for each data set. Use them to determine if analyses based on the biplots, such as the correlation analysis above, are reasonable.

(f) Compare the overall correlation across data sets using a single statistic for each data set.

(g) Get the first principal component for each data set.

i. Explain why you should perform the PCA on the scaled data and not the original data.

ii. Relate the difference in the loadings to the difference in the correlation matrices.

(h) Get boxplots and scatter plots for the first two PC scores for each data set. Circle any outliers on your plots. Center and scale any outlier and use this information to explain how it differs (or they differ) from the rest of the data.

(i) Suppose that a centered and scaled observation is (2, -2, -2, 2, 0). Based on the first two sets of loadings for the GR data set, should this observation be considered an outlier? Justify your answer.

2. For each data set, determine if multivariate normality is reasonable using the following tools.

(a) Use Mardia’s tests to determine if the data are thin-tailed, fat-tailed, or skewed.

(b) Use biplots. Remember to assess this approach.

(c) Use QQ plots of the first and second set of PC scores. Assess this approach.

(d) Use QQ plots based on the Mahalanobis distances.

(e) Formally identify outliers using a Bonferroni 0.05/n quantile.
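
One possible reading of parts (d) and (e), offered only as a sketch: under multivariate normality the squared Mahalanobis distances are approximately chi-square with p degrees of freedom, so an observation can be flagged when its squared distance exceeds the 1 - 0.05/n chi-square quantile. The helper name `bonferroni_outliers` and the toy data are hypothetical and are not part of the assignment or its data set.

```r
# Hypothetical helper: flag multivariate outliers with a Bonferroni cutoff.
# Assumes squared Mahalanobis distances are roughly chi-square(p) under normality.
bonferroni_outliers <- function(X, alpha = 0.05) {
  X <- as.matrix(X)
  n <- nrow(X)                                # number of observations
  p <- ncol(X)                                # number of variables
  d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared distances to the sample mean
  cutoff <- qchisq(1 - alpha / n, df = p)     # Bonferroni-adjusted quantile
  list(d2 = d2, cutoff = cutoff, outliers = which(d2 > cutoff))
}

# Toy data (made up for illustration): 49 bivariate normal points plus one extreme point
set.seed(1)
toy <- rbind(matrix(rnorm(98), ncol = 2), c(20, 20))
res <- bonferroni_outliers(toy)
res$outliers
```

The same distances can feed the QQ plot in part (d), for example by plotting `sort(res$d2)` against `qchisq(ppoints(nrow(toy)), df = ncol(toy))` and checking for departures from a straight line.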

Solution Preview


# Loading essential libraries

# Reading data
dat <- read.csv("StudentData.csv")
# Selecting numeric columns
df <- dat[, c('HSClass', 'TxtSent', 'TxtRec', 'Fbtime', 'Introvert')]

# Calculating the sample mean vector (mean of each variable)
mean_all <- colMeans(df)

# Calculating covariance matrix
cov_mat <- cov(df)

# Plot box plot of the data
boxplot(df, main = "Box plot of each variable of the data")

# HSClass has the highest mean value, while Introvert has the lowest

# HSClass has a large positive covariance with TxtSent and TxtRec

# Yes: the boxplot shows two outliers, one in HSClass at around 1200 and another
# one in TxtRec at 200.

# Plotting histograms to see each outlier in the far right tail of its skewed distribution.
hist(df$HSClass, xlab = "HSClass", main = "Distribution of HSClass")
hist(df$TxtRec, xlab = "TxtRec", main = "Distribution of TxtRec")...
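
Question 1(b) also asks how many standard deviations each outlier lies from its mean. A minimal sketch of that calculation follows, assuming the outliers are the points flagged by R's default boxplot rule (beyond 1.5 times the hinge spread). The helper name `outlier_z` and the toy vector are hypothetical and do not come from the preview.

```r
# Hypothetical helper: z-scores of the points a default boxplot would flag
outlier_z <- function(x) {
  out <- boxplot.stats(x)$out          # values beyond the boxplot whiskers
  z <- (out - mean(x)) / sd(x)         # standard deviations above (+) or below (-) the mean
  data.frame(value = out, z = z)
}

# Toy vector (made up for illustration) with one large value
toy <- c(10, 12, 11, 13, 12, 11, 50)
res_z <- outlier_z(toy)
res_z

# With the preview's data frame, this could be applied column by column:
# lapply(df, outlier_z)
```

Note that the mean and standard deviation here include the outlier itself, so the reported z-score understates how far the point sits from the bulk of the data.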
