Please use R Markdown for your submission. Include the following files:
Your Rmd file.
The compiled/knitted HTML document.
The knitted document should be clear, well-formatted, and contain all relevant R code, output, and
explanations. R code style should follow the Tidyverse style guide: https://style.tidyverse.org/
Collaboration must adhere to Level 1 collaboration described in the Stats 20 Collaboration Policy.
Note: All questions on this homework should be done using only functions or syntax discussed
in the lecture notes. No credit will be given for use of outside functions.
The following information is used in Questions 1 and 2.
The births.csv file on CCLE contains data on a sample of babies born in North Carolina. We are interested
in investigating the association between baby weight and the mother's smoking habit (smokers versus
non-smokers).
Read the dataset into R and save it to the workspace. Verify that the data has been loaded correctly.
The Habit variable contains the smoker status for the mother of each baby in the sample. There are several
observations in the data for which there is no information about the mother's smoker status. How many
observations are in this category? What is the name of the category (level) in the Habit variable for these
observations?
Create a bar plot of the mother's smoker status in the data. Change the color of the bars from the default
grey color to a different color of your choice.
The droplevels() function drops unused levels from a factor or from factors in a data frame. Extract the
observations in the data for which we know the mother's smoker status (either smoker or non-smoker) and
drop the unused level from the Habit variable. Save this subset of the data as a separate data frame in the
workspace. The subset of the data should contain all of the variables from the original data.
Using the data frame from (d), create side-by-side boxplots of the weight variable split by the mother's
smoker status.
Create overlapping relative frequency histograms for the distributions of weight split by the mother's smoker
status (either smoker or non-smoker). Be sure that the plot satisfies the following criteria:
Change the x-label and main title to be more informative.
Use different colors for the two histograms. Change the density of the shading so that the overlap
between the histograms is visible.
Superimpose density curves of the data, matching the color of the curves to the corresponding histogram.
Add vertical lines that show the median weights for each distribution.
Add a legend that helps understand the various colors/components of the plot.
Based on the plot, do you think there is a significant difference between the typical weight of a baby born to
a mother who smokes and the typical weight of a baby born to a mother who does not smoke?
Hint: For legends with mixed types of symbols (points, lines, boxes, etc.), the pch, lty, density, and border
(and other) arguments use NA to exclude those arguments from modifying the corresponding entries in the
legend (fill uses 0 instead of NA). For example, a legend with two box entries and one line entry could have
arguments density = c(20, 30, NA), border = c(1, 1, NA), and lty = c(NA, NA, 1).
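For illustration only, the hint can be sketched with simulated toy data (not the births data). The vectors group_a and group_b, the colors, and the axis limits are all made up for this example; the point is how the legend() arguments line up entry by entry.

```r
# Toy data (hypothetical values, unrelated to births.csv)
set.seed(1)
group_a <- rnorm(200, mean = 5, sd = 1)
group_b <- rnorm(200, mean = 6, sd = 1)

# Two overlapping histograms drawn with shading lines instead of solid fill
hist(group_a, freq = FALSE, col = "steelblue", density = 20,
     xlim = c(1, 10), ylim = c(0, 0.5),
     main = "Two toy samples", xlab = "Value")
hist(group_b, freq = FALSE, col = "tomato", density = 30, angle = 135,
     add = TRUE)
abline(v = median(group_a), lty = 1)

# Entries 1 and 2 are boxes, entry 3 is a line:
# fill uses 0 for "no box"; density, border, and lty use NA to skip an entry
legend_labels <- c("Group A", "Group B", "Median of A")
legend("topright", legend = legend_labels,
       fill = c("steelblue", "tomato", 0),
       density = c(20, 30, NA),
       border = c(1, 1, NA),
       lty = c(NA, NA, 1))
```

Each argument is a vector with one element per legend entry, so the third position controls only the line entry.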
The following information is used in Questions 3 and 4.
The diamonds data in the ggplot2 package contains the prices and other measurements of almost 54,000
round cut diamonds.
Use scatterplots to explore possible relationships between the four numeric variables carat, depth, table,
and price. Based on these plots, which two variables seem to have the strongest relationship? Does this
relationship appear to be linear or nonlinear?
Construct a scatterplot between the two numeric variables with the strongest relationship. Put the variable
with higher variability on the y-axis.
Change the point character from the default open circle to a different symbol of your choice.
Shrink the size of the points to between 10% and 50% of the default size so that the points are easier to
see.
Set the col argument to the clarity variable (i.e., col = clarity) to color the points in the scatterplot
according to the clarity of each diamond.
Add a legend that explains the color coding in the plot.
Explain why the default colors of 1 to 8 are chosen.
Hint: What is the type/class of the clarity column?
Construct the same scatterplot from (b) again, but change the default colors to different colors of your choice.
The color of the points should still correspond to the clarity of each diamond. Be sure to update the legend
to correspond to your new color scheme.
Hint: The 657 built-in color names can be found with the command colors(), or view them here: http:
Interpret the scatterplot from (b) or (c). What does the three-way relationship you observe tell you about
the diamonds in the data?
Compute the mean price for each color and cut combination in the diamonds data, and store the result.
The result should be a matrix object, where the rows correspond to the levels of color and the columns
correspond to the levels of cut.
Hint: This computation can be done with one command.
The matplot() function plots columns of a matrix (or plots columns of one matrix against the columns of
another). Use the matplot() function on the output from (a) to create a line plot with a separate line for
each level of cut.
Distinguish each line by separate line types, line widths, and/or colors of your choice.
Use the xaxt = "n" argument to suppress the tick marks and labels on the x-axis, then use the axis()
function to set the x-axis labels to be the levels of the color variable.
Add a legend that explains the differences in the lines in the plot.
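As a rough illustration of how matplot(), xaxt = "n", and axis() fit together, here is a sketch on a small made-up matrix. The object toy_means, its row and column names, and the colors are assumptions for this example only and have nothing to do with the diamonds data.

```r
# Hypothetical 3 x 2 matrix: rows are the x-axis levels, columns become lines
toy_means <- matrix(c(10, 12, 15, 11, 14, 18), nrow = 3,
                    dimnames = list(c("low", "mid", "high"), c("A", "B")))
line_cols <- c("navy", "darkorange")

# One line per column; xaxt = "n" suppresses the default x-axis
matplot(toy_means, type = "l", lty = 1:2, lwd = 2, col = line_cols,
        xaxt = "n", xlab = "Row level", ylab = "Toy mean")

# Draw the x-axis by hand, labelling positions 1, 2, 3 with the row names
axis(1, at = seq_len(nrow(toy_means)), labels = rownames(toy_means))

legend("topleft", legend = colnames(toy_means),
       lty = 1:2, lwd = 2, col = line_cols)
```

Because matplot() plots the columns against the row index, the tick positions passed to axis() are simply 1 through the number of rows.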
Interpret the line plot from (b). Does it appear that the mean price of diamonds differs for different levels of
color? For different levels of cut? Which levels tend to have higher mean prices?
The following information is used in Questions 5, 6, and 7.
Consider the dataset found at: http://www.isi-stats.com/isi/data/chap3/CollegeMidwest.txt
The data contains two variables gathered from the registrar at a small midwestern college on all students at
the college in spring 2011.
The variables are:
OnCampus: Whether or not a student lives on campus (Y or N)
CumGpa: The student's cumulative GPA.
Note: Since this is data on all students at the college, we will treat the students observed in this data to be
the population.
Consider the difference in mean cumulative GPA
between the students who live off campus and the students who live on campus. Simulate the difference in
sample means from 1000 random samples of size 30.
Hint 1: For consistency, compute $\bar{x}_{\text{on}} - \bar{x}_{\text{off}}$, where $\bar{x}_{\text{on}}$ and $\bar{x}_{\text{off}}$ are the respective sample means for students
who live on and off campus.
Hint 2: The total sample size for each repetition should be 30, not 60.
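One possible shape for this kind of simulation is sketched below on a made-up population. Everything here is an assumption for illustration: the data frame toy_pop, its columns group and value, the helper one_diff(), and the use of replicate(), which may or may not be among the functions covered in the lecture notes.

```r
# Hypothetical population with a two-level group and a numeric value
set.seed(123)
toy_pop <- data.frame(
  group = rep(c("Y", "N"), times = c(600, 400)),
  value = c(rnorm(600, mean = 3.3, sd = 0.4),
            rnorm(400, mean = 3.2, sd = 0.4))
)

# Draw ONE sample of n rows in total, then split it by group
one_diff <- function(pop, n) {
  rows <- sample(nrow(pop), size = n)
  samp <- pop[rows, ]
  mean(samp$value[samp$group == "Y"]) - mean(samp$value[samp$group == "N"])
}

# Repeat the draw many times to approximate the sampling distribution
toy_diffs <- replicate(1000, one_diff(toy_pop, n = 30))
```

Sampling rows first and splitting afterwards keeps the total sample size at n, so the two group sizes vary from one repetition to the next.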
Plot a histogram of the sampling distribution of differences in sample means from part (a). Add vertical lines
that show the differences in sample means that are 2 standard errors away from the mean.
Compute the mean and standard deviation of the simulated distribution of differences in sample means. Use
these values to superimpose a normal curve over the histogram.
Suppose we observe a random sample of size 30 with an observed difference in mean cumulative GPA between
off campus and on campus students to be 0.48. Based on your simulated (approximate) sampling distribution
in part (a), what is the (approximate) probability of observing a difference in sample means greater than
0.48?
Hint: Part (d) does not rely on the Central Limit Theorem.
Suppose we are interested in using a random sample of 30 students to decide if the mean cumulative GPA of
the population of students at the College of the Midwest is different from 3.5 or not. The null and alternative
hypotheses are given by
$H_0: \mu = 3.5$
$H_a: \mu \neq 3.5$
The t.test() function performs one- and two-sample t-tests on vectors of numeric data. The basic syntax for
t.test() is t.test(x, y, alternative, mu, conf.level).
The t.test() function inputs a vector x of values from your sample and conducts a one-sample t-test
for the mean. If a second vector in the argument y is included, t.test() will conduct a two-sample
t-test for a difference in means.
The alternative argument inputs a character value of "two.sided", "greater", or "less", depending
on the alternative hypothesis we are considering. By default, t.test() will conduct a two-sided
hypothesis test (i.e., alternative = "two.sided").
The mu argument inputs a numeric value that specifies the value of the mean parameter $\mu$ under the
null hypothesis. The default value is mu = 0, i.e., the default null hypothesis is $\mu = 0$.
In addition to a hypothesis test, the t.test() function also outputs a confidence interval, with confidence
level set by the conf.level argument. By default, the confidence level is set to conf.level = 0.95.
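As a quick illustration of the syntax, the calls below run t.test() on simulated toy vectors. The names toy_x and toy_y and the null value 9 are made up for this example and are not part of the assignment.

```r
# Hypothetical numeric samples
set.seed(42)
toy_x <- rnorm(25, mean = 10, sd = 2)
toy_y <- rnorm(25, mean = 11, sd = 2)

# One-sample, two-sided test of H0: mu = 9, with a 95% confidence interval
toy_test <- t.test(toy_x, alternative = "two.sided", mu = 9,
                   conf.level = 0.95)
print(toy_test)

# Supplying a second vector y gives a two-sample test for a difference in means
toy_two <- t.test(toy_x, toy_y)
print(toy_two)
```

Printing the result shows the test statistic, degrees of freedom, p-value, and confidence interval in one summary.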
Set the seed to 30 and draw a random sample of size 30 from the CollegeMidwest data.
For the random sample in (a), compute the observed t-statistic $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $s$ is
the sample standard deviation, and $n$ is the sample size. How would you interpret this value?
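To make the arithmetic concrete, here is a small worked example on five made-up numbers. The vector toy_values and the hypothesized mean mu0 = 5 are assumptions for illustration only.

```r
# Hypothetical sample and hypothesized mean
toy_values <- c(4, 6, 8, 10, 12)
mu0 <- 5

x_bar <- mean(toy_values)    # sample mean: 8
s <- sd(toy_values)          # sample standard deviation: sqrt(10)
n <- length(toy_values)      # sample size: 5

# Distance of the sample mean from mu0, measured in standard errors
t_stat <- (x_bar - mu0) / (s / sqrt(n))
t_stat                       # equals 3 / sqrt(2), about 2.12
```

Here the standard error is sqrt(10) / sqrt(5) = sqrt(2), so the sample mean sits about 2.12 standard errors above the hypothesized mean.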
Use the t.test() function to conduct a one-sample t-test to decide if the true mean cumulative GPA of all
the students at the College of the Midwest is different from 3.5. Use a significance level of $\alpha = 0.05$.
What is the mode and class of the output of t.test() from (c)? Use this information to extract the 95%
confidence interval vector from the t.test() output object. Is 3.5 inside this interval? What does this say
about whether the true mean cumulative GPA is 3.5 or not?
Note: The t-test (and t.test()) relies on the normal approximation to the sampling distribution of the
sample mean. When conducting a t-test, it is assumed that the conditions for the Central Limit Theorem are
met.
The confidence level refers to the long-run proportion of random samples (of a fixed size) whose confidence
intervals contain the true population parameter. We want to illustrate this by simulation.
Suppose conducting a survey of students from the College of the Midwest consists of the following steps:
(1) Select a random sample of 30 students.
(2) Compute the mean cumulative GPA for the 30 students in the sample.
(3) Construct a 95% confidence interval for the population mean cumulative GPA.
Set the seed to 24601 and repeat steps 1, 2, and 3 a total of 10000 times. For each random sample, calculate
$\bar{x}$ and construct a 95% confidence interval.
Hint: You can use the t.test() function from the previous question to construct the 95% confidence interval.
Use the full data to compute the true mean cumulative GPA for the population of all students at the College
of the Midwest. Find the proportion of the 10000 confidence intervals that contain the true population mean.
Is this proportion consistent with what you expected?
Create a plot of the first 100 confidence intervals. Be sure that the plot satisfies the following criteria:
The limits of the axes should be large enough to contain the lengths of all of the confidence intervals.
Represent each sample mean by a point and each corresponding confidence interval by a line segment
through the point.
Color the points and intervals to correspond to whether the interval was successful at capturing the
true population mean. In other words, use one color for the intervals that contain the true mean, and
use a different color for the intervals that do not contain the true mean.
Add a straight line that shows the true population mean.
Add a legend that explains the color coding in the plot.
Hint: Use the segments() function. You do not need a for() loop to create this plot.
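The vectorized behavior of segments() can be sketched with made-up intervals, as below. The objects centers, lower, upper, and true_value, the fixed half-width, and the colors are all hypothetical; this only illustrates drawing many segments in one call and is not a template for the assignment's data.

```r
# Hypothetical interval centers and endpoints around a known true value
set.seed(7)
true_value <- 0
centers <- rnorm(20, mean = true_value, sd = 0.5)
half_width <- 0.9
lower <- centers - half_width
upper <- centers + half_width

# One color per interval, chosen by whether it covers the true value
covers <- lower <= true_value & true_value <= upper
interval_col <- ifelse(covers, "darkgreen", "red")

index <- seq_along(centers)
plot(centers, index, pch = 16, col = interval_col,
     xlim = range(lower, upper),
     xlab = "Toy estimate", ylab = "Interval number")

# segments() accepts vectors, so all 20 intervals are drawn in a single call
segments(x0 = lower, y0 = index, x1 = upper, y1 = index, col = interval_col)
abline(v = true_value, lty = 2)
legend("topright", legend = c("Covers true value", "Misses true value"),
       col = c("darkgreen", "red"), lty = 1, pch = 16)
```

Because col is also vectorized, the same color vector can be reused for both the points and the segments.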