Question

Transcribed Text

General Guidelines

Please use R Markdown for your submission. Include the following files: your Rmd file and the compiled/knitted HTML document.

The knitted document should be clear, well-formatted, and contain all relevant R code, output, and explanations. R code style should follow the Tidyverse style guide: https://style.tidyverse.org/

Collaboration must adhere to Level 1 collaboration described in the Stats 20 Collaboration Policy.

Note: All questions on this homework should be done using only functions or syntax discussed in the lecture notes. No credit will be given for use of outside functions.

The following information is used in Questions 1 and 2. The births.csv file on CCLE contains data on a sample of babies born in North Carolina. We are interested in investigating the association between baby weight and the mother's smoking habit (smokers versus non-smokers).

Question 1

(a) Read the dataset into R and save it to the workspace. Verify that the data has been loaded correctly.

(b) The Habit variable contains the smoker status for the mother of each baby in the sample. There are several observations in the data for which there is no information about the mother's smoker status. How many observations are in this category? What is the name of the category (level) in the Habit variable for these observations?

(c) Create a bar plot of the mother's smoker status in the data. Change the color of the bars from the default grey color to a different color of your choice.

(d) The droplevels() function drops unused levels from a factor or from factors in a data frame. Extract the observations in the data for which we know the mother's smoker status (either smoker or non-smoker) and drop the unused level from the Habit variable. Save this subset of the data as a separate data frame in the workspace. The subset of the data should contain all of the variables from the original data.

(e) Using the data frame from (d), create side-by-side boxplots of the weight variable split by the mother's smoker status.

Question 2

Create overlapping relative frequency histograms for the distributions of weight split by the mother's smoker status (either smoker or non-smoker). Be sure that the plot satisfies the following criteria: Change the x-label and main title to be more informative. Use different colors for the two histograms. Change the density of the shading so that the overlap between the histograms is visible. Superimpose density curves of the data, matching the color of the curves to the corresponding histogram. Add vertical lines that show the median weights for each distribution. Add a legend that helps understand the various colors/components of the plot.

Based on the plot, do you think there is a significant difference between the typical weight of a baby born to a mother who smokes and the typical weight of a baby born to a mother who does not smoke?

Hint: For legends with mixed types of symbols (points, lines, boxes, etc.), the pch, lty, density, and border (and other) arguments use NA to exclude those arguments from modifying the corresponding entries in the legend (fill uses 0 instead of NA). For example, a legend with two box entries and one line entry could have arguments density = c(20, 30, NA), border = c(1, 1, NA), and lty = c(NA, NA, 1).

The following information is used in Questions 3 and 4. The diamonds data in the ggplot2 package contains the prices and other measurements of almost 54,000 round cut diamonds.

Question 3

(a) Use scatterplots to explore possible relationships between the four numeric variables carat, depth, table, and price. Based on these plots, which two variables seem to have the strongest relationship? Does this relationship appear to be linear or nonlinear?

(b) Construct a scatterplot between the two numeric variables with the strongest relationship. Put the variable with higher variability on the y-axis.
Change the point character from the default open circle to a different symbol of your choice. Shrink the size of the points to between 10 and 50% of the default size so that the points are easier to distinguish. Set the col argument to the clarity variable (i.e., col = clarity) to color the points in the scatterplot according to the clarity of each diamond. Add a legend that explains the color coding in the plot. Explain why the default colors of 1 to 8 are chosen. Hint: What is the type/class of the clarity column?

(c) Construct the same scatterplot from (b) again, but change the default colors to different colors of your choice. The color of the points should still correspond to the clarity of each diamond. Be sure to update the legend to correspond to your new color scheme. Hint: The 657 built-in color names can be found with the command colors(), or view them here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.

(d) Interpret the scatterplot from (b) or (c). What does the three-way relationship you observe tell you about the diamonds in the data?

Question 4

(a) Compute the mean price for each color and cut combination in the diamonds data, and store the result. The result should be a matrix object, where the rows correspond to the levels of color and the columns correspond to the levels of cut. Hint: This computation can be done with one command.

(b) The matplot() function plots columns of a matrix (or plots columns of one matrix against the columns of another). Use the matplot() function on the output from (a) to create a line plot with a separate line for each level of cut. Distinguish each line by separate line types, line widths, and/or colors of your choice. Use the xaxt = "n" argument to suppress the tick marks and labels on the x-axis, then use the axis() function to set the x-axis labels to be the levels of the color variable. Add a legend that explains the differences in the lines in the plot.

(c) Interpret the line plot from (b).
Does it appear that the mean price of diamonds differs for different levels of color? For different levels of cut? Which levels tend to have higher mean prices?

The following information is used in Questions 5, 6, and 7. Consider the dataset found at: http://www.isi-stats.com/isi/data/chap3/CollegeMidwest.txt

The data contains two variables gathered from the registrar at a small midwestern college on all students at the college in spring 2011. The variables are:

OnCampus: Whether or not a student lives on campus (Y or N)
CumGpa: The student's cumulative GPA.

Note: Since this is data on all students at the college, we will treat the students observed in this data to be the population.

Question 5

(a) Set the seed to 9999, and simulate the sampling distribution of the difference in mean cumulative GPA between the students who live off campus and the students who live on campus. Simulate the difference in sample means from 1000 random samples of size 30.

Hint 1: For consistency, compute $\bar{x}_{\text{on}} - \bar{x}_{\text{off}}$, where $\bar{x}_{\text{on}}$ and $\bar{x}_{\text{off}}$ are the respective sample means for students who live on and off campus.

Hint 2: The total sample size for each repetition should be 30, not 60.

(b) Plot a histogram of the sampling distribution of differences in sample means from part (a). Add vertical lines that show the differences in sample means that are 2 standard errors away from the mean.

(c) Compute the mean and standard deviation of the simulated distribution of differences in sample means. Use these values to superimpose a normal curve over the histogram.

(d) Suppose we observe a random sample of size 30 in which the observed difference in mean cumulative GPA between off campus and on campus students is 0.48. Based on your simulated (approximate) sampling distribution in part (a), what is the (approximate) probability of observing a difference in sample means greater than 0.48? Hint: Part (d) does not rely on the Central Limit Theorem.

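The simulation idea behind Question 5 can be sketched on made-up data. The block below is only an illustration under stated assumptions: the population `pop` is synthetic (it is not the CollegeMidwest data), the helper `diff_in_means()` and the seed are hypothetical, and the base R functions used here may or may not match the ones covered in the course lecture notes.

```r
# Synthetic stand-in population (assumption: NOT the CollegeMidwest data).
set.seed(1)
pop <- data.frame(
  group = rep(c("Y", "N"), times = c(600, 400)),
  value = c(rnorm(600, mean = 3.3, sd = 0.4), rnorm(400, mean = 3.2, sd = 0.5))
)

# Hypothetical helper: draw one sample of n rows in total and return
# the mean of value in group "Y" minus the mean of value in group "N".
# (If a sample happened to contain only one group, the result would be NaN.)
diff_in_means <- function(data, n) {
  samp <- data[sample(nrow(data), size = n), ]
  mean(samp$value[samp$group == "Y"]) - mean(samp$value[samp$group == "N"])
}

# Approximate sampling distribution from 1000 repeated samples of size 30.
diffs <- replicate(1000, diff_in_means(pop, n = 30))

# Center and spread of the simulated differences.
center <- mean(diffs)
std_err <- sd(diffs)

# Histogram on the density scale, with lines 2 standard errors from the mean
# and a normal curve using the simulated mean and standard deviation.
hist(diffs, freq = FALSE, main = "Simulated differences in sample means",
     xlab = "Difference in sample means")
abline(v = center + c(-2, 2) * std_err, lty = 2)
curve(dnorm(x, mean = center, sd = std_err), add = TRUE)

# Simulation-based tail probability; no normal approximation is used here.
mean(diffs > 0.48)
```

Taking the proportion of simulated differences above the threshold is what makes the last step independent of the Central Limit Theorem: it reads the probability directly off the simulated values rather than off the fitted normal curve.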
Question 6

Suppose we are interested in using a random sample of 30 students to decide if the mean cumulative GPA of the population of students at the College of the Midwest is different from 3.5 or not. The null and alternative hypotheses are given by

$H_0: \mu = 3.5$
$H_a: \mu \neq 3.5$

The t.test() function performs one and two sample t-tests on vectors of numeric data. The basic syntax for t.test() is t.test(x, y, alternative, mu, conf.level).

The t.test() function inputs a vector x of values from your sample and conducts a one-sample t-test for the mean. If a second vector in the argument y is included, t.test() will conduct a two-sample t-test for a difference in means.

The alternative argument inputs a character value of "two.sided", "greater", or "less", depending on the alternative hypothesis we are considering. By default, t.test() will conduct a two-sided hypothesis test (i.e., alternative = "two.sided").

The mu argument inputs a numeric value that specifies the value of the mean parameter $\mu$ under the null hypothesis. The default value is mu = 0, i.e., the default null hypothesis is $\mu = 0$.

In addition to a hypothesis test, the t.test() function also outputs a confidence interval, with confidence level set by the conf.level argument. By default, the confidence level is set to conf.level = 0.95.

(a) Set the seed to 30 and draw a random sample of size 30 from the CollegeMidwest data.

(b) For the random sample in (a), compute the observed t-statistic $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. How would you interpret this value?

(c) Use the t.test() function to conduct a one-sample t-test to decide if the true mean cumulative GPA of all the students at the College of the Midwest is different from 3.5. Use a significance level of $\alpha = 0.05$.

(d) What is the mode and class of the output of t.test() from (c)? Use this information to extract the 95% confidence interval vector from the t.test() output object. Is 3.5 inside this interval? What does this say about whether the true mean cumulative GPA is 3.5 or not?

Note: The t-test (and t.test()) relies on the normal approximation to the sampling distribution of the sample mean. When conducting a t-test, it is assumed that the conditions for the Central Limit Theorem are satisfied.

Question 7

The confidence level refers to the long-run proportion of random samples (of a fixed size) whose confidence intervals contain the true population parameter. We want to illustrate this by simulation.

(a) Suppose conducting a survey of students from the College of the Midwest consists of the following steps: (1) Select a random sample of 30 students. (2) Compute the mean cumulative GPA for the 30 students in the sample. (3) Construct a 95% confidence interval for the population mean cumulative GPA. Set the seed to 24601 and repeat steps 1, 2, and 3 a total of 10000 times. For each random sample, calculate $\bar{x}$ and construct a 95% confidence interval. Hint: You can use the t.test() function from the previous question to construct the 95% confidence interval.

(b) Use the full data to compute the true mean cumulative GPA for the population of all students at the College of the Midwest. Find the proportion of the 10000 confidence intervals that contain the true population mean. Is this proportion consistent with what you expected?

(c) Create a plot of the first 100 confidence intervals. Be sure that the plot satisfies the following criteria: The limits of the axes should be large enough to contain the lengths of all of the confidence intervals. Represent each sample mean by a point and each corresponding confidence interval by a line segment through the point. Color the points and intervals to correspond to whether the interval was successful at capturing the true population mean.
In other words, use one color for the intervals that contain the true mean, and use a different color for the intervals that do not contain the true mean. Add a straight line that shows the true population mean. Add a legend that explains the color coding in the plot. Hint: Use the segments() function. You do not need a for() loop to create this plot.
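The t.test() interface described in Question 6 and the coverage idea in Question 7 can be illustrated on synthetic data. This is a minimal sketch under stated assumptions: the population `pop_values`, its parameters, the seed, the 200 repetitions, and the color choices are all made up for illustration and are not the assignment's data or settings.

```r
# Synthetic stand-in population (assumption: NOT the CollegeMidwest data).
set.seed(2)
pop_values <- rnorm(2000, mean = 3.3, sd = 0.4)
true_mean <- mean(pop_values)

# One sample of size 30 and a two-sided one-sample t-test of H0: mu = 3.5.
samp <- sample(pop_values, size = 30)
result <- t.test(samp, alternative = "two.sided", mu = 3.5, conf.level = 0.95)

# The t statistic computed by hand agrees with the one stored in the result.
t_by_hand <- (mean(samp) - 3.5) / (sd(samp) / sqrt(length(samp)))
all.equal(t_by_hand, unname(result$statistic))

# Inspect the returned object; its conf.int element holds the interval limits.
mode(result)
class(result)
result$conf.int

# Repeat the sampling: each column holds lower limit, upper limit, sample mean.
intervals <- replicate(200, {
  s <- sample(pop_values, size = 30)
  c(t.test(s)$conf.int, mean(s))
})
covered <- intervals[1, ] <= true_mean & true_mean <= intervals[2, ]
mean(covered)  # proportion of intervals that contain the population mean

# Draw the first 100 intervals without a loop; segments() is vectorized.
idx <- 1:100
cols <- ifelse(covered[idx], "steelblue", "firebrick")
plot(idx, intervals[3, idx], pch = 16, col = cols,
     ylim = range(intervals[1:2, idx]),
     xlab = "Sample", ylab = "Sample mean and 95% confidence interval")
segments(x0 = idx, y0 = intervals[1, idx], x1 = idx, y1 = intervals[2, idx],
         col = cols)
abline(h = true_mean, lty = 2)
legend("topright", legend = c("Contains mean", "Misses mean"),
       col = c("steelblue", "firebrick"), lty = 1, pch = 16)
```

Storing the limits and the sample mean together in one matrix keeps the coverage check and the plot vectorized, which is why no for() loop is needed for the drawing step.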
