 # R Programming Problems

## Transcribed Text

### Exercise 1

Load the ncaa2018.csv data set and create histograms, QQ-norm and box-whisker plots for ELO. Add a title to each plot, identifying the data.

Part b. A common recommendation to address issues of non-normality is to transform data to correct for skewness. One common transformation is the log transform. Transform ELO to log(ELO) and produce histograms, box-whisker and qqnorm plots of the transformed values. Are the transformed values more or less skewed than the original? (Note: the log transform is used to correct skewness; it is less useful for correcting kurtosis.)

### Exercise 3

We will create a series of graphs illustrating how the Poisson distribution approaches the normal distribution with large λ. We will iterate over a sequence of lambda, from 2 to 64, doubling lambda each time. For each lambda, draw 1000 samples from the Poisson distribution. Calculate the skewness of each set of samples, and produce histograms, QQ-norm and box-whisker plots. You can use par(mfrow=c(1,3)) to display all three for one lambda in one line. Add lambda=## to the title of the histogram, and skewness=## to the title of the box-whisker plot.

Part b. Remember that lambda represents the mean of a discrete (counting) variable. At what size mean is Poisson data no longer skewed, relative to normally distributed data? You might run this 2 or 3 times, with different seeds; this number varies in my experience.

If you do this in SAS, create a data table with data columns each representing a different µ. You can see combined histogram, box-whisker and QQ-norm plots, for all columns, by calling

    proc univariate data=Distributions plot;
    run;

At what µ is skewness of the Poisson distribution small enough to be considered normal?

### Exercise 4

Part a. Write a function that accepts a vector vec, a vector of integers, a main axis label and an x axis label. This function should

1. iterate over each element i in the vector of integers
2. produce a histogram for vec, setting the number of bins in the histogram to i
3. label main and x-axis with the specified parameters
4. label the y-axis to read Frequency, bins = and the number of bins.

Hint: You can simplify this function by using the parameter ... (see ?plot or ?hist).

Part b. Test your function with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your function with something like

    plot.histograms(hidalgo.dat[,1], c(12,36,60), main="1872 Hidalgo issue", xlab="Thickness (mm)")

to plot three different histograms of the hidalgo data set.

If you do this in SAS, write a macro that accepts a table name, a column name, a list of integers, a main axis label and an x axis label. This macro should scan over each element in the list of integers and produce a histogram for each integer value, setting the bin count to the element in the input list, and labeling main and x-axis with the specified parameters. You should label the y-axis to read Frequency, bins = and the number of bins.

Test your macro with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your macro with something like

    %plot_histograms(hidalgo, y, 12 36 60, main="1872 Hidalgo issue", xlabel="Thickness (mm)");

to plot three different histograms of the hidalgo data set.

Hint: Assume 12 36 60 resolve to a single macro parameter and use %scan. Your macro definition can look something like

    %macro plot_histograms(table_name, column_name, number_of_bins, main="Main", xlabel="X Label")

Data. The hidalgo data set is in the file hidalgo.dat. These data consist of paper thickness measurements of stamps from the 1872 Hidalgo issue of Mexico. This data set is commonly used to illustrate methods of determining the number of components in a mixture (in this case, different batches of paper). Some analyses suggest there are three different mixtures of paper used to produce the 1872 Hidalgo issue; other analyses suggest seven.
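For Exercise 4, Part a, one possible shape of such a function is sketched below, assuming base R graphics. The parameter name `bins` is my own choice, and because hidalgo.dat is not reproduced here, the example call uses simulated thickness values as a stand-in (an assumption, not the real measurements).

```r
# Sketch only: one way to meet the four requirements of Part a.
# `bins` is a hypothetical parameter name; the exercise does not name it.
plot.histograms <- function(vec, bins, ...) {
  for (i in bins) {
    # hist() treats a single number for `breaks` as a suggestion only, so
    # pass explicit break points to get exactly i equal-width bins.
    brk <- seq(min(vec), max(vec), length.out = i + 1)
    # main and xlab (and any other hist() arguments) arrive through `...`
    hist(vec, breaks = brk, ylab = paste("Frequency, bins =", i), ...)
  }
  invisible(bins)
}

# Simulated stand-in data, NOT the hidalgo measurements.
set.seed(1)
thickness <- c(rnorm(300, 0.072, 0.004), rnorm(180, 0.080, 0.003))

pdf(NULL)  # discard graphics output; remove this line to see the plots
plot.histograms(thickness, c(12, 36, 60),
                main = "Simulated stand-in data", xlab = "Thickness (mm)")
invisible(dev.off())
```

Passing explicit break points is a design choice: with `breaks = i`, hist() may choose a "pretty" number of bins that differs from the requested one, which would make the y-axis label misleading.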
Why do you think there might be disagreement about the number of mixtures?

### Exercise 5

We've been working with data from Wansink and Payne.

Table 1: Reproducing part of Wansink Table 1 (SD)

However, in Homework 2, we also considered the value given in the text:

The resulting increase of 168.8 calories (from 268.1 calories . . . to 436.9 calories . . . ) represents a 63.0% increase . . . in calories per serving.

There is a discrepancy between two values reported for calories per serving, 2006. We will use graphs to attempt to determine which value is most consistent.

First, consider the relationship between Calories per Serving and Calories per Recipe:

Calories per Serving = Calories per Recipe / Servings per Recipe

Since Servings per Recipe is effectively constant over time (12.4-13.0), we can assume the relationship between Calories per Serving and Calories per Recipe is linear,

Calories per Serving = β0 + β1 × Calories per Recipe

with Servings per Recipe = 1/β1.

We will fit a linear model, with Calories per Recipe as the independent variable against two sets of values for Calories per Serving, such that

• Assumption 1. The value in the table (384.4) is correct.
• Assumption 2. The value in the text (436.9) is correct.
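The figures quoted from the text can be checked with a little arithmetic. The sketch below does this in R; the helper name `pct_increase` and the variable names are hypothetical, and 268.1 is taken as the starting value quoted in the text.

```r
# Hypothetical helper: percent increase from `from` to `to`.
pct_increase <- function(from, to) 100 * (to - from) / from

base_value <- 268.1   # starting calories per serving quoted in the text
table_2006 <- 384.4   # 2006 value under Assumption 1 (table)
text_2006  <- 436.9   # 2006 value under Assumption 2 (text)

text_2006 - base_value                          # increase under the text value
round(pct_increase(base_value, text_2006), 1)   # percent increase, text value
round(pct_increase(base_value, table_2006), 1)  # percent increase, table value
```

The text's 168.8 calories and 63.0% agree with each other and with 436.9, while the table value of 384.4 would correspond to an increase of about 43.4%. This shows the text is internally consistent, but it does not by itself say which 2006 value is right.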
We use the data:

| Measure | 1936 | 1946 | 1951 | 1963 | 1975 | 1997 | 2006 |
|---|---|---|---|---|---|---|---|
| calories per recipe (SD) | 2123.8 (1050.0) | 2122.3 (1002.3) | 2089.9 (1009.6) | 2250.0 (1078.6) | 2234.2 (1089.2) | 2249.6 (1094.8) | 3051.9 (1496.2) |
| calories per serving (SD) | 268.1 (124.8) | 271.1 (124.2) | 280.9 (116.2) | 294.7 (117.7) | 285.6 (118.3) | 288.6 (122.0) | 384.4 (168.3) |
| servings per recipe (SD) | 12.9 (13.3) | 12.9 (13.3) | 13.0 (14.5) | 12.7 (14.6) | 12.4 (14.3) | 12.4 (14.3) | 12.7 (13.0) |

    Assumptions.dat <- data.frame(
      CaloriesPerRecipe = c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9),
      Assumption1 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 384.4),
      Assumption2 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 436.9))

and fit linear models:

    Assumption1.lm <- lm(Assumption1 ~ CaloriesPerRecipe, data=Assumptions.dat)
    Assumption2.lm <- lm(Assumption2 ~ CaloriesPerRecipe, data=Assumptions.dat)
    summary(Assumption1.lm)

    ##
    ## Call:
    ## lm(formula = Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat)
    ##
    ## Residuals:
    ##       1       2       3       4       5       6       7
    ## -7.0238 -3.8475  9.7610  4.7417 -2.5010 -1.3112  0.1808
    ##
    ## Coefficients:
    ##                    Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)       25.477429  17.351550   1.468    0.202
    ## CaloriesPerRecipe  0.117547   0.007466  15.745 1.88e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 6.163 on 5 degrees of freedom
    ## Multiple R-squared:  0.9802, Adjusted R-squared:  0.9763
    ## F-statistic: 247.9 on 1 and 5 DF,  p-value: 1.879e-05

    summary(Assumption2.lm)

    ##
    ## Call:
    ## lm(formula = Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat)
    ##
    ## Residuals:
    ##       1       2       3       4       5       6       7
    ## -4.1798 -0.9169 14.5608  0.3051 -6.0261 -5.7248  1.9817
    ##
    ## Coefficients:
    ##                     Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)       -99.891018  21.933161  -4.554  0.00609 **
    ## CaloriesPerRecipe   0.175238   0.009437  18.569 8.34e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 7.79 on 5 degrees of freedom
    ## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9828
    ## F-statistic: 344.8 on 1 and 5 DF,  p-value: 8.336e-06

Part a. Plot the regression. Use points to plot Assumption1 vs CaloriesPerRecipe, and Assumption2 vs CaloriesPerRecipe, on the same graph. Add lines (i.e. abline) to show the fit from the regression. Use different colors for the two assumptions. Which of the two lines appears to best explain the data?

Part b. Produce diagnostic plots of the residuals from both linear models (in R, use residuals(Assumption1.lm)). qqnorm or box-whisker plots will probably be the most effective; there are too few points for a histogram. Use the code below to place two plots, side by side. You can produce more than one pair of plots, if you wish.

    par(mfrow=c(1,2))

From these plots, which assumption is most likely correct? That is, which assumption produces a linear model that least violates assumptions of normality of the residual errors? Which assumption produces outliers in the residuals?

I've included similar data and linear models for SAS in the SAS template. If you choose SAS, you will need to modify the PROC GLM code to produce the appropriate diagnostic plots.
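As a rough sketch of how Parts a and b of Exercise 5 might be approached with base graphics, the block below refits both models from the Assumptions.dat values given in the exercise and draws the requested plots. The colors, plotting symbols, legend, and titles are my own choices, not part of the exercise.

```r
Assumptions.dat <- data.frame(
  CaloriesPerRecipe = c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9),
  Assumption1 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 384.4),
  Assumption2 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 436.9))
Assumption1.lm <- lm(Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat)
Assumption2.lm <- lm(Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat)

pdf(NULL)  # discard graphics output; remove this line to see the plots

# Part a: both sets of points and both fitted lines on one graph.
# Assumption2 is plotted first so the y-axis range covers 436.9.
plot(Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat,
     col = "red", pch = 17, ylab = "Calories per Serving",
     main = "Calories per Serving vs. Calories per Recipe")
points(Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat,
       col = "blue", pch = 16)
abline(Assumption1.lm, col = "blue")
abline(Assumption2.lm, col = "red")
legend("topleft", legend = c("Assumption 1", "Assumption 2"),
       col = c("blue", "red"), pch = c(16, 17), lty = 1)

# Part b: QQ-norm plots of the residuals, side by side.
par(mfrow = c(1, 2))
r1 <- residuals(Assumption1.lm)
r2 <- residuals(Assumption2.lm)
qqnorm(r1, main = "Residuals, Assumption 1"); qqline(r1)
qqnorm(r2, main = "Residuals, Assumption 2"); qqline(r2)

invisible(dev.off())
```

Replacing the two qqnorm calls with boxplot(r1) and boxplot(r2) gives the box-whisker version of the same comparison.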

## Solution Preview


# 1.

# install.packages("moments")
# library(moments)

ncaa2018 <- read.csv("ncaa2018.csv", header=TRUE)
names(ncaa2018) # 766 x 6
ELO <- ncaa2018$ELO
par(mfrow=c(2,3))
hist(ELO, main="Histogram of ELO")
qqnorm(ELO, main="Quantile-Quantile Plot for ELO")
qqline(ELO, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75), qtype = 7)
boxplot(ELO, main="Boxplot of ELO", xlab="ELO")

# part b
logELO <- log(ELO)
hist(logELO, main="Histogram of log(ELO)", xlab="log(ELO)")
qqnorm(logELO, main="Quantile-Quantile Plot for log(ELO)")
qqline(logELO, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75), qtype = 7)
boxplot(logELO, main="Boxplot of log(ELO)", xlab="log(ELO)")

# The log transformation of ELO does not seem to make it look more like a
# normal distribution. The histograms, QQ plots, and boxplots all show very
# similar distributions. Using a different base for the logarithm instead of e
# would not change this, because it only rescales the transformed values....
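The visual comparison can be supplemented with a number. The sketch below computes a moment-based sample skewness without the moments package (the helper `sample_skewness` is hypothetical) and applies it to simulated right-skewed values that stand in for ELO, since ncaa2018.csv is not reproduced here. It also illustrates that the base of the logarithm does not affect skewness, because changing the base only multiplies every value by a constant.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
# Simulated stand-in values, NOT the ELO column from ncaa2018.csv.
x <- rlnorm(500, meanlog = 7, sdlog = 0.4)

sample_skewness(x)         # skewness on the original scale
sample_skewness(log(x))    # skewness after the natural-log transform
sample_skewness(log10(x))  # matches log(x): the base only rescales the values
```

With the real data, the same three calls on ELO and logELO would show whether the transform moves the skewness toward zero.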