## Transcribed Text

Exercise 1.
Load the ncaa2018.csv data set and create histograms, QQ-norm and box-whisker plots for ELO. Add a title
to each plot, identifying the data.
Part b
A common recommendation to address issues of non-normality is to transform data to correct for skewness.
One common transformation is the log transform.
Transform ELO to log(ELO) and produce histograms, box-whisker and qqnorm plots of the transformed values.
Are the transformed values more or less skewed than the original? (Note: the log transform is used to correct
skewness; it is less useful for correcting kurtosis.)
Exercise 3.
We will create a series of graphs illustrating how the Poisson distribution approaches the normal distribution
with large λ. We will iterate over a sequence of lambda, from 2 to 64, doubling lambda each time. For each
‘lambda’ draw 1000 samples from the Poisson distribution.
Calculate the skewness of each set of samples, and produce histograms, QQ-norm and box-whisker plots. You
can use par(mfrow=c(1,3)) to display all three for one lambda in one line. Add lambda=## to the title of
the histogram, and skewness=## to the title of the box-whisker plot.
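
One way to organize the loop described above is sketched below in base R. The helper `sample_skewness` and the seed value are assumptions made for illustration (the `moments` package offers a comparable `skewness` function); treat this as a sketch rather than the expected solution.

```r
# Moment-based sample skewness, m3 / m2^(3/2).
# Hypothetical helper, not part of the assignment.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)                      # arbitrary seed; change it to see how results vary
lambdas <- 2^(1:6)               # 2, 4, 8, 16, 32, 64
skews <- numeric(length(lambdas))

for (k in seq_along(lambdas)) {
  lambda <- lambdas[k]
  x <- rpois(1000, lambda)       # 1000 draws from Poisson(lambda)
  skews[k] <- sample_skewness(x)

  par(mfrow = c(1, 3))           # histogram, QQ-norm and boxplot in one row
  hist(x, main = paste0("Poisson samples, lambda=", lambda))
  qqnorm(x)
  qqline(x)
  boxplot(x, main = paste0("skewness=", round(skews[k], 3)))
}
```

For reference, the theoretical skewness of a Poisson distribution is 1/sqrt(λ), so the sample values should shrink toward zero as lambda doubles.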
Part b.
Remember that lambda represents the mean of a discrete (counting) variable. At what size mean is Poisson
data no longer skewed, relative to normally distributed data? You might run this 2 or 3 times, with different
seeds; this number varies in my experience.
par(mfrow=c(1,3))
If you do this in SAS, create a data table with data columns each representing a different µ. You can see
combined histogram, box-whisker and QQ-norm, for all columns, by calling
proc univariate data=Distributions plot;
run;
At what µ is the skewness of the Poisson distribution small enough to be considered normal?
Exercise 4
Part a
Write a function that accepts a vector vec, a vector of integers, a main axis label and an x axis label. This
function should
1. iterate over each element i in the vector of integers,
2. produce a histogram for vec, setting the number of bins in the histogram to i,
3. label main and x-axis with the specified parameters, and
4. label the y-axis to read Frequency, bins = and the number of bins.
Hint: You can simplify this function by using the parameter ... - see ?plot or ?hist
Part b
Test your function with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be
able to call your function with something like
plot.histograms(hidalgo.dat[,1],c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")
to plot three different histograms of the hidalgo data set.
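
A minimal sketch of such a function follows. It assumes equally spaced break points are an acceptable way to force the bin count, since `hist()` treats a single number passed as `breaks` only as a suggestion. The simulated `fake_thickness` vector is a stand-in so the sketch runs without `hidalgo.dat`, and returning the histogram objects is an illustrative extra rather than a requirement of the exercise.

```r
# Sketch of a histogram helper; extra arguments in ... are passed on to hist().
plot.histograms <- function(vec, bins, main = "", xlab = "", ...) {
  out <- list()
  for (i in bins) {
    # i + 1 equally spaced break points give exactly i bins.
    brk <- seq(min(vec), max(vec), length.out = i + 1)
    out[[as.character(i)]] <- hist(vec, breaks = brk, main = main, xlab = xlab,
                                   ylab = paste("Frequency, bins =", i), ...)
  }
  invisible(out)                 # histogram objects, returned for inspection
}

# Stand-in data (assumption): simulated values, not the Hidalgo measurements.
set.seed(1)
fake_thickness <- rnorm(200, mean = 0.08, sd = 0.01)

par(mfrow = c(1, 3))
res <- plot.histograms(fake_thickness, c(12, 36, 60),
                       main = "Simulated data", xlab = "Thickness (mm)")
```

Following the hint, `main` and `xlab` could also be dropped from the signature and passed through `...` alone, which shortens the function further.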
If you do this in SAS, write a macro that accepts a table name, a column name, a list of integers, a main axis
label and an x axis label. This macro should scan over each element in the list of integers and produce a
histogram for each integer value, setting the bin count to the element in the input list, and labeling main
and x-axis with the specified parameters. You should label the y-axis to read Frequency, bins = and the
number of bins.
Test your macro with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be
able to call your macro with something like
%plot_histograms(hidalgo, y, 12 36 60, main="1872 Hidalgo issue", xlabel="Thickness (mm)");
to plot three different histograms of the hidalgo data set.
Hint: Assume 12 36 60 resolve to a single macro parameter and use %scan. Your macro definition can look
something like
%macro plot_histograms(table_name, column_name, number_of_bins, main="Main", xlabel="X Label")
Data
The hidalgo data set is in the file hidalgo.dat. These data consist of paper thickness measurements of
stamps from the 1872 Hidalgo issue of Mexico. This data set is commonly used to illustrate methods
of determining the number of components in a mixture (in this case, different batches of paper).
Some analyses suggest there are three different mixtures of paper used to produce the 1872 Hidalgo issue;
other analyses suggest seven. Why do you think there might be disagreement about the number of mixtures?
Exercise 5.
We’ve been working with data from Wansink and Payne, Table 1:
Reproducing part of Wansink Table 1
However, in Homework 2, we also considered the value given in the text:
The resulting increase of 168.8 calories (from 268.1 calories . . . to 436.9 calories . . . ) represents
a 63.0% increase . . . in calories per serving.
There is a discrepancy between two values reported for calories per serving, 2006. We will use graphs to
attempt to determine which value is most consistent.
First, consider the relationship between Calories per Serving and Calories per Recipe:
Calories per Serving = Calories per Recipe / Servings per Recipe
Since Servings per Recipe is effectively constant over time (12.4-13.0), we can assume the relationship
between Calories per Serving and Calories per Recipe is linear,
Calories per Serving = β0 + β1 × Calories per Recipe
with Servings per Recipe = 1/β1
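
The link between the slope and Servings per Recipe can be checked on idealized data. The sketch below assumes a hypothetical constant of 12.7 servings per recipe and builds exact Calories per Serving values from the recipe values used in this exercise. The names `recipe`, `servings` and `serving` are illustrative only.

```r
recipe <- c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9)
servings <- 12.7                 # hypothetical constant servings per recipe
serving <- recipe / servings     # exact ratio, no noise

fit <- lm(serving ~ recipe)
beta <- unname(coef(fit))        # beta[1] = intercept, beta[2] = slope
implied_servings <- 1 / beta[2]  # recovers 12.7 up to rounding error

print(c(intercept = beta[1], slope = beta[2],
        implied_servings = implied_servings))
```

With real data the fitted intercept is not exactly zero, so 1/β1 is only an approximation to Servings per Recipe.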
We will fit a linear model, with Calories per Recipe as the independent variable against two sets of values
for Calories per Serving, such that
• Assumption 1. The value in the table (384.4) is correct.
• Assumption 2. The value in the text (436.9) is correct.
We use the data:
Assumptions.dat <- data.frame(
CaloriesPerRecipe = c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9),
Assumption1 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 384.4),
Assumption2 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 436.9))
| Measure | 1936 | 1946 | 1951 | 1963 | 1975 | 1997 | 2006 |
|---|---|---|---|---|---|---|---|
| calories per recipe (SD) | 2123.8 (1050.0) | 2122.3 (1002.3) | 2089.9 (1009.6) | 2250.0 (1078.6) | 2234.2 (1089.2) | 2249.6 (1094.8) | 3051.9 (1496.2) |
| calories per serving (SD) | 268.1 (124.8) | 271.1 (124.2) | 280.9 (116.2) | 294.7 (117.7) | 285.6 (118.3) | 288.6 (122.0) | 384.4 (168.3) |
| servings per recipe (SD) | 12.9 (13.3) | 12.9 (13.3) | 13.0 (14.5) | 12.7 (14.6) | 12.4 (14.3) | 12.4 (14.3) | 12.7 (13.0) |
and fit linear models
Assumption1.lm <- lm(Assumption1 ~ CaloriesPerRecipe, data=Assumptions.dat)
Assumption2.lm <- lm(Assumption2 ~ CaloriesPerRecipe, data=Assumptions.dat)
summary(Assumption1.lm)
##
## Call:
## lm(formula = Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat)
##
## Residuals:
## 1 2 3 4 5 6 7
## -7.0238 -3.8475 9.7610 4.7417 -2.5010 -1.3112 0.1808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.477429 17.351550 1.468 0.202
## CaloriesPerRecipe 0.117547 0.007466 15.745 1.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.163 on 5 degrees of freedom
## Multiple R-squared: 0.9802, Adjusted R-squared: 0.9763
## F-statistic: 247.9 on 1 and 5 DF, p-value: 1.879e-05
summary(Assumption2.lm)
##
## Call:
## lm(formula = Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat)
##
## Residuals:
## 1 2 3 4 5 6 7
## -4.1798 -0.9169 14.5608 0.3051 -6.0261 -5.7248 1.9817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -99.891018 21.933161 -4.554 0.00609 **
## CaloriesPerRecipe 0.175238 0.009437 18.569 8.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.79 on 5 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9828
## F-statistic: 344.8 on 1 and 5 DF, p-value: 8.336e-06
Part a.
Plot the regression. Use points to plot Assumption1 vs CaloriesPerRecipe, and Assumption2 vs
CaloriesPerRecipe, on the same graph. Add lines (i.e. abline) to show the fit from the regression. Use
different colors for the two assumptions. Which of the two lines appears to best explain the data?
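
The plot described in Part a might be set up as in the following sketch. The colors, plotting symbols, axis labels and legend text are arbitrary choices, not requirements of the exercise. The data frame and models are repeated only so the block runs on its own.

```r
Assumptions.dat <- data.frame(
  CaloriesPerRecipe = c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9),
  Assumption1 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 384.4),
  Assumption2 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 436.9))
Assumption1.lm <- lm(Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat)
Assumption2.lm <- lm(Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat)

# Empty plot with a y-range wide enough for both response columns.
with(Assumptions.dat,
     plot(CaloriesPerRecipe, Assumption1, type = "n",
          ylim = range(Assumption1, Assumption2),
          xlab = "Calories per Recipe", ylab = "Calories per Serving",
          main = "Calories per Serving vs Calories per Recipe"))

# One color per assumption, for both the points and the fitted line.
with(Assumptions.dat, points(CaloriesPerRecipe, Assumption1, col = "blue", pch = 16))
with(Assumptions.dat, points(CaloriesPerRecipe, Assumption2, col = "red", pch = 1))
abline(Assumption1.lm, col = "blue")
abline(Assumption2.lm, col = "red")
legend("topleft", legend = c("Assumption 1 (384.4)", "Assumption 2 (436.9)"),
       col = c("blue", "red"), pch = c(16, 1), lty = 1)
```

Only the 2006 point differs between the two series, so the first six points overlap exactly.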
Part b.
Produce diagnostic plots of the residuals from both linear models (in R, use residuals(Assumption1.lm)).
qqnorm or box-whisker plots will probably be the most effective; there are too few points for a histogram.
Use the code below to place two plots, side by side. You can produce more than one pair of plots, if you wish.
par(mfrow=c(1,2))
From these plots, which assumption is most likely correct? That is, which assumption produces a linear model
that least violates assumptions of normality of the residual errors? Which assumption produces outliers in
the residuals?
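
A sketch of the side-by-side residual plots is given below, using QQ-norm plots for the first pair and box-whisker plots for the second. The plot titles and the names `res1` and `res2` are illustrative. The data frame and models are repeated so the block is self-contained.

```r
Assumptions.dat <- data.frame(
  CaloriesPerRecipe = c(2123.8, 2122.3, 2089.9, 2250.0, 2234.2, 2249.6, 3051.9),
  Assumption1 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 384.4),
  Assumption2 = c(268.1, 271.1, 280.9, 294.7, 285.6, 288.6, 436.9))
Assumption1.lm <- lm(Assumption1 ~ CaloriesPerRecipe, data = Assumptions.dat)
Assumption2.lm <- lm(Assumption2 ~ CaloriesPerRecipe, data = Assumptions.dat)

res1 <- residuals(Assumption1.lm)
res2 <- residuals(Assumption2.lm)

# First pair: normal QQ plots with reference lines.
par(mfrow = c(1, 2))
qqnorm(res1, main = "QQ-norm, Assumption 1 residuals")
qqline(res1)
qqnorm(res2, main = "QQ-norm, Assumption 2 residuals")
qqline(res2)

# Second pair: box-whisker plots; points beyond the whiskers are flagged as outliers.
par(mfrow = c(1, 2))
boxplot(res1, main = "Assumption 1 residuals")
boxplot(res2, main = "Assumption 2 residuals")
```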
I’ve included similar data and linear models for SAS in the SAS template. If you choose SAS, you will need
to modify the PROC GLM code to produce the appropriate diagnostic plots.


# 1.

# install.packages("moments")

# library(moments)

ncaa2018 <- read.csv("ncaa2018.csv", header=TRUE)

names(ncaa2018)  # column names

dim(ncaa2018)    # 766 x 6

ELO <- ncaa2018$ELO

par(mfrow=c(2,3))

hist(ELO, main="Histogram of ELO")

qqnorm(ELO, main="Quantile-Quantile Plot for ELO")

qqline(ELO, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75), qtype = 7)

boxplot(ELO, main="Boxplot of ELO", xlab="ELO")

# part b

logELO <- log(ELO)

hist(logELO, main="Histogram of log(ELO)", xlab="log(ELO)")

qqnorm(logELO, main="Quantile-Quantile Plot for log(ELO) ")

qqline(logELO, datax = FALSE, distribution = qnorm, probs = c(0.25, 0.75), qtype = 7)

boxplot(logELO, main="Boxplot of log(ELO)", xlab="log(ELO)")

# The log transformation of ELO does not seem to make it look more like a
# normal distribution. The histograms, QQ plots, and boxplots all show very
# similar distributions before and after the transform. Using a different base
# for the logarithm would not help, because it only rescales log(ELO) by a
# constant and leaves the shape of the distribution unchanged.
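
The visual comparison can be backed by a number. The sketch below defines a moment-based skewness helper (`sample_skewness` is a hypothetical name; `skewness` in the `moments` package is a comparable function) and applies it to simulated right-skewed stand-in data, since `ncaa2018.csv` is not bundled here. It also checks that the base of the logarithm has no effect on skewness.

```r
# Moment-based sample skewness, m3 / m2^(3/2) (hypothetical helper).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Stand-in data (assumption): simulated log-normal values, not the NCAA ratings.
set.seed(1)
x <- rlnorm(500, meanlog = 7, sdlog = 0.5)

skew_raw <- sample_skewness(x)
skew_log <- sample_skewness(log(x))
print(c(original = skew_raw, logged = skew_log))

# log10(x) = log(x) / log(10) is a positive rescaling, so skewness is unchanged.
same_across_bases <- isTRUE(all.equal(skew_log, sample_skewness(log10(x))))
print(same_across_bases)

# With the real data the comparison would be:
# c(original = sample_skewness(ELO), logged = sample_skewness(logELO))
```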