**Subject Mathematics Statistics-R Programming**

## Question

The function `CLT.demo` in the chunk at the top of this document provides code that shows the sampling distribution of the sample average for samples of various sizes. It needs an argument named `PMF`, which is a vector of values giving the probability of observing a 0, a 1, a 2, etc.

a) In the R chunk below, I have defined `shape` to be a vector of 15 ones. Replace these 1s with integers to define the shape of a custom PMF of your choosing. Try to make the shape weird! Once that is done, convert the shape into a PMF and left-arrow it into a vector named `PMF` and give a barplot of its distribution.

```{r Q1a}

shape <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

PMF <- shape/sum(shape)

barplot(PMF,names.arg=0:14)

```

b) Using `sample`, generate two random samples from this PMF (the first argument to `sample` will be `0:14`): one with a sample size of 25, another with a sample size of 5000. Provide histograms of each (adding the argument `breaks=-1:14` to each call of `hist`). Notice how in the smaller sample, the histogram has only a passing resemblance to the barplot of the PMF, but the histogram of the larger sample looks a lot like the PMF. **When we add more data to a sample, the histogram of its values looks more and more like the probability distribution generating the data.**

```{r Q1b}

sample1 <- sample(0:14,size=25,replace=TRUE,prob=PMF)

hist(sample1,breaks=-1:14)

```

c) Run `CLT.demo(PMF)` (in the global environment once you read in the .RData file). Note that the sampling distribution of the possible sample averages that we might measure when we go collect data looks more and more like a Normal distribution as we consider samples of larger sizes. **When we consider samples with progressively larger sizes, the histogram of the possible averages we might measure from them looks more and more like a Normal curve.**

```{r Q1c}

```

******************

**Question 2** Returns have long been the nemesis of many retail brands. When a product is returned or exchanged, not only does the retailer experience incremental supply chain costs, but often the item cannot be resold at the original price owing to damage, wear and tear, or obsolescence/devaluation given the passage of time — particularly an issue with fashion or seasonal merchandise.

Consider the data in `RETURNS`, which contains information on the return rates for 19 different products sold by an online retailer (the products are given codenames to maintain company privacy, but you can think of them as products such as Rolex Watches, Microwaves, Dresses, etc.). Each row gives the percentage of that type of product sold in a particular week that was eventually returned.

For the purposes of this problem, we'll assume that *some* probability distribution is generating the observed return rates (though we don't know what it is), and that the return rates in the data are a random sample from this PDF.

```{r Q2 returns data}

hist(RETURNS$ReturnRate)

summary(RETURNS$ReturnRate)

```

a) Report the mean return rate, the standard deviation of the return rates, and the standard error of the average return rate. In the context of this problem, provide a layman's interpretation of each of these quantities.

```{r Q2a}

```

**Response:**

b) One of the most common points of confusion when learning about confidence intervals for the average is the difference between the standard deviation of the data values in our sample and the standard error of the average. Briefly explain the difference between the two.

**Response:**

c) A confidence interval for `mu` (the average of whatever PDF is generating the return rates in the data) looks like "sample average" plus or minus "some number of standard errors", i.e.

$$\bar{x} \pm ? SE$$

How many standard errors do we need to go from the average in the data (plus or minus) to create an interval that has 90% confidence? I want to see the relevant command (it starts with the letter `q`) we use in R to get this number, not just the number itself.

```{r Q2c}

```

d) Manually or with `t.test`, provide a 95% confidence interval for `mu`, the average of the PDF that is generating the observed return rates.

```{r Q2d}

```

e) The average return rate of this company's competitors is 20%. Does the data (specifically, the confidence interval) suggest that the return rate at this company is better than its competitors? Explain.

**Response:**

******************

**Question 3** Different products look to have different average return rates:

```{r Q3}

#Run as is

par(las = 1,mar=c(5,8,4,2)+0.1) ; boxplot( ReturnRate ~ Codename, data=RETURNS, horizontal=TRUE,ylab="",col="skyblue"); par(las = 0,mar=c(5,5,4,2)+0.1)

aggregate(ReturnRate ~ Codename, data=RETURNS, FUN=mean )

```

a) It is desired to compare the average return rate for Codename "Orange" and Codename "Lemon" (21.4% vs. 22.0%). Whenever we compare two groups, the first question we have to ask ourselves is: "are these paired samples or independent samples?" Which type of samples are we dealing with; explain.

**Response:**

b) Produce a 95% confidence interval for the difference in average return rates between "Orange" and "Lemon". Can the data discern a difference in the "true" average return rates between these groups? Why or why not? Note: a subset of the data with just these two groups is provided for you.

```{r Q3b}

TWOGROUPS <- droplevels( subset( RETURNS, Codename %in% c("Lemon","Orange") ) )

```

**Response:**

c) Codename Carmine has the lowest average return rate. Codename Amethyst has the second lowest. Can the data discern a difference in the "true" average return rates between these groups? Explain.

```{r Q3c}

TWOGROUPS <- droplevels( subset( RETURNS, Codename %in% c("Amethyst","Carmine") ) )

```

**Response:**

d) It's desired to "rank" the return rates (from highest to lowest) of the 6 most problematic Codenames: "Sangria", "Sapphire", "Puce", "Slate gray", "Orangered", "Lemon". Produce a connecting letters report. What product(s) have the highest return rate? What product(s) have the lowest return rate? Can the data discern a difference between "Puce" and "Orangered"? Explain.

```{r Q3d}

c("Sangria", "Sapphire", "Puce", "Slate gray", "Orangered", "Lemon")

GROUPS <- droplevels(subset( RETURNS, Codename %in% c("Sangria","Sapphire","Puce","Slate gray","Orangered","Lemon")))

```

**Response:**

******************

**Question 4** It is desired to fit a probability distribution to the overall sample of return rates. Since they are numbers between 0 and 1, the beta distribution is a logical choice.

a) Using `fitdist`, fit a beta distribution to the values in the `ReturnRate` column. Print to the screen the result of running `summary` on the fit, and also provide the plot from `qqcomp`.

```{r Q4a}

```

b) Discuss whether or not the beta distribution provides a reasonable fit to the data.

**Response:**

c) From the fit, the MLE estimates of the two parameters are close to 4 and 16. If we could get away with taking them to be these values, the formula for the PDF is easy to write down: $f(x) = 15504\,x^3(1-x)^{15}$.
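As a quick sanity check on that formula, the sketch below compares a hand-written Beta(4, 16) density against R's built-in `dbeta`. It assumes the rounded parameter values 4 and 16 rather than the fitted estimates, and the names `a`, `b`, `const`, and `manual` are chosen here purely for illustration.

```r
# Beta(a, b) density: f(x) = x^(a-1) * (1-x)^(b-1) / B(a, b)
a <- 4
b <- 16

# Normalizing constant: the reciprocal of the beta function B(4, 16)
const <- 1 / beta(a, b)

# Evaluate the hand-written density on a grid strictly inside (0, 1)
x      <- seq(0.01, 0.99, by = 0.01)
manual <- const * x^(a - 1) * (1 - x)^(b - 1)

# Compare against R's built-in density
print(const)
print(all.equal(manual, dbeta(x, a, b)))
```

If the two agree, the exponent on $(1-x)$ is 15 and the term multiplies $x^3$ rather than dividing it.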

Provide 95% confidence intervals for the two parameters. Comment whether 4 is a valid choice for the first parameter and whether 16 is a valid choice for the second parameter.

```{r Q4c}

```

**Response:**

******************

**Question 5** The `BABY` dataset in the global environment (after reading in the .RData file) records information about newborn babies (birthweight, gestation period, characteristics of the mother and father). The stereotype is that men typically marry younger women.

a) Is the "large enough" sample sizes condition met here for any confidence intervals we create to be reliable? Why or why not?

```{r Q5}

```

**Response:**

b) Produce a 95% confidence interval for the "true" difference in average age between husbands and wives, and comment as to whether the data supports this stereotype. Hint: revisit Question 2a to help you set up `t.test` correctly.

```{r Q5b}

```

**Response:**

c) Produce a 95% confidence interval for the median birthweight by considering order statistics. Use the `get_CI_for_percentile` function in the global environment (after reading in the .RData file) to find the relevant order statistics.

```{r Q5c}

```
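The idea behind an order-statistics interval can be sketched as follows. The code of `get_CI_for_percentile` is not shown in this document, so `median_ci_order_stats` below is a hypothetical stand-in, and the data are simulated rather than taken from `BABY`. It relies on the fact that the number of observations falling below the true median follows a Binomial(n, 0.5) distribution.

```r
# Hypothetical helper (NOT the course's get_CI_for_percentile, whose code is not shown)
median_ci_order_stats <- function(x, conf = 0.95) {
  x     <- sort(x[!is.na(x)])
  n     <- length(x)
  alpha <- 1 - conf
  # Rank of the lower order statistic; never below 1
  lower <- max(qbinom(alpha / 2, n, 0.5), 1)
  # Symmetric rank for the upper order statistic
  upper <- n - lower + 1
  # Exact probability that [x_(lower), x_(upper)] contains the true median
  coverage <- 1 - 2 * pbinom(lower - 1, n, 0.5)
  list(ranks = c(lower, upper), interval = c(x[lower], x[upper]), coverage = coverage)
}

set.seed(3)                                   # arbitrary seed for reproducibility
weights <- rnorm(200, mean = 120, sd = 18)    # simulated stand-in data, not BABY
res <- median_ci_order_stats(weights)
print(res)
```

Because the ranks must be whole numbers, the achieved confidence is usually a little above the nominal 95% rather than exactly equal to it.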

## Solution Preview


For (almost any) probability distribution, find $\mu$ (the expected value) and $\sigma$ (the standard deviation). Once you have those two numbers, you can predict how far the average ($\bar{x}$) that you'll measure in a sample (of size $n$) will be from $\mu$ (typically about $\sigma/\sqrt{n}$). You can even put probabilistic bounds on how far $\bar{x}$ will be from $\mu$ (e.g. a 90% chance that $\bar{x}$ will be within $1.65\sigma/\sqrt{n}$ of $\mu$) since its sampling distribution will be approximately Normal *regardless of whatever PDF/PMF is making the data*.
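A small simulation can make this concrete. The sketch below is not the `CLT.demo` function (its code is not shown here); it uses the `shape` vector from the solution to Q1a, and the sample size `n = 50` and the number of simulated samples `nsim = 10000` are arbitrary choices made for illustration.

```r
# Sketch: check the sigma/sqrt(n) rule by simulation (illustrative settings only)
set.seed(1)                                   # arbitrary seed for reproducibility
shape  <- c(1, 2, 1, 3, 1, 2, 1, 6, 1, 1, 2, 3, 4, 6, 5)
PMF    <- shape / sum(shape)
values <- 0:14

mu    <- sum(values * PMF)                    # expected value of the PMF
sigma <- sqrt(sum((values - mu)^2 * PMF))     # standard deviation of the PMF

n    <- 50                                    # assumed sample size
nsim <- 10000                                 # assumed number of simulated samples

# One sample average per simulated sample
xbars <- replicate(nsim, mean(sample(values, size = n, replace = TRUE, prob = PMF)))

se_theory <- sigma / sqrt(n)                  # predicted typical distance from mu
se_sim    <- sd(xbars)                        # spread of the simulated averages

# Fraction of averages within qnorm(0.95) (about 1.645) standard errors of mu
coverage <- mean(abs(xbars - mu) <= qnorm(0.95) * se_theory)

print(c(mu = mu, sigma = sigma, se_theory = se_theory, se_sim = se_sim, coverage = coverage))
```

The simulated spread should land close to $\sigma/\sqrt{n}$ and the coverage close to 0.90, even though this PMF looks nothing like a Normal curve.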

The function `CLT.demo` in the chunk at the top of this document provides code that shows the sampling distribution of the sample average for samples of various sizes. It needs an argument named `PMF`, which is a vector of values giving the probability of observing a 0, a 1, a 2, etc.

a) In the R chunk below, I have defined `shape` to be a vector of 15 ones. Replace these 1s with integers to define the shape of a custom PMF of your choosing. Try to make the shape weird! Once that is done, convert the shape into a PMF and left-arrow it into a vector named `PMF` and give a barplot of its distribution.

```{r Q1a}

shape <- c(1, 2, 1, 3, 1, 2, 1, 6, 1, 1, 2, 3, 4, 6, 5)

PMF <- shape/sum(shape)

barplot(PMF,names.arg=0:14)

```

b) Using `sample`, generate two random samples from this PMF (the first argument to `sample` will be `0:14`): one with a sample size of 25, another with a sample size of 5000. Provide histograms of each (adding the argument `breaks=-1:14` to each call of `hist`). Notice how in the smaller sample, the histogram has only a passing resemblance to the barplot of the PMF, but the histogram of the larger sample looks a lot like the PMF. **When we add more data to a sample, the histogram of its values looks more and more like the probability distribution generating the data.**

```{r Q1b}

sample1 <- sample(0:14,size=25,replace=TRUE,prob=PMF)

hist(sample1,breaks=-1:14)

sample2 <- sample(0:14,size=5000,replace=TRUE,prob=PMF)

hist(sample2,breaks=-1:14)

```
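The visual comparison in the histograms can also be put into numbers. The sketch below is an optional add-on to the solution chunk, with names such as `prop_small` and `err_large` chosen here for illustration; it tabulates the observed proportion of each value and reports the largest gap from the PMF.

```r
# Sketch: how closely do observed proportions track the PMF at each sample size?
set.seed(2)                                   # arbitrary seed for reproducibility
shape <- c(1, 2, 1, 3, 1, 2, 1, 6, 1, 1, 2, 3, 4, 6, 5)
PMF   <- shape / sum(shape)

small <- sample(0:14, size = 25,   replace = TRUE, prob = PMF)
large <- sample(0:14, size = 5000, replace = TRUE, prob = PMF)

# factor(..., levels = 0:14) keeps a zero count for values that were never drawn
prop_small <- as.numeric(table(factor(small, levels = 0:14))) / length(small)
prop_large <- as.numeric(table(factor(large, levels = 0:14))) / length(large)

# Largest absolute gap between observed proportion and true probability
err_small <- max(abs(prop_small - PMF))
err_large <- max(abs(prop_large - PMF))

print(round(rbind(PMF, prop_small, prop_large), 3))
print(c(err_small = err_small, err_large = err_large))
```

The gap for the sample of 5000 should be much smaller than the gap for the sample of 25, which is the numerical counterpart of the larger histogram looking like the PMF.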

c) Run `CLT.demo(PMF)` (in the global environment once you read in the .RData file). Note that the sampling distribution of the possible sample averages that we might measure when we go collect data looks more and more like a Normal distribution as we consider samples of larger sizes. **When we consider samples with progressively larger sizes, the histogram of the possible averages we might measure from them looks more and more like a Normal curve.**

```{r Q1c}

CLT.demo(PMF)

```