Description. In this project, you will conduct a complete data analysis of a real biological dataset. To do so, you will use the methods and techniques we have learned in this course, particularly those from the data analysis portions of the homeworks.

The Dataset. The dataset we will analyze is the Plasma Retinol dataset.

Questions of Interest. The ultimate goal of every data analysis is to answer questions that are useful in a practical sense. The questions you should address in this data analysis are listed below. Answer each one as specifically as you can.

Question 1. What are the true average plasma retinol and (log) plasma beta-carotene levels in the population?

Question 2. Is there a difference in plasma retinol level between males and females?

Question 3. Is there a relationship between grams of fat and grams of fiber consumed per day?

Question 4. Is there a relationship between smoking status and gender?

Question 5. Are the vitamin use categories equally common in the population? (Vitamin Use categories: Yes, fairly often; Yes, not often; and No)

Question 6. Does plasma retinol level differ across smoking statuses? (Smoking Status categories: Never, Former, and Current Smoker)

Outline. Your data analysis should be fully typed and should be in a report format. Your data analysis report should consist of the following sections:

1. Introduction. In the introduction section, you will briefly describe the dataset. You should describe the study from which the data originated, state the number of variables and observations in the dataset, briefly introduce the questions of interest, and describe in words the population from which the data were (possibly theoretically) drawn.

2. Exploratory data analysis. In this section, you will provide the appropriate numerical and graphical summaries for each question of interest. You should describe all plots and numerical summaries that you obtain (similar to the data analysis portions of the homeworks). See the "short guide" found later in this handout for more details.

3. Checking assumptions. Before using formal procedures to address the questions of interest, check whether the assumptions required by these methods hold for our data. Not all assumptions are checkable, and not all assumptions will necessarily hold.

Tip. If an assumption is not checkable, you should mention this fact, and then provide your best guess as to whether or not the assumption holds, given the provided information. If an assumption clearly does not hold, then you should prepare to use an alternative method (e.g., a nonparametric test if the normality assumption does not hold). If no alternative method is available, then you should state that the analysis results may not be valid, and proceed with the current method.

4. Formal procedures. In this section, you will either construct a confidence interval or conduct the appropriate hypothesis test (or both, if possible), to address each question of interest. Again, see the "short guide" for details.

5. Conclusion. Summarize your findings. You should state here your answer to each question of interest based on the data analysis you have performed. If the assumptions for a particular analysis were either uncheckable or not met, you should state so. Which of your findings were the most interesting?

A Short Guide of What to Use When

This guide will help you figure out what to do in order to perform the specific analyses (Steps 2-4) for each question of interest. For each question of interest, you should determine the corresponding type of problem it falls under from the list below.

One sample problem, quantitative variable

Graphical summaries: Histogram, boxplot

Numerical summaries: mean, standard deviation

Formal procedures: Confidence interval (with t); and/or one-sample t test [only if there is a claim to test]
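
As a rough illustration of the numerical summaries and the t-based confidence interval for this type of problem, here is a minimal sketch. Python is used only as an assumption about software, the function name is made up for this example, and the data are hypothetical values, not the Plasma Retinol data. The multiplier t* is not computed; it must be looked up in a t table for n - 1 degrees of freedom.

```python
import math
import statistics

def t_confidence_interval(values, t_critical):
    """Return (mean, sd, lower, upper) for a t-based confidence interval.

    t_critical is the t* multiplier for n - 1 degrees of freedom,
    looked up from a t table or software (it is not computed here).
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample standard deviation (divides by n - 1)
    margin = t_critical * sd / math.sqrt(n)  # margin of error
    return mean, sd, mean - margin, mean + margin

# Hypothetical measurements, NOT taken from the Plasma Retinol dataset.
sample = [2.0, 4.0, 4.0, 5.0, 5.0]

# 2.776 is the tabulated 95% multiplier for 4 degrees of freedom.
mean, sd, lower, upper = t_confidence_interval(sample, t_critical=2.776)
print(f"mean = {mean:.3f}, sd = {sd:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

If there is a claimed value to test, the one-sample t statistic uses the same pieces: (mean - claimed value) / (sd / sqrt(n)).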

One sample problem, categorical variable

Graphical summaries: Barplot

Numerical summaries: Frequency, relative frequency, percentage relative frequency (for each category)

Formal procedures: Confidence interval (with p̃ and Z) [only if there are exactly two categories]; and/or chi-square goodness-of-fit test [only if there is a claim to test]
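
The frequency summaries and the goodness-of-fit statistic for this type of problem can be sketched as follows. This is only an illustration under stated assumptions: Python is an assumed choice of software, the function names are made up, the category labels "A", "B", "C" and their counts are hypothetical, and the claim being tested is assumed to be "all categories are equally likely".

```python
from collections import Counter

def frequency_table(labels):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(labels)
    n = len(labels)
    return {category: (count, count / n) for category, count in counts.items()}

def chi_square_gof(observed, claimed_proportions):
    """Return the chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    statistic = sum(
        (obs - n * p) ** 2 / (n * p)          # (observed - expected)^2 / expected
        for obs, p in zip(observed, claimed_proportions)
    )
    return statistic, len(observed) - 1

# Hypothetical category labels, NOT taken from the Plasma Retinol dataset.
labels = ["A"] * 30 + ["B"] * 20 + ["C"] * 10

table = frequency_table(labels)
observed = [table[category][0] for category in ("A", "B", "C")]

# Assumed claim: the three categories are equally likely.
statistic, df = chi_square_gof(observed, [1 / 3, 1 / 3, 1 / 3])
print(table)
print(f"chi-square = {statistic:.3f} on {df} degrees of freedom")
```

The statistic is then compared with a chi-square table at the stated degrees of freedom.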

Two sample problem, quantitative variable

Graphical summaries: Side-by-side boxplots

Numerical summaries: Mean and standard deviation for each sample

Formal procedures: Confidence interval for the difference, and two-sample t test [pick the right scenario]; or Wilcoxon-Mann-Whitney (WMW) test [if normality assumption is not met]
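
As one possible illustration of the two-sample t test, the sketch below computes the statistic for the unequal-variance (unpooled) scenario only; whether that scenario applies to your data is for you to decide. Python is an assumed choice of software, the function name is made up, and both samples are hypothetical.

```python
import math
import statistics

def unpooled_two_sample_t(sample_a, sample_b):
    """Return (t statistic, approximate df) for the unequal-variance t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances divided by sample sizes (squared standard errors).
    va = statistics.variance(sample_a) / n_a
    vb = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical samples, NOT taken from the Plasma Retinol dataset.
group_a = [5.0, 7.0, 9.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_stat, df = unpooled_two_sample_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df is approximately {df:.2f}")
```

A confidence interval for the difference uses the same standard error: (mean_a - mean_b) plus or minus t* times sqrt(va + vb), with t* taken from a t table.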

Three+ sample problem, quantitative variable

Graphical summaries: Side-by-side boxplots

Numerical summaries: Mean and standard deviation for each sample

Formal procedures: ANOVA test
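
The ANOVA F statistic compares variation between the group means with variation within the groups. A minimal sketch of that computation follows, with Python as an assumed choice of software, a made-up function name, and three hypothetical groups.

```python
def one_way_anova_f(groups):
    """Return (F statistic, numerator df, denominator df) for one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in groups]

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Hypothetical groups, NOT taken from the Plasma Retinol dataset.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F = {f_stat:.3f} on ({df_between}, {df_within}) degrees of freedom")
```

The F statistic is then compared with an F table at the two degrees of freedom shown.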

Two categorical variables

Graphical summaries: Not covered in this course (don’t need to do it)

Numerical summaries: 2-Way Frequency table

Formal procedures: Chi square test for independence
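
For two categorical variables, the test statistic is built from the two-way frequency table by comparing each observed count with the count expected under independence (row total times column total, divided by the grand total). A minimal sketch, assuming Python as the software, a made-up function name, and a hypothetical 2-by-3 table:

```python
def chi_square_independence(table):
    """Return (chi-square statistic, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical two-way frequency table (rows and columns are categories),
# NOT taken from the Plasma Retinol dataset.
counts = [
    [10, 20, 30],
    [20, 20, 20],
]

statistic, df = chi_square_independence(counts)
print(f"chi-square = {statistic:.3f} on {df} degrees of freedom")
```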

Two quantitative variables

Graphical summaries: Scatterplot (with least squares line)

Numerical summaries: Coefficients of correlation and determination, r and R²

Formal procedures: Test the slope β₁
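
The least squares line, r, R², and the t statistic for testing whether the slope is zero can all be computed from the same sums of squares. The sketch below shows one way to do this; Python is an assumed choice of software, the function name is made up, and the (x, y) pairs are hypothetical.

```python
import math

def simple_regression_summary(x, y):
    """Return intercept, slope, r, R^2, slope t statistic, and df for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)

    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    t_stat = slope / se_slope          # tests the claim that the true slope is 0
    return intercept, slope, r, r ** 2, t_stat, n - 2

# Hypothetical paired measurements, NOT taken from the Plasma Retinol dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope, r, r_squared, t_stat, df = simple_regression_summary(x, y)
print(f"line: y = {intercept:.2f} + {slope:.2f} x, r = {r:.3f}, R^2 = {r_squared:.3f}")
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```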

Note. We did not cover how to check assumptions for regression problems (two quantitative variables). Therefore, you do not need to worry about checking assumptions for any question of interest that involves using regression techniques.
