Question

Transcribed Text

You will investigate a dataset involving 100m performances for Male and Female track and field athletes. You will also explore the US States dataset. Any time that I use the words {Present, State, Give, Show, Predict, Display}, you must supply that graphic in your submission. If I say {Produce, Make}, you do not need to show what you produced or made, but you still need to do it. Benchmarks of 95% confidence and 5% significance can be used unless otherwise specified.

A US States

Load the US State data the same way I did in my sample code. You’ll have to bind the regions to the states data frame; you can find them in the object state.division. There should be 9 levels for this factor.

1. It’s usually a good idea to look at some summary statistics before building a model.
   (a) Compute a new column called Density = 1000 × Population / Area.
   (b) Present a correlation matrix showing the correlations between Life Expectancy, Density, Income, Murder, HS Grad and Frost, in that order.
   (c) We might need to do some transformations down the road. In preparation, present a histogram of each of these six numerical variables, laid out in a 3x2 grid.
   (d) It’s nice to know what the bivariate relationships look like. Produce a scatterplot of Life Expectancy vs. each of Density, Income, Murder, HS Grad and Frost. For a sixth plot, make a boxplot with Life Expectancy along the y-axis and Region along the x-axis. You can use with(df, boxplot(y ~ x)) to accomplish this, and you do not need to show these 6 plots in your submission, because ...
   (e) It looks like we’re in need of some transformations. Redo the last six plots, this time using the log of Density. Present your plots in a 3x2 layout, on one full page in your report.
   (f) Everything looks good, we just need to make some indicator variables for the factor (Region). Actually, R can do this for us, but it’s a good exercise. Make indicators (1s and 0s) for all of the nine levels. You can store them as columns in the data frame.

2. Now it’s time to fit an MLR model. Let’s try to predict the Life Expectancy of a State using other demographic information.
   (a) Fit a linear model with Life Expectancy as the response, and log(Density), Income, Murder, HS graduation rate, Frost, and all of the Region indicators as predictors.
   (b) Oops, that was naive of us. If we want to use all of these indicators, we should fit a model with no intercept. Do that, and present the coefficient table (from summary.lm) using a fixed-width font. Those p-values (for the factor) aren’t too helpful with a no-intercept model. But, looking at the boxplots, it seems like we could get away with having an indicator solely for West North Central. Fit a reduced model with only that indicator (and all the other predictors too), and an intercept. The inclusion of an intercept will group the other eight regions into one group.
   (c) Perform a partial F-test to see if the eight other regions are equivalent with respect to mean life expectancy. Give the test statistic, df, and p-value, as well as a conclusion in plain English.
   (d) Good, let’s go with this reduced model. Identify the non-significant predictors (at the 5% level) and fit another reduced model without them. Compare to the model from the previous part with a partial F-test. Give p-value and conclusion.
   (e) Write out the equation for your final model, including the fitted coefficients you estimated using Least Squares. It should look like this except with actual numbers and variable names: Ŷ = b0 + b1 · X1 + ...
   (f) Give an interpretation, in plain English, of the coefficient of Murder. (1 mark)
   (g) Give an interpretation, in plain English, of the coefficient of the indicator of West North Central region.

3. Let’s check the assumptions for this final model.
   (a) Show a plot of residuals vs. fitted values and a Normal QQ plot for your MLR model, on a 1x2 layout. Do you have any major concerns with them? If yes, say what your concerns are. If not, say why the plots look OK.
   (b) Show a plot of the leverage vs. DFFITs (absolute value).
   (c) There looks to be a point with pretty high DFFIT. Which state is that?
   (d) There looks to be a point with pretty high leverage. Which state is that?

B Athletics

Now let’s shift our attention over to the Athletics dataset.

1. Let’s prepare the 100m data for analysis.
   (a) Read in the two files, and add a column to each indicating the sex of the athlete. Bind the two files together.
   (b) Some of the times were run at high altitude, and have an ‘A’ in the time. Delete all of these rows, and convert the time to numeric.
   (c) Keep only the columns country, date, birth, wind, time and sex.
   (d) You’ll have to format those dates, but you’ll run into a problem with the 2-digit years. R thinks ‘69’ and later is 1900s, and ‘68’ and earlier is 2000s. There are a few ways to fix this; any of them are fine. Hint: subtract 100 years from all dates in a certain range, or append the century prefix based on a condition.
   (e) Make two factors in your data frame for the Year and Month the race was run.
   (f) Once that is done, compute the age of the athlete (in Years) and store as a column. You should note that there are 365.25 days in a year, on average. Delete rows for athletes with a missing age.
   (g) We’ve also got to fix the wind measurements. First, replace all commas with a decimal point.
   (h) Next, readings close to zero have a ± sign beside them in this file. You could try to remove it, or just search for numbers that also have ‘0.0’ and overwrite them with 0. That should leave you with all numbers, plus a few missing values. Delete rows without wind.
   (i) Finally, 100m times are highly dependent on wind. In order to compare them, we should adjust the times for wind. A positive wind is a tailwind, and the correction is estimated to be 0.05s added to a man’s time for every 1.0 m/s of wind (0.06s for a woman’s time). Convert the times to what they would equivalently be with no wind (so, for example, 11.00s in a +1.5m/s wind becomes 11.09s for a female). The time could also get faster if it’s a headwind (negative).

2. Let’s prepare the 1500m data for analysis.
   (a) Read in the two files, and add a column to each indicating the sex of the athlete. Bind the two files together.
   (b) Some of the times were run at high altitude, and have an ‘A’ in the time. Delete all of these rows, and convert the time to numeric. Keep only the columns country, date, birth, time and sex.
   (c) Format the dates as above (with Year and Month factors too) and compute the age of the athlete (in Years) and store as a column. There seems to be one woman who ran a race 3 years before she was born. That’s certainly a typo, so delete that row. At this point, you can delete anyone whose age is missing as well.
   (d) You should convert the times to seconds as well. R stores time objects as a time and a date (the current date you created the time) so if you just subtract ‘00:00’ from your time and cast as numeric, you’ll have the time in minutes.
   (e) Luckily, wind is not a factor in this race. Well, it is, but it’s not recorded and it rarely helps.

3. Now you have two data frames with the same columns (if you ignore wind from the 100m data). Bind these data frames together into one frame, without wind. Before you do that, you should create a column in each called Race with an appropriate label, so that you can tell which times came from which race. Present the summary statistics obtained from summary() for this finished data frame, neatly formatted in your report.

4. Let’s see if there’s a season for running fast times in each of these races. Give a barplot (grouped by sex) for each race, showing the counts of top times by Month, in order, aggregated over all years. You should have two plots; one for each race. Make sure all of the months appear on the x-axis. Hint: If you’re using ddply(), you can pass it an argument .drop = F so the unused levels don’t get dropped.

5. Let’s have a look at what countries are producing fast times, for each race and sex. Present a separate two-way table for each race with Country along the rows and Sex across the columns, showing the counts of top times. Don’t show any countries that don’t have counts for either sex, and order the countries in decreasing order of total count for both sexes. Show only the top 25 countries for each race.

6. We will examine the relationship between 100m and 1500m performance for a country, to see if they are related.
   (a) Make a new data frame, containing a column for Country, a column for Sex, and two columns with the total counts of top times for each race. Subset this data frame by country/sex pairings that have at least one count for both races. If you did this correctly, you should have 45 rows.
   (b) Present a scatterplot showing the log of counts for 100m top times along the x-axis, and log of counts for 1500m top times along y. Use a different colour (shade) and plotting symbol for male and female counts, and put the line of best fit for each sex through the respective points. This line should be coloured the same way (or shade or line type if you don’t want to print in colour) and the plot should have a legend.
   (c) Finally, fit a model regressing the log of 1500m counts on the log of 100m counts and sex. Start off with a model with both main effects and an interaction, and test the interaction with a partial F-test. If it is not significant, remove it and test the additive model, removing anything that is not significant. Give the fitted equation for your final model and interpret the parameter estimate(s) in plain English.

C Format

Please make your submission look nice. This means:
• Proper sentences free of typing errors
• Graphs and tables should be in the body of the report, not thrown in at the end
• R Code should be appended to the assignment, in a small fixed-width font like courier new

Missing any of these items will forfeit all of the format marks.
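To make the wind correction in B.1(i) concrete, here is a minimal sketch in R. The function name adjust_for_wind is hypothetical, and the 0/1 sex coding (0 = male, 1 = female) is an assumption chosen to match the solution preview; the rates of 0.05s and 0.06s per 1.0 m/s come from the assignment text.

```r
# Hypothetical helper: convert a 100m time to its zero-wind equivalent.
# Assumes sex is coded 0 = male, 1 = female (an assumption, not part of the data spec).
adjust_for_wind <- function(time, wind, sex) {
  # Seconds of correction per 1.0 m/s of wind, as stated in the assignment
  rate <- ifelse(sex == 1, 0.06, 0.05)
  # Tailwind (positive) makes the adjusted time slower, headwind (negative) makes it faster
  time + rate * wind
}

# The assignment's own example: 11.00s in a +1.5 m/s wind for a female athlete
adjust_for_wind(11.00, 1.5, 1)
```

Because ifelse() is vectorised, the same call should work on whole columns, for example adjust_for_wind(df$time, df$wind, df$sex) for a data frame df with those three columns.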

Solution Preview


# B Athletics
# 1.
# (a)
# getwd()
# setwd([workplace])
tnf100M <- read.csv("tnf100M.csv", header=TRUE)
tnf100W <- read.csv("tnf100W.csv", header=TRUE)
names(tnf100M)
# [1] "ID"       "time"    "wind"    "Name"    "country" "birth"    "heat"    "Location" "date"   
tnf100M$sex <- 0  # 0 = male (men's file)
tnf100W$sex <- 1  # 1 = female (women's file)
tnf100 <- rbind(tnf100M, tnf100W)

# (b)

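# Remove high-altitude marks (times containing an "A").
# Logical indexing with grepl() keeps every row when nothing matches.
tnf100 <- tnf100[!grepl("A", tnf100$time), ]
# Convert time from factor/character to numeric
tnf100$time <- as.numeric(as.character(tnf100$time))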
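A small sketch on made-up data comparing two ways of doing the filtering in step (b): negative indexing with grep() and logical indexing with grepl(). The data frames and helper names below are hypothetical. When no time contains an "A", grep() returns an empty vector and the negative index selects zero rows, whereas !grepl() keeps every row.

```r
# Toy data (made up for illustration): one frame with an altitude mark, one without
some_altitude <- data.frame(time = c("9.58", "9.69A", "9.72"), stringsAsFactors = FALSE)
no_altitude   <- data.frame(time = c("9.58", "9.72"), stringsAsFactors = FALSE)

# Negative integer indexing: fine when there is a match, empty result when there is none
keep_with_grep <- function(df) df[-grep("A", df$time), , drop = FALSE]

# Logical indexing: behaves the same whether or not anything matches
keep_with_grepl <- function(df) df[!grepl("A", df$time), , drop = FALSE]

nrow(keep_with_grep(some_altitude))   # altitude row dropped
nrow(keep_with_grepl(some_altitude))  # altitude row dropped
nrow(keep_with_grep(no_altitude))     # all rows lost
nrow(keep_with_grepl(no_altitude))    # all rows kept
```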

# (c)
tnf100 <- tnf100[,c("country","date","birth","wind","time","sex")]

# (d)
# install.packages("lubridate")
library(lubridate)...
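The hint in B.1(d) (move dates in an impossible range back by a century) can be sketched in base R as follows. This is an independent illustration, not the code from the purchased solution: the function name fix_century, the "%d.%m.%y" input format, and the cutoff date are all assumptions, since the preview does not show how the date column is formatted.

```r
# Hypothetical helper: parse 2-digit-year dates and push impossible future dates back 100 years.
# Assumes input like "12.05.68" (day.month.year); adjust fmt to the real file format.
fix_century <- function(x, fmt = "%d.%m.%y", cutoff = as.Date("2020-01-01")) {
  # %y maps 00-68 to 20xx and 69-99 to 19xx
  d <- as.Date(x, format = fmt)
  # Any date after the cutoff cannot be a real race or birth date in this data (assumption)
  too_late <- !is.na(d) & d > cutoff
  # Rebuild those dates with a "19" century prefix instead of "20"
  d[too_late] <- as.Date(format(d[too_late], "19%y-%m-%d"), format = "%Y-%m-%d")
  d
}

fix_century(c("12.05.68", "16.08.09", "01.01.75"))
```

Rebuilding the string avoids date arithmetic, but a 29 February that does not exist in the target year would come back as NA, so it is worth checking for new missing values after the fix.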

