Question

1. Read the airquality.csv data set into R. This data set records the ozone level, solar radiation, wind speed, and temperature (in degrees Fahrenheit) over five months (from month 5 = May to month 9 = September).

a) Read in the “airquality.csv” data and save it in an object called air. Examine the missing values in the data set using the md.pattern() command to answer the following:

i. How many entries are missing altogether?
ii. What is the variable with the most missing values?
iii. Which sets of variables are missing at the same time? How many times are they missing?

b) Find the mean of each of the variables in the air quality data set using pairwise deletion.

c) Initialize a new data set called air.median from the air data set (for example: air.median<-air). Impute the missing solar radiation values with the MEDIAN of the non-missing radiations in the air.median data set.

d) Initialize a new data set called air.mean from the air.median data set. In the air.mean data set, impute each missing temperature value with the mean temperature of the month in which it is missing. For example, impute missing month 5 temperature values with the mean of the non-missing temperatures for month 5.

e) Initialize a new data set called air.ratio from the air.mean data set. Impute the missing values of the Ozone variable using ratio imputation in the air.ratio data set (let the correlated complete variable be temperature).

f) Initialize a new data set called air.complete from the air.ratio data set. Use linear regression to impute the missing values of Wind using Ozone as the independent variable in the air.complete data set.

g) Check the air.complete data set for missing values. (If you have done the question correctly you should have no NAs remaining!) Find the mean of each of the variables in the air.complete data set.

h) Starting with the original “airquality.csv” data set, use the mice package in R to impute missing data using m = 5 and seed = 2. Save the imputed data set in an object called imputeddata. Extract each of the five sets of imputed values using the complete() function and then create a data frame called air.complete2 with all of the imputed data sets (the data frame should have 765 rows). Find the mean of each of the variables in the air.complete2 data set.

i) Compare the means of all of the variables in the air.complete and air.complete2 data sets from part g) and part h). Do you think one set of imputed values is better than the other?

Solution Preview


# 1.
airquality <- read.csv("airquality.csv")
head(airquality)
names(airquality)
# [1] "Ozone"   "Solar.R" "Wind" "Temp"    "Month"   "Day"
dim(airquality)
# [1] 153   6
library(mice)
## a)
(md <- md.pattern(airquality, plot = TRUE, rotate.names = FALSE))
#    Month Day Temp Solar.R Wind Ozone
# 96     1   1    1       1    1     1  0
# 31     1   1    1       1    1     0  1
# 11     1   1    1       1    0     1  1
# 3      1   1    1       1    0     0  2
# 5      1   1    1       0    1     1  1
# 2      1   1    1       0    1     0  2
# 2      1   1    0       1    1     1  1
# 1      1   1    0       1    1     0  2
# 2      1   1    0       1    0     1  2
#        0   0    5       7   16    37 65
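### i)
# The bottom-right cell of the pattern table is the total number of missing entries.
sum(is.na(airquality))
# [1] 65

### ii)
# `Ozone` has the most missing values (37).

### iii)
# The row names give the number of rows that show each missingness pattern.
row.names(md)
# [1] "96" "31" "11" "3" "5" "2" "2" "1" "2"
# Reading the patterns with two missing entries (last column = 2):
# - Wind and Ozone are missing together in 3 rows.
# - Solar.R and Ozone are missing together in 2 rows.
# - Temp and Ozone are missing together in 1 row.
# - Temp and Wind are missing together in 2 rows.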
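
# A possible way to count joint missingness directly, sketched on a small
# made-up data frame (`toy.miss` and its values are hypothetical, not taken
# from airquality.csv). crossprod() on the 0/1 missingness indicator gives,
# for each pair of variables, the number of rows in which both are NA.
toy.miss <- data.frame(
  Ozone = c(NA, 10, NA, 30),
  Wind  = c(NA, 5, 7, NA),
  Temp  = c(60, NA, 70, 80)
)
na.ind <- is.na(toy.miss) * 1      # 1 = missing, 0 = observed
(co.missing <- crossprod(na.ind))  # diagonal: NAs per variable; off-diagonal: joint NAs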

## b)
colMeans(airquality, na.rm=TRUE)
#      Ozone    Solar.R       Wind       Temp      Month        Day
#  42.129310 185.931507   9.978832  77.912162   6.993464  15.803922
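
# For comparison, a sketch of how pairwise (available-case) deletion differs
# from listwise deletion, on a made-up data frame (`toy.del` is hypothetical).
# Pairwise deletion uses every observed value of each variable; listwise
# deletion first drops every row that contains any NA.
toy.del <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))
pairwise.means <- colMeans(toy.del, na.rm = TRUE)  # x from rows 1, 2, 4; y from rows 1, 3, 4
listwise.means <- colMeans(na.omit(toy.del))       # only rows 1 and 4 are used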

## c)
air.median <- airquality
# Replace only the missing Solar.R values with the median of the observed ones;
# all other columns are left untouched for the later parts.
solar.na <- is.na(air.median$Solar.R)
air.median$Solar.R[solar.na] <- median(air.median$Solar.R, na.rm = TRUE)
head(air.median)...
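
The preview stops at part c). As a rough sketch of how parts d) to f) might proceed, the block below applies group-mean, ratio, and regression imputation to a small made-up data frame. The object `toy.imp`, its values, and the helper names are hypothetical and not part of the original solution. The ratio estimator used here (ratio of the two means over rows where Ozone is observed, multiplied by Temp) is one common definition and may differ from the one intended in the course.

```r
# Toy data (made up): two months, with NAs in Temp, Ozone, and Wind.
toy.imp <- data.frame(
  Month = c(5, 5, 5, 6, 6, 6),
  Temp  = c(60, NA, 64, 70, 74, NA),
  Ozone = c(30, 32, NA, 35, NA, 40),
  Wind  = c(10, NA, 9, 8, 9, 7)
)

# d) Group-mean imputation: replace each missing Temp with its month's mean.
month.mean <- ave(toy.imp$Temp, toy.imp$Month,
                  FUN = function(x) mean(x, na.rm = TRUE))
miss.temp <- is.na(toy.imp$Temp)
toy.imp$Temp[miss.temp] <- month.mean[miss.temp]

# e) Ratio imputation: Ozone is imputed as ratio.hat * Temp, where ratio.hat
#    is estimated from the rows in which Ozone is observed.
obs.oz <- !is.na(toy.imp$Ozone)
ratio.hat <- mean(toy.imp$Ozone[obs.oz]) / mean(toy.imp$Temp[obs.oz])
toy.imp$Ozone[!obs.oz] <- ratio.hat * toy.imp$Temp[!obs.oz]

# f) Regression imputation: predict missing Wind from Ozone.
fit <- lm(Wind ~ Ozone, data = toy.imp)  # lm() drops rows with NA by default
miss.wind <- is.na(toy.imp$Wind)
toy.imp$Wind[miss.wind] <- predict(fit, newdata = toy.imp[miss.wind, ])

anyNA(toy.imp)  # FALSE once all three steps have run
```

The order matters: Temp is completed before it is used as the auxiliary variable for Ozone, and Ozone is completed before it is used as the predictor for Wind, which mirrors the air.mean, air.ratio, air.complete chain in the question.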