## Question

a) Read in the “airquality.csv” data and save it in an object called air. Examine the missing values in the data set using the md.pattern() command to answer the following:

i. How many entries are missing altogether?

ii. What is the variable with the most missing values?

iii. Which sets of variables are missing at the same time? How many times are they missing?

b) Find the mean of each of the variables in the air quality data set using pairwise deletion.

c) Initialize a new data set called air.median from the air data set (for example: air.median<-air). Impute the missing solar radiation values with the MEDIAN of the non-missing radiations in the air.median data set.

d) Initialize a new data set called air.mean from the air.median data set. Impute each missing temperature value in the air.mean data set with the mean temperature for the month in which it is missing. For example, impute missing month 5 temperature values with the mean of the non-missing temperatures for month 5.

e) Initialize a new data set called air.ratio from the air.mean data set. Impute the missing values of the Ozone variable using ratio imputation in the air.ratio data set (let the correlated complete variable be temperature).

f) Initialize a new data set called air.complete from the air.ratio data set. Use linear regression to impute the missing values of Wind using Ozone as the independent variable in the air.complete data set.

g) Check the air.complete data set for missing values. (If you have done the question correctly you should have no NAs remaining!) Find the mean of each of the variables in the air.complete data set.

h) Starting with the original “airquality.csv” data set, use the mice package in R to impute missing data using m = 5 and seed = 2. Save the imputed data set in an object called imputeddata. Extract each of the five sets of imputed values using the complete() function and then create a data frame called air.complete2 with all of the imputed data sets (the data frame should have 765 rows). Find the mean of each of the variables in the air.complete2 data set.

i) Compare the means of all of the variables in the air.complete and air.complete2 data sets from part g) and part h). Do you think one set of imputed values is better than the other?
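The multiple-imputation workflow described in part h) can be sketched as below. This is only an illustration under an assumption: R's built-in `airquality` data frame stands in for “airquality.csv”, so the resulting means will differ from the assignment's if the CSV holds different values. The stacking step uses `rbind` over the five completed data sets, which is one of several ways to combine them.

```r
library(mice)

# Assumption: the built-in airquality data stands in for "airquality.csv"
air <- airquality

# Multiple imputation with m = 5 imputed data sets and a fixed seed
imputeddata <- mice(air, m = 5, seed = 2, printFlag = FALSE)

# Extract each completed data set and stack them into one data frame
completed     <- lapply(1:5, function(i) mice::complete(imputeddata, i))
air.complete2 <- do.call(rbind, completed)

nrow(air.complete2)      # 5 x 153 = 765 rows
colMeans(air.complete2)  # mean of each variable over the stacked data
```

Because the five completed data sets differ only in the imputed cells, the stacked means average over the imputation uncertainty rather than relying on a single filled-in value.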

## Solution Preview


# 1.

airquality <- read.csv("airquality.csv")

head(airquality)

names(airquality)

# [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

dim(airquality)

# [1] 153 6

library(mice)

## a)

(md <- md.pattern(airquality, plot = TRUE, rotate.names = FALSE))

#    Month Day Temp Solar.R Wind Ozone
# 96     1   1    1       1    1     1  0
# 31     1   1    1       1    1     0  1
# 11     1   1    1       1    0     1  1
# 3      1   1    1       1    0     0  2
# 5      1   1    1       0    1     1  1
# 2      1   1    1       0    1     0  2
# 2      1   1    0       1    1     1  1
# 1      1   1    0       1    1     0  2
# 2      1   1    0       1    0     1  2
#        0   0    5       7   16    37 65

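### i)

# Total number of missing entries (also the bottom-right cell of the pattern table)

sum(is.na(airquality))

# [1] 65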

### ii)

# `Ozone` has the most missing values (37).

### iii)

row.names(md)

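# [1] "96" "31" "11" "3" "5" "2" "2" "1" "2"

# Each row name is the number of rows sharing one missingness pattern. Reading the patterns with two missing variables from the table in a): Wind and Ozone are missing together 3 times, Solar.R and Ozone 2 times, Temp and Ozone once, and Temp and Wind 2 times.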
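The counts read off the md.pattern() table can be cross-checked in base R. The sketch below uses a small toy data frame with made-up values (not the assignment data); `toy`, `na.counts`, `total.na` and `co.missing` are illustrative names only.

```r
# Toy data with hypothetical values (not the assignment data)
toy <- data.frame(
  Ozone = c(41, NA, 12, NA, NA),
  Wind  = c(7.4, NA, NA, 11.5, 14.9),
  Temp  = c(67, 72, 74, NA, 66)
)

miss <- is.na(toy) + 0             # 0/1 indicator matrix of missing cells

na.counts <- colSums(miss)         # missing values per variable
total.na  <- sum(miss)             # missing entries altogether
worst     <- names(which.max(na.counts))  # variable with the most NAs

# Entry [i, j] counts the rows where variables i and j are both missing;
# the diagonal repeats the per-variable counts.
co.missing <- crossprod(miss)

na.counts
total.na
worst
co.missing
```

The off-diagonal entries of `co.missing` answer "which variables are missing at the same time, and how often" for pairs of variables; patterns involving three or more variables still have to be read from the pattern table.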

## b)

colMeans(airquality, na.rm=TRUE)

#      Ozone    Solar.R       Wind       Temp      Month        Day
#  42.129310 185.931507   9.978832  77.912162   6.993464  15.803922
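To see what pairwise deletion means for column means, it may help to compare it with listwise deletion on a toy example. The data below are made up for illustration; with `na.rm = TRUE` each column mean uses every observed value in that column, whereas `na.omit()` first drops any row containing an NA.

```r
# Toy data with hypothetical values (not the assignment data)
toy <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# Pairwise (available-case) deletion: each mean uses its own observed values
pairwise.means <- colMeans(toy, na.rm = TRUE)

# Listwise (complete-case) deletion: only rows 1 and 4 are kept
listwise.means <- colMeans(na.omit(toy))

pairwise.means
listwise.means
```

Here the pairwise mean of `x` is based on three values and the listwise mean on two, so the two approaches generally give different answers when missingness is spread across columns.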

## c)

air.median <- airquality

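# Replace only the missing Solar.R entries with the median of the observed values.

# (Assigning the median to the whole column, or median-imputing every column with apply(), would overwrite observed data and turn the data frame into a matrix.)

air.median$Solar.R[is.na(air.median$Solar.R)] <- median(air.median$Solar.R, na.rm = TRUE)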

head(air.median)...
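The preview stops here. As a rough guide to the techniques named in parts d) to f), the sketch below applies them to a toy data frame with made-up values; it is not the purchased solution. The ratio step uses one common form of ratio imputation (ratio of the observed totals, taken over rows where Ozone is observed), which is an assumption about the intended method.

```r
# Toy data with hypothetical values (not the assignment data)
toy <- data.frame(
  Month = c(5, 5, 5, 6, 6, 6),
  Temp  = c(60, NA, 64, 80, 84, NA),
  Ozone = c(30, 31, NA, 40, 44, 41),
  Wind  = c(10, NA, 9, 8, 7, 8)
)

# d) Month-wise mean imputation: ave() returns, for each row, the mean of
#    the non-missing Temp values in that row's month
temp.na    <- is.na(toy$Temp)
month.mean <- ave(toy$Temp, toy$Month, FUN = function(x) mean(x, na.rm = TRUE))
toy$Temp[temp.na] <- month.mean[temp.na]

# e) Ratio imputation: Ozone is taken to be proportional to Temp, with the
#    ratio estimated from rows where Ozone is observed
ozone.na <- is.na(toy$Ozone)
ratio    <- sum(toy$Ozone[!ozone.na]) / sum(toy$Temp[!ozone.na])
toy$Ozone[ozone.na] <- ratio * toy$Temp[ozone.na]

# f) Regression imputation: fit Wind ~ Ozone on rows with observed Wind
#    (lm() drops NA rows by default), then predict the missing ones
wind.na <- is.na(toy$Wind)
fit     <- lm(Wind ~ Ozone, data = toy)
toy$Wind[wind.na] <- predict(fit, newdata = toy[wind.na, , drop = FALSE])

anyNA(toy)     # FALSE once every step has run
colMeans(toy)
```

The order matters: Temp has to be complete before it can serve as the auxiliary variable for Ozone, and Ozone has to be complete before it can predict Wind, which mirrors the chain air.median, air.mean, air.ratio, air.complete in the question.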
