## Transcribed Text

6. Reading and Writing CSV files full of Data. It is often helpful to be able to quickly upload and download data from your computer into R and then write csv files full of data from R back to csv format. To do this we can use the funtions read.csv() and write.csv(). Data files should most easily be set up as text files with rows as cases and columns as variables. Datasets for this course will be found in this format on the course web site. Save them to a text file and use read.table to read them into R as data frames. We often need to be aware where R is going to save the files and where R is looking to upload the files from in terms of the directory where we store the information. As practice I will illustrate these concepts using teh iris dataset in R.
> # Look at the first 6 rows of the iris data frame
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
0.2 setosa
0.2 setosa
0.2 setosa
0.2 setosa
0.2 setosa
0.4 setosa
> # Look at the last 6 rows of the iris data frame
> tail(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
145 6.7 3.3 5.7
146 6.7 3.0 5.2
147 6.3 2.5 5.0
148 6.5 3.0 5.2
149 6.2 3.4 5.4
150 5.9 3.0 5.1
2.5 virginica
2.3 virginica
1.9 virginica
2.0 virginica
2.3 virginica
1.8 virginica
10
> # I'm going to create a csv file on my computer with this information
> # First I want to save the csv file to the directory ``C:/Users/David King/Desktop'' > # THIS IS MY DIRECTORY of course. Yours will be different.
> # The working directory is where R looks to upload and download files.
> # I can get my current working directory using the following command
> getwd()
[1] "C:/Users/David King/OneDrive/Stat 501"
> # To change my working directory, I use the following command
> # I want to save my directory as a character string
> direcname= getwd()
> newdirecname = "C:/Users/David King/Desktop"
> setwd(newdirecname)
>
> # Now if I were to write or read a file
> # it would save the file to that directory
Now I will write the iris dataset to a CSV file in the working directory. After you execute this command you should check your working directory to see if the csv file is there.
> write.csv(iris,"Iris.csv")
> write.csv(iris,"Iris.csv",row.names = FALSE)
We can read the csv file back into R and save it to a variable called “A” by using the following command.
> A= read.csv("Iris.csv")
> head(A)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
Let’s change our working directory back again
> setwd(direcname)
0.2 setosa
0.2 setosa
0.2 setosa
0.2 setosa
0.2 setosa
0.4 setosa
7. Graphics. R has functions to automatically plot many standard statistical graphics. Histograms and boxplots may be generated with hist and boxplot, respectively. Once you have a graphic you’re happy with, you can copy the entire thing. Make sure that the graphics window is the active (selected) window, and select ”Copy to clipboard as bitmap” from the file menu. You can then paste your figure into Word and resize to taste. There are 3 main graphics libraries to execute various graphics that I will go over: the base R way of doing graphics, the lattice library and the ggplot2 way of doing things. First I will go over how to do graphics in base R.
Let’s play around with the iris dataset first. Here is how we make a histogram if the “Petal.Length” variable in the iris dataset
> x = iris$Petal.Length
> print(x)
[1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
[19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
11
[37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
[55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
[73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
[91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1
> # another way of getting this information to x is
> x = iris[,4]
> # The following produces a histogram of the column of numbers x
> hist(x)
Histogram of x
0.0 0.5 1.0
Here is how we make a boxplot
> boxplot(x)
1.5 2.0 2.5
x
12
Frequency
0 5 10 15 20 25 30 35
The box part of the boxplot shows you graphically, the 1st quantile, median and the 3rd quantile. The inter quantile range = 3rd quantile - 1st quantile is the width of the box and it gives you a graphical sense of how much variation for that variable. The whiskers are the lines extending out from the box and the command ?boxplot can tell you more about those. When you draw many boxplots it is sometimes good to draw boxplots with notches. Here is how we make a boxplot for each species of flower.
> boxplot(Petal.Length~Species, data=iris)
# Formula description
13
0.5 1.0 1.5 2.0 2.5
setosa versicolor virginica
A boxplot with notches is more useful than those without notches. The width of the notch shows you the confidence interval associated with the median which allows you to quickly see if the differences between groups is significant.
> boxplot(Petal.Length~Species, data=iris, notch=TRUE) # Formula description
14
1234567
setosa versicolor virginica
Here is how we make a scatterplot using “Petal Length” for the Y axis and “Petal Width” for the x axis.
> plot(Petal.Length~Petal.Width, data=iris)
# Formula description
15
1234567
0.5 1.0 1.5 2.0 2.5
Petal.Width
The “ggplot2” package is an elegant package that allows you to make excelent graphics. To get the ggplot2 package to work you first need to install the ggplot2 package to your computer using the following command which I commented out.
> # install.packages("ggplot2")
Now load the package to your system using the library command as follows
> library("ggplot2")
The basic command to create many type of graphics is the ggplot() function. Unlike the base R function plot() you can store plots in R and then modify or add features to existing stored plots as you go. In our exploration of the ggplot2 package let’s play around with the diamonds data set. To look at the first 6 rows of the diamonds data set perform the following R command.
> head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
16
Petal.Length
1234567
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Now to create a plot object using ggplot2 we need to tell the computer: (a) What dataset we want to plot?
(b) What variable we want to use for the x axis?
(c) What variable we want to use for the y axis?
(d) What kind of plot or “geom” to use?
We can tell the computer the first 3 things as follows
> P=ggplot(data=diamonds,mapping=aes(x=carat,y=price))
Now the plot “P” is stored in the computer. You’ve told the system that you want to plot something with the diamonds dataset, using carat as the x-axis and price as the y-axis. At this stage you haven’t told the computer what kind of “geom” to use so it won’t plot anything yet. We can get a list of all kinds of geoms using the “??geom” command which
you might want to try and parouse through.
To tell R that you want to create a scatter plot, we select the jitter geom and modify the object P as follows
> P=P+geom_jitter()
Notice that no plot has showed up yet!! Thus far, we have mearly saved the plot in the system as “P”. If we want to plot it, we just type “P” and it prints the plot to your plot device.
> print(P)
17
15000
10000
5000
0
012345
carat
We can draw a jitter plot using different colors for each type of cut using the following R commands
> P=ggplot(data=diamonds,mapping=aes(x=carat,y=price,col=cut)) > P=P+geom_point()
>P
18
price
15000
10000
5000
0
012345
carat
cut
Fair
Good
Very Good Premium Ideal
Now if we want to have separate plots for each cut of diamond and use the variable “color” to color the dots in the graph we can use the facet grid command as follows.
> P=ggplot(data=diamonds,mapping=aes(x=carat,y=price,col=color)) > P = P+geom_jitter()+facet_grid(.~cut)
>P
19
price
Fair
Good
Very Good
Premium
Ideal
15000
10000
5000
0
color D
E F G H I J
012345012345012345012345012345
carat
Here is me adding a smoother to my scatter plots (silly me)
> P=ggplot(data=diamonds,mapping=aes(x=carat,y=price,col=color)) > P = P+geom_jitter()+geom_smooth()
>P
20
price
Fair Good
Very Good
Premium
Ideal
4e−04
3e−04
2e−04
1e−04
0e+00
0 50010000105000 0 50010000105000 0 50010000105000 0 50010000105000 0 50010000105000
price
There are many geoms available for use and you can try some. Here is an example of some density plots of the price of diamonds with a rug drawn underneath. We will go into this more later.
> K=ggplot(data=diamonds,mapping=aes(x=price))
> K = K+geom_density()+geom_rug()+facet_grid(.~cut) >K
density
21
Fair Good
Very Good
Premium
Ideal
4e−04
3e−04
2e−04
1e−04
0e+00
0 50010000105000 0 50010000105000 0 50010000105000 0 50010000105000 0 50010000105000
price
8. Normal Distribution. I will now work some problems in R relative to the normal distribution. As mentioned before we can randomly generate normally distributed random variables using the function rnorm().
> # Randomly generate 100 normally distributed random variables
> # with mean = 0 and standard deviation = 1.
> z = rnorm(100)
> # Randomly generate 100 normally distributed random variables
> # with mean = 20 and standard deviation = 5
> x = rnorm(100,20,5)
The function dnorm() is the probability density function for the normal distribution. This is the good ol’ bell function that you maybe aware of. we can give a plot of the exact probability density function from -3 to 3 using the curve() function which is displayed below.
> # Read the help screen for dnorm by typing
> # ?dnorm
> # We can plot dnorm by using
> curve(dnorm(x),from=-3,to=3)
density
22
−3 −2 −1 0 1 2 3
x
In the code below I have constructed a user-defined function which will shade in an area under the normal distribution between two x values. The function below will shade in the area under the curve of the normal distribution between x = a and x = b and also return the area under the curve. The area under the total probability density function (PDF) is 1. To get this funtion to work, just copy and paste the function into your R console and then you can use it.
> shadebetween = function(a,b){
+ # This function shades an area under the normal distribution
+ #betweenx=aandx=b.
+ # The function returns the area under the curve between these two points as well
+ xmin = min(c(a,b,-3))
+ xmax = max(c(a,b,3))
+ # Produce plot of curve
+ curve(dnorm(x),from=xmin,to=xmax)
+
+ xx = seq(a,b,length.out=100)
+ yy = dnorm(xx)
+ xx=c(a,xx,b)
+ yy=c(0,yy,0)
+ polygon(xx,yy,col="light blue")
23
dnorm(x)
0.0 0.1 0.2 0.3 0.4
+ area = abs(pnorm(b) - pnorm(a))
+ xmid = (a+b)/2
+ ymid = dnorm(xmid)/2
+ text(xmid,ymid,round(area,4))
+ return(area) +}
Here is an example. Find the probability that a standard normal random variable, Z ∼ N(0,1) is between 1 and 2.3 and shade in the area under the normal distribution. Here we are interested in the probability P(1 < Z < 2.3). We can get this by using the R command
> shadebetween(1,2.3)
[1] 0.1479311
0.1479
−3 −2 −1 0 1 2 3
x
The function pnorm(a) returns the cummulative probability that a standard random variable is less than a. I.e. this returns the integral F (a) = a f (x)dx where f (x) represents the density of the normal. This implies that we can use
−∞
the pnorm() function to give us probabiliites associated with the normal distribution. For example, if we were interested in the following probabilities, we could answer these using the following commands in R.
24
dnorm(x)
0.0 0.1 0.2 0.3 0.4
(a) Find the probability to the right of z = −0.85.
Since P (Z > −0.85) = ∞ f (x)dx = 1 − −0.85 f (x)dx = 1 − pnorm(−0.85) we have
−0.85 > prob = 1 - pnorm(-0.85)
> prob
[1] 0.8023375
−∞
> # Since most of the area of the standard normal is between -3 and 3 a good picture for this
> # can be produced with the following shadebetween command
> # (not much is lost by assuming area to left of -3 is approx 0)
> shadebetween(-3,-0.85)
[1] 0.1963126
0.1963
−3 −2 −1 0 1 2 3
x
(b) Between z = 0.40 and z = 1.30.
Since P(0.4 < Z < 1.3) = 1.3 f(x)dx = 1.3 f(x)dx−0.4 f(x)dx = F(1.3)−F(0.4) = pnorm(1.3)−pnorm(0.4)
0.4 −∞
−∞
25
dnorm(x)
0.0 0.1 0.2 0.3 0.4
> prob = pnorm(1.3) - pnorm(0.4)
> prob
[1] 0.2477778
> shadebetween(0.4,1.3)
[1] 0.2477778
0.2478
−3 −2 −1 0 1 2 3
x
(c) Between z = −0.30 and z = 0.90.
Since P(−0.3 < Z < 0.9) = 0.9 f(x)dx = 0.9 f(x)dx − −0.3 f(x)dx = F(0.9) − F(−0.3) = pnorm(0.9) −
−0.3
pnorm(−0.3)
> prob = pnorm(0.9) - pnorm(-0.3)
> prob
[1] 0.4338513
−∞ −∞
26
dnorm(x)
0.0 0.1 0.2 0.3 0.4
> shadebetween(-0.3,0.9)
[1] 0.4338513
0.4339
−3 −2 −1 0 1 2 3
x
The area inside the interval is pnorm(−0.45) − pnorm(−1.50). Since the total area under the curve is 1, the area outside the interval is 1 − (pnorm(−0.45) − pnorm(−1.50)). If we let the white area under the curve denote the answer, rather than the blue area, then we can use shadebetween(-0.45,-1.5) to give us the answer. However, the shaded region is in white rather than blue!
> prob = 1 - (pnorm(-0.45) - pnorm(-1.50))
> prob
[1] 0.740452
> shadebetween(-0.45,-1.5)
[1] 0.259548
(d) Outside of the interval z = −1.50 to z = −0.45.
27
dnorm(x)
0.0 0.1 0.2 0.3 0.4
> x=c() >
>
>
0.2595
Now it is your Turn. Here are Your Problems
−3 −2 −1 0 1 2 3
x
1. Use R as a calculator to compute the following values. After you do so, cut and paste your function input and output from R to Word. I want to see the code you used to get the answer. Add numbering in Word to identify each part of each problem. (Do this for every problem from now on.)
(a) 27(38−17) (b) ln(147)
(c) 436 12
2. Create the following vectors in R. a = (5, 10, 15, 20, . . . , 160) b = (87, 86, 85, . . . , 56) Use vector arithmetic to multiply together these vectors together element-by element and call the result d. Select subsets of d to identify the following.
28
dnorm(x)
0.0 0.1 0.2 0.3 0.4
(a) What are the 19th, 20th, and 21st elements of d?
(b) What are all of the elements of d which are less than 2000?
(c) How many elements of d are greater than 6000?
3. Using d from the previous problem, use R to compute the following statistics of d:
(a) sum (b) mean
(c) median (d) variance
(e) standard deviation
4. Let A and B denote the matrices of size (2 × 3) and (3 × 4), respectively, with
7 9 12
A= 2 4 13 and, (1)
1 7 12 19
B=2 8 13 20. (2)
3 9 14 21
Compute the matrix product AB using R.
5. For this problem you will be working with the mtcars data set. You can look at the first 6 rows of the mtcars data set
as follows
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
(a) Pick a directory folder, you’re favorite directory folder, and then write the mtcars data set to that folder using the “write.csv” function in R. Once your done writing the data to that folder then read the data back again using the “read.csv” command. Make sure to change your working directory!
(b) Create categorical factor variables for the number of cylinders and the number of gears of the car and add that those factor variables to the mtcars data set.
(c) Create the following plots using either ggplot or regular base R:
Mazda RX4
Mazda RX4 Wag
Datsun 710
Hornet 4 Drive
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Do the following
29
i. A box and whisker plot of mpg by number of cylinders ii. A strip plot of mpg by number of gears
iii. Draw a scatter plot of mpg vs weight and add a smoother.
6. Find the area under the standard normal curve
(a) To the right of z = −0.85.
(b) Between z = 0.40 and z = 1.30.
(c) Between z = −0.30 and z = 0.90. (d) Outside z = −1.50 to z = −0.45.
7. Let Z ∼ N(0,1). For each of these problems it will be helpful to draw a picture and mark the area along with the constant c along the x-axis. Find a constant c for which
(a) P(Z≤c)=0.87 (b) P(Z ≥ c) = 0.1587
(c) P(c≤Z ≤0)=0.4772 (d) P(−c≤Z ≤c)=0.8664
(e) P(0≤Z ≤c)=0.2967
(f) P(|Z|≥c)=0.1470
8. IfX∼N(2,9)compute:
(a) P(X≥2)
(b) P(1≤X<7)
(c) P(−2.5≤X<−1)
9. Penicillin is produced by the Penicillium fungus, which is grown in a broth whose sugar content must be carefully controlled. The optimum sugar concentration is 4.9 mg/mL. If the concentration exceeds 6.0 mg/mL, the fungus dies and the process must be shut down for the day.
(a) If sugar concentration in batch of broth is normally distributed with mean 4.9 mg/mL and standard deviation 0.6 mg/mL, on what proportion of days will the process shut down?
(b) The supplier offers to sell broth with a sugar content that is normally distributed with mean 5.2 mg/mL and standard deviation 0.4 mg/mL. Will this alternative broth result in fewer days of production lost? Explain with evidence.
10. The Robey dataset in the car library package looks at Fertility and Contraception in developing nations around 1990. (a) Generate side-by-side boxplots to compare contraceptors for different regions. Label the axes accordingly and
provide a title. Which region has the highest median percentage of contraceptors?
30
(b) Plot the percent of contraceptors among married women of childbearing age versus total fertility rate. Make sure to appropriately label your axes and provide a title. Is there an association between these two variables? If so, what kind of trend do you see with respect to the variables in question?
31

These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction
of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice.
Unethical use is strictly forbidden.

Question 6

# (a)

pnorm(-0.85)

## [1] 0.1976625

# (b)

pnorm(1.30) - pnorm(0.40)

## [1] 0.2477778

# (c)

pnorm(0.90) - pnorm(-0.30)

## [1] 0.4338513

# (d)

1 - (pnorm(-0.45) - pnorm(-...