---
title: "Problem Set 7 Prep"
author:
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(MASS)
library(caret)
library(DescTools)
```
The Pew science news survey will be used in the context of modeling "BLAME" on a (somewhat haphazard) collection of other responses. The interpretations of the variables are available in the codebook for the survey.
The "BLAME" variable is the selection made in response to the following question.
"Which of these do you think is the BIGGER problem when it comes to news about scientific research findings? [RANDOMIZE RESPONSE OPTIONS]
1 The way news reporters cover scientific research findings
2 The way science researchers publish or share their new research findings"
Many of the variables have ordered categorical responses coded as numbers with 3-5 categories and, in some cases, additional categories for declined responses. For the purposes of this model-building effort, these will be treated as numeric variables, although such variables are often treated as categorical when used as explanatory variables.
After some data cleaning, the data set is restricted to complete cases. This limits generalization of the results to individuals who would provide responses to the selected variables.
```{r}
load("../ps2/pew_data.RData")
nam.keep<-names(dat)[str_detect(names(dat),"TOPIC|SCIOFTEN|ENJOY|^SOURCE|KNOWLEDGE|SEEK|SNSUSE|SCIWHY|WHYNOT|NEWSJOB|NEWSFACTS|STORIES|SNSUSE|SNSFREQ|FOLLOW|FOLLOWANTI|SNSSCI|SNSSCIIMP|PROBSET|BLAME|IDEO|AGE|PPREG4|PPINCIMP|PPGENDER|PPETHM|PPEDUCAT")]
dat.trim<-dplyr::select(dat,nam.keep)
dat.trim<-dat.trim[dat.trim$BLAME>0,] # Drop non-responses in the outcome variable
dat.trim$BLAME<-(dat.trim$BLAME==1)*1 # 1: blame reporters, 0: blame scientists
dat.trim<-dplyr::select(dat.trim,-SCIWHY_f)# asked of a subset of respondents
na.ct<-lapply(dat.trim,function(x){sum(is.na(x))})# Identify variables with many NAs
# names(dat.trim)[na.ct>5]
# Replace NAs that have values that can be inferred from other variables
dat.trim$SNSFREQ[dat.trim$SNSUSE==2]<-5
dat.trim$FOLLOW[dat.trim$SNSUSE==2]<-2
dat.trim$FOLLOWANTI[dat.trim$SNSUSE==2]<-2
dat.trim$SNSSCI[dat.trim$SNSUSE==2]<-4
dat.trim$SNSSCIIMP[dat.trim$SNSSCI==4]<-4
# Continue with complete cases. This restricts generalization.
dat.trim<-drop_na(dat.trim)
level.ct<-lapply(dat.trim,function(x){length(unique(x))})
#names(dat.trim)[level.ct==2]
dat.trim<-mutate_all(dat.trim,function(x){x[x<0]<-NA;x})
dat.trim<-drop_na(dat.trim)
# Create an indicator for the response "6" in the NEWSFACTS variables.
# The value "6" codes "I don’t know enough about this type of source to rate."
newsdk<-dplyr::select(dat.trim,str_c("NEWSFACTS_",letters[1:9]))
names(newsdk)<-str_c(names(newsdk),"_dk")
newsdk<-mutate_all(newsdk,function(x){(x==6)*1})
dat.trim<-bind_cols(dat.trim,newsdk)
dat.trim$PPETHM<-factor(dat.trim$PPETHM)
dat.trim$PPREG4<-factor(dat.trim$PPREG4)
# Split into training and test data, preserving proportions of the "BLAME" responses.
set.seed(3456)
trainIndex <- createDataPartition(dat.trim$BLAME, p = .8,
list = FALSE,
times = 1)
dat.train<-dat.trim[trainIndex,]
dat.test<-dat.trim[-trainIndex,]
```
Fit a forward model on all the variables. This takes a few minutes.
```{r}
nam<-setdiff(names(dat.train),"BLAME")
m<-glm(BLAME~1,data=dat.train,family="binomial")
# Use stringr to avoid typing all the explanatory variables.
fmla<-str_c("BLAME~",str_c(nam,collapse="+"))
m.forward<-step(m,scope=fmla,direction = "forward" ,trace=0)
summary(m.forward)
```
## Select forward model with cross validation
The basic organization is to fit forward models on each fold, select the model size that has the lowest pooled cross-validated deviance, then fit the forward model of that size on the training data. The steps here follow the steps in "cross_validation_5_3_2.Rmd".
Write a function that fits forward models of up to 40 variables on a given data set and evaluates the sum of the squared prediction errors on another data set. (In the interest of readability, these functions are specific to the current problem and would need to be modified for a new application.)
```{r}
# create a formula for "BLAME" in terms of sequences of variables in "vars.add"
fmla.add.fnc<-function(i,vars.add){
vars.in<-vars.add[1:i]
return(as.formula(str_c("BLAME~",str_c(vars.in,collapse="+"))))
}
# function to calculate validation set deviance
deviance.valid<-function(m,dat.valid){
pred.m<-predict(m,dat.valid, type="response")
-2*sum(dat.valid$BLAME*log(pred.m)+(1-dat.valid$BLAME)*log(1-pred.m))
}
# function to fit a sequence of up to 40 forward models with scope "fmla.this"
# on the data set "dat.this" and
# return the deviance for each model on "dat.valid.this".
deviance.get<-function(dat.this,fmla.this,dat.valid.this){
m.forward<-step(glm(BLAME~1,data=dat.this,family="binomial"),scope=fmla.this,
k=0,steps=40, direction="forward",trace=0)
# Collect the variables used in the order in which they were added
vars.add<-m.forward$anova[,1]
vars.add<-str_replace(vars.add,"\\+ ","")
vars.add[1]<-1 # Intercept only
# Apply "fmla.add.fnc" to each value of "i". This
# gives the formulas for the models generated by initial sequences of the
# variables in "vars.add".
fmlas<-apply(matrix(1:length(vars.add),ncol=1),1,
fmla.add.fnc,vars.add=vars.add)
# Make a list of models corresponding to these formulas.
models<-
lapply(fmlas,function(x){glm(x,data=dat.this,family="binomial")})
# Calculate the deviance on "dat.valid" for each model.
return(sapply(models,deviance.valid,dat.valid=dat.valid.this))
}
```
Write a wrapper function for "deviance.get" that splits the data into training and validation sets based on the fold index "i": it calls "deviance.get" with the folds other than the ith as the training data and the ith fold as the validation data.
```{r}
deviance.wrapper<-function(i,dat.w,ind.w,fmla.w){
return(deviance.get(dat.w[ind.w!=i,],fmla.w,
dat.w[ind.w==i,]))
}
```
Write a function that creates a fold structure, applies the wrapper function to each fold, sums the corresponding model deviances, and returns the vector of sums.
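The fold-index construction used inside that function can be illustrated on its own (a base-R sketch with a hypothetical n, not the assignment's data):

```r
# Build fold labels 1..8, as equal in size as possible, then shuffle them.
n <- 23            # hypothetical number of rows
fold <- 8
ind <- rep(1:fold, ceiling(n / fold))[1:n]
set.seed(1234)
ind <- sample(ind) # random assignment of rows to folds
table(ind)         # fold sizes differ by at most 1
```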
```{r}
# deviance.sums.xv<-function(dat.this,fmla.this){
# n<-nrow(dat.this)
# fold<-8
# ind<-rep(1:fold, ceiling(n/fold))
# ind<-ind[1:n]
# ind<-sample(ind)
# xv.mat<-apply(matrix(1:8,ncol=1),1,deviance.wrapper,dat.w=dat.this,
# ind.w=ind,
# fmla.w=fmla.this)
# return(apply(xv.mat,1,sum))
# }
# set.seed(1234)
# fwd<-deviance.sums.xv(dat.train,fmla)
# save(fwd,file="forward_deviances.RData")
load("forward_deviances.RData")
which.min(fwd) # 26
plot(fwd)
forward.model.xv<-step(glm(BLAME~1,data=dat.train,family="binomial"),scope=fmla,
direction="forward",steps=which.min(fwd)-1,trace=0)
summary(forward.model.xv)
PseudoR2(forward.model.xv,which="McFadden")
nrow(forward.model.xv$anova)
forward.model.comp<-step(glm(BLAME~1,data=dat.train,family="binomial"),scope=fmla,
direction="forward",steps=29,trace=0)
forward.model.comp$anova[,1]
summary(forward.model.comp)
```
---
title: "Problem Set 7"
author:
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
  word_document: default
  pdf_document: default
  html_document:
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(GGally)
library(ggpubr)
library(lmtest)
library(MASS)
library(caret)
library(DescTools)
library(glmnet)
```
## Introduction
Please complete the following tasks regarding the data in R. Please generate a solution document in R markdown and upload the .Rmd document and a rendered .doc, .docx, or .pdf document. Please turn in your work on Canvas. Your solution document should have your answers to the questions and should display the requested plots. Also, please upload the .RData file you generate.
## Question 1
The Pew science news survey will be used in the context of modeling "BLAME" on a (somewhat haphazard) collection of other responses. The interpretations of the variables are available in the codebook for the survey.
The "BLAME" variable is the selection made in response to the following question.
"Which of these do you think is the BIGGER problem when it comes to news about scientific research findings? [RANDOMIZE RESPONSE OPTIONS]
1 The way news reporters cover scientific research findings
2 The way science researchers publish or share their new research findings"
The interpretations of the remaining variables are available in the codebook for the survey. Many of the variables have ordered categorical responses coded as numbers with 3-5 categories and, in some cases, additional categories for declined responses. For the purposes of this model-building effort, these will be treated as numeric variables, although such variables are often treated as categorical when used as explanatory variables.
After some data cleaning, the data set is restricted to complete cases. This limits generalization of the results to individuals who would provide responses to the selected variables.
```{r}
load("../ps2/pew_data.RData")
nam.keep<-names(dat)[str_detect(names(dat),"TOPIC|SCIOFTEN|ENJOY|^SOURCE|KNOWLEDGE|SEEK|SNSUSE|SCIWHY|WHYNOT|NEWSJOB|NEWSFACTS|STORIES|SNSUSE|SNSFREQ|FOLLOW|FOLLOWANTI|SNSSCI|SNSSCIIMP|PROBSET|BLAME|IDEO|AGE|PPREG4|PPINCIMP|PPGENDER|PPETHM|PPEDUCAT")]
dat.trim<-dplyr::select(dat,nam.keep)
dat.trim<-dat.trim[dat.trim$BLAME>0,] # Drop non-responses in the outcome variable
dat.trim$BLAME<-(dat.trim$BLAME==1)*1 # 1: blame reporters, 0: blame scientists
dat.trim<-dplyr::select(dat.trim,-SCIWHY_f)# asked of a subset of respondents
na.ct<-lapply(dat.trim,function(x){sum(is.na(x))})# Identify variables with many NAs
# names(dat.trim)[na.ct>5]
# Replace NAs that have values that can be inferred from other variables
dat.trim$SNSFREQ[dat.trim$SNSUSE==2]<-5
dat.trim$FOLLOW[dat.trim$SNSUSE==2]<-2
dat.trim$FOLLOWANTI[dat.trim$SNSUSE==2]<-2
dat.trim$SNSSCI[dat.trim$SNSUSE==2]<-4
dat.trim$SNSSCIIMP[dat.trim$SNSSCI==4]<-4
# Continue with complete cases. This restricts generalization.
dat.trim<-drop_na(dat.trim)
level.ct<-lapply(dat.trim,function(x){length(unique(x))})
#names(dat.trim)[level.ct==2]
dat.trim<-mutate_all(dat.trim,function(x){x[x<0]<-NA;x})
dat.trim<-drop_na(dat.trim)
# Create an indicator for the response "6" in the NEWSFACTS variables.
# The value "6" codes "I don’t know enough about this type of source to rate."
newsdk<-dplyr::select(dat.trim,str_c("NEWSFACTS_",letters[1:9]))
names(newsdk)<-str_c(names(newsdk),"_dk")
newsdk<-mutate_all(newsdk,function(x){(x==6)*1})
dat.trim<-bind_cols(dat.trim,newsdk)
dat.trim$PPETHM<-factor(dat.trim$PPETHM)# Ethnic group
dat.trim$PPREG4<-factor(dat.trim$PPREG4) # region of the country
# Split into training and test data, preserving proportions of the "BLAME" responses.
set.seed(3456)
trainIndex <- createDataPartition(dat.trim$BLAME, p = .8,
list = FALSE,
times = 1)
dat.train<-dat.trim[trainIndex,]
dat.test<-dat.trim[-trainIndex,]
```
The following fits a forward model for "BLAME" on all the variables.
```{r cache=TRUE}
nam<-setdiff(names(dat.train),"BLAME")
m<-glm(BLAME~1,data=dat.train,family="binomial")
# Use stringr to avoid typing all the explanatory variables.
fmla<-str_c("BLAME~",str_c(nam,collapse="+"))
m.forward<-step(m,scope=fmla,direction = "forward" ,trace=0)
summary(m.forward)
```
The number of variables added in the full forward model:
```{r}
nrow(m.forward$anova)-1
```
### Q1, part 1
Using cross validation to select the number of steps yields a range of values. Repetition of the code in "ps7_prep.Rmd" with different seeds yielded values between 12 and 26, corresponding to 11 and 25 variables. Please fit the models using the first 11 variables added and the first 25. (Note that you don't have to run forward selection again if you capture the variables added in "m.forward") (10 points)
```{r}
# m11<- your code here
# m25<- your code here
```
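One possible approach, sketched here on simulated data since "m.forward" and "dat.train" are not rebuilt in this block: capture the order in which `step()` added variables from the `anova` component of the fitted object, then fit a model on the first k of them.

```r
set.seed(1)
d.sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d.sim$y <- rbinom(200, 1, plogis(d.sim$x1 - d.sim$x2))
# Forward selection, playing the role of "m.forward" above
full <- step(glm(y ~ 1, data = d.sim, family = "binomial"),
             scope = y ~ x1 + x2 + x3, direction = "forward", trace = 0)
# The first column of $anova records the variables in the order added ("+ x1", ...);
# the first row is the intercept-only start, so drop it.
vars.add <- sub("\\+ ", "", full$anova[-1, 1])
k <- min(2, length(vars.add))   # for the assignment, k would be 11 or 25
m.k <- glm(reformulate(vars.add[1:k], response = "y"),
           data = d.sim, family = "binomial")
```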
### Q1, part 2
Using the "m25" model, please estimate the probability that "BLAME"=1 in "dat.train" for the observations in rows 306, 1601, and 2365 (selected to have similar estimates close to .5). If you resample the values of the coefficients 1000 times according to the multivariate Normal distribution with mean equal to the fitted coefficients for "m25" and covariance equal to their estimated covariance matrix, and re-estimate the probabilities using the new coefficients, what proportion of the probabilities for case 2365 are less than .5? (10 points)
```{r}
```
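A sketch of the resampling step on a toy model (hypothetical data; the first three rows play the role of rows 306, 1601, and 2365, and "m25" is not rebuilt here). `mvrnorm` from MASS, already loaded in the setup block, draws the coefficient vectors.

```r
library(MASS)
set.seed(2)
d.sim <- data.frame(x = rnorm(300))
d.sim$y <- rbinom(300, 1, plogis(0.5 * d.sim$x))
m.sim <- glm(y ~ x, data = d.sim, family = "binomial")
X.sim <- model.matrix(m.sim)[1:3, , drop = FALSE]   # design rows for the three cases
# Draw 1000 coefficient vectors from N(coef, vcov)
beta.sim <- mvrnorm(1000, mu = coef(m.sim), Sigma = vcov(m.sim))
p.sim <- plogis(beta.sim %*% t(X.sim))              # 1000 x 3 resampled probabilities
mean(p.sim[, 3] < .5)   # proportion below .5 for the third case
```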
### Q1, part 3
In what proportion of the resampled estimates are the estimated probabilities for cases 306, 1601, and 2365 in the same relative order as the original estimates, $p_{306}\geq p_{1601}\geq p_{2365}$? (10 points)
```{r}
```
## Question 2
This question addresses the same modeling problem as question 1, but uses ridge and lasso regression.
### Q2, part 1
Please use "cv.glmnet" to fit a cross-validated lasso logistic regression model of "BLAME" on the remaining variables in "dat.train". What variables have non-zero coefficients at the value of $\lambda$ designated as "lambda.1se" by "cv.glmnet" (the largest $\lambda$ whose cross-validated deviance is within one standard error of the minimum)? (10 points)
```{r}
```
### Q2, part 2
Please find the deviance on "dat.train" and "dat.test" of the cross-validated lasso model corresponding to the value of "lambda.1se" above and the model "m25" fit in Q1, part 1. Based on the deviances, which model performs better on the training data? Based on the deviances, which model performs better on the test data (actually treating it as validation data)? (10 points)
Set up:
```{r}
```
lasso
```{r}
```
m25
```{r}
```
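When computing the deviances by hand, a useful sanity check (on simulated data, not the survey) is that the formula used in "deviance.valid" reproduces the training deviance reported by `glm`:

```r
set.seed(3)
d.chk <- data.frame(x = rnorm(100))
d.chk$y <- rbinom(100, 1, plogis(d.chk$x))
m.chk <- glm(y ~ x, data = d.chk, family = "binomial")
p.chk <- predict(m.chk, d.chk, type = "response")
# For 0/1 outcomes the saturated log likelihood is 0, so the deviance is -2*loglik.
dev.hand <- -2 * sum(d.chk$y * log(p.chk) + (1 - d.chk$y) * log(1 - p.chk))
all.equal(dev.hand, m.chk$deviance)   # TRUE
```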
---
title:
subtitle:
author:
output:
  word_document: default
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
RNGversion("3.6.2")
library(MASS)
library(tidyverse)
library(leaps)
library(ggpubr)
library(GGally)
library(caret)
library(glmnet)
library(corrplot)
library(survival)
library(survminer)
```
## Instructions
Please complete the questions on this template and upload your solutions in a single knitted Word or pdf document. Please also upload your completed template.
In light of the exam context, the data sets for the questions have been generated either clearly to satisfy or clearly to violate the requirements of the statistical procedures. If reasonable exploratory analysis is done, there should be little ambiguity as to whether the given data satisfy those requirements. This is unrealistic, but less stressful for students and graders alike. For consistency, the random number generation version is specified in the setup block above.
## Question 1, model selection for multiple regression (36 points total)
The data sets "dat1train", "dat1valid", and "dat1test", loaded below, have explanatory variables "x1", "x2", and "x3" and a numeric outcome variable, "y". The explanatory variables have been standardized to have sample mean equal to 0 and sample standard deviation equal to 1 on the training data. (The data set "dat1train", prior to standardization, was used in question 5 on the midterm for spring 2020.)
The goal is to use the validation set to select a linear regression model or penalized linear regression model from among those fit on the training data, then to estimate the generalization error by applying the selected model to predict the outcomes on the test data.
The parts of this question fit a backward model selected by cross-validation, two lasso models, and 9 best subsets models.
The plot below may be used to examine the adequacy of the size of the training set.
```{r}
load("dat1train.RData")
load("dat1valid.RData")
load("dat1test.RData")
size.mat<-matrix(seq(50,nrow(dat1train),by=10),ncol=1)
mse.get<-function(sz){
m.this<-lm(y~.,data=dat1train[1:sz,])
mse.train.this<-mean(m.this$residuals^2)
mse.valid.this<-mean((predict(m.this,dat1valid)-dat1valid$y)^2)
return(c(mse.train.this,mse.valid.this))
}
mses.mat<-apply(size.mat,1,mse.get)
dat.mses<-data.frame(t(mses.mat))
names(dat.mses)<-c("training","validation")
dat.mses$size.train<-size.mat
dat.mses<-pivot_longer(dat.mses,cols=training:validation,
names_to = "data.set",values_to = "mse")
ggplot(dat.mses,aes(x=size.train,y=mse,color=data.set))+geom_line()
```
### Q1, part 1, Backward cross-validation (9 points)
Code for repeated 10-fold cross-validation (caret suite) to select a size of backward multiple regression model for "y" on the remaining variables in "dat1train", their pairwise interactions, and their squares is provided, along with code to generate the fitted backward model with the selected number of variables. Please evaluate the performance on the validation set "dat1valid" using mean squared prediction error. (The code is not generously commented; understanding the effect of the code is part of the assignment.)
```{r}
fmla<-as.formula("y~(x1+x2+x3)^2+I(x1^2)+I(x2^2)+I(x3^2)")
mat.train<-as.data.frame(model.matrix(fmla,dat1train))
mat.train$`(Intercept)`<-dat1train$y
names(mat.train)[1]<-"y"
set.seed(1234)
BackwardFit <- train(
y~.,
data = mat.train,
method = "leapBackward",
trControl = trainControl(method = "repeatedcv", repeats = 10, number=10),
tuneLength=9
)
m.backward<-step(lm(fmla,data=dat1train),scope=as.formula("y~1"),
direction="backward",k=50,
steps=9-(BackwardFit$bestTune$nvmax-1),trace=0)
summary(m.backward)
#(mse.valid<-#### your code here #### )
```
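The mean squared prediction error computation itself is short; a sketch on toy data (not "dat1valid"):

```r
set.seed(5)
d.toy <- data.frame(x = rnorm(120))
d.toy$y <- 1 + 2 * d.toy$x + rnorm(120)
train.toy <- d.toy[1:80, ]
valid.toy <- d.toy[81:120, ]
m.toy <- lm(y ~ x, data = train.toy)
# Mean squared prediction error on the held-out rows
mse.toy <- mean((predict(m.toy, valid.toy) - valid.toy$y)^2)
```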
### Q1, part 2, lasso regression (9 points)
The code below uses cross-validated lasso regression via "cv.glmnet" to fit lasso-penalized linear models of "y" as a function of the remaining variables, their pairwise interactions, and their squares on the training set. Please report the mean squared error on the validation set of the models corresponding to "lambda.1se" and "lambda.min".
```{r}
# Format data for glmnet.
dat1all<-bind_rows(dat1train,dat1valid,dat1test)
mat.all<-model.matrix(fmla,dat1all)
X<-mat.all[,-1]
Xtrain<-X[1:300,]
Xvalid<-X[301:400,]
Xtest<-X[401:500,]
ytrain<-dat1train$y
yvalid<-dat1valid$y
ytest<-dat1test$y
# Fit lasso-penalized models
set.seed(5678)
cvfit = cv.glmnet(x=Xtrain, y=ytrain,alpha=1)
plot(cvfit)
cvfit$lambda.min
cvfit$lambda.1se
# Calculate the mean squared error on the validation data for the "lambda.min" model and the "lambda.1se" model.
cvpred<-predict(cvfit,Xvalid,c(cvfit$lambda.min,cvfit$lambda.1se))
```
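Since "cvpred" above has one column per value of `s`, the two validation MSEs can be taken column-wise; with toy numbers (not the glmnet output):

```r
# Hypothetical two-column prediction matrix and outcomes
pred.toy <- cbind(min = c(1.0, 2.0, 3.0), one.se = c(1.1, 2.1, 3.1))
yv.toy <- c(1, 2, 3)
colMeans((pred.toy - yv.toy)^2)   # one MSE per column
```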
### Q1, part 3 (9 points)
The code below generates the list "models" of best subsets models for subsets of size 1 through 9 fitted on "dat1train". Please select the best model according to its mean squared error on the validation data, "dat1valid", state the number of variables, the mean squared error, and the summary of the model.
```{r}
best<-regsubsets(x=Xtrain,y=ytrain,method="exhaustive",
nvmax=9,nbest=1)
subsets<-summary(best)$which
varnames<-attr(subsets,"dimnames")[[2]]
varnames[1]<-"1"
vars<-apply(matrix(1:nrow(subsets),ncol=1),1,function(x){varnames[subsets[x,]]})
models<-lapply(vars,function(x){fmla<-str_c("y~",str_c(x,collapse = "+"))
return(lm(fmla,data = dat1train))})
## Select and display the best model according to the validation data.
```
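Selecting from a list of fitted models by validation MSE can be sketched as follows (toy data; the chunk's "models" and "dat1valid" are not rebuilt here):

```r
set.seed(6)
d.toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d.toy$y <- d.toy$x1 + rnorm(100)
train.toy <- d.toy[1:70, ]
valid.toy <- d.toy[71:100, ]
models.toy <- list(lm(y ~ 1, data = train.toy),
                   lm(y ~ x1, data = train.toy),
                   lm(y ~ x1 + x2, data = train.toy))
# Validation MSE for each candidate, then pick the smallest
mses.toy <- sapply(models.toy,
                   function(m) mean((predict(m, valid.toy) - valid.toy$y)^2))
best.toy <- which.min(mses.toy)
summary(models.toy[[best.toy]])
```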
### Q1, part 4 (9 points)
Of the cross-validated backward model, the lambda.min and the lambda.1se lasso, and the best subsets models, which model has the best mean squared error on the validation set? Please show the summary of the model fit on the combined training and validation sets using the variables, or the lambda in the case of lasso, that produced the selected model. What is the mean squared error of this new fitted model on the test set?
```{r}
```
## Question 2, logistic regression (45 points total)
The code below generates ridge-penalized logistic regression models for the outcome variable "y" in the file "ytrain2.RData" based on the explanatory variables in "Xtrain2.RData". Please report the confusion matrix, accuracy, precision (proportion of correctly predicted 1's among the cases predicted to be 1's by the model), recall (proportion of correctly predicted 1's among cases with outcome equal to 1), and F1 (2(recall)(precision)/(recall + precision)) when the lambda.min model is used to predict values of "y" on the test data, "ytest2.RData" and "Xtest2.RData".
```{r }
load("Xtrain2.RData")
load("Xtest2.RData")
load("ytrain2.RData")
load("ytest2.RData")
set.seed(34567)
cvfit = cv.glmnet(x=Xtrain, y=ytrain,family="binomial",alpha=0)
plot(cvfit)
coef(cvfit, s = "lambda.min")
cvpred<-predict(cvfit,Xtest,s=cvfit$lambda.min, type="response")
cvpred.val<-cvpred>=.5
```
#### The confusion matrix (9 points):
```{r}
# your confusion matrix
```
#### Accuracy (9 points):
```{r}
# your accuracy
```
#### Precision (9 points):
```{r}
# your precision
```
#### Recall (9 points):
```{r}
# your recall
```
#### F1 (9 points):
```{r}
# your F1
```
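For reference, the definitions above applied to a toy prediction vector (invented values, not the assignment data):

```r
y.true <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)
y.hat  <- c(1, 0, 1, 0, 1, 1, 0, 0, 1, 0)
table(predicted = y.hat, actual = y.true)   # confusion matrix
accuracy  <- mean(y.hat == y.true)
precision <- sum(y.hat == 1 & y.true == 1) / sum(y.hat == 1)
recall    <- sum(y.hat == 1 & y.true == 1) / sum(y.true == 1)
F1        <- 2 * recall * precision / (recall + precision)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = F1)
```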
## Question 3, count data (9 points)
In "dat3basic", loaded below, the values of "y" represent counts of occurrences of a phenomenon. Which model(s) seem most suited to predicting "y" from "x" among Poisson regression, quasipoisson regression, negative binomial regression, and linear regression? Please explain your reasoning on the basis of the data, summaries of the models, and any other simple diagnostics you choose.
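One diagnostic that often settles the choice between plain Poisson and the overdispersion-tolerant alternatives is the Pearson dispersion statistic; a sketch on simulated overdispersed counts (not "dat3basic"):

```r
set.seed(7)
x.sim <- rnorm(200)
# Negative binomial counts: variance exceeds the mean, so Poisson is misspecified
y.sim <- rnbinom(200, mu = exp(1 + 0.5 * x.sim), size = 1.5)
m.pois <- glm(y.sim ~ x.sim, family = "poisson")
dispersion <- sum(residuals(m.pois, type = "pearson")^2) / m.pois$df.residual
dispersion   # values well above 1 point away from plain Poisson
```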
#### Poisson
```{r}
load("dat3basic.RData")
qplot(data=dat3basic,x,y)
m1<-glm(y~x,data=dat3basic, family="poisson")
summary(m1)
### optional supplementary diagnostics ###
```
#### Quasipoisson
```{r}
m2<-glm(y~x,data=dat3basic,family="quasipoisson")
summary(m2)
### optional supplementary diagnostics ###
```
#### Negative binomial
Note that the standard deviation for the negative binomial is $\sqrt{\mu+\frac{\mu^2}{\theta}}$.
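A quick numeric check of that formula by simulation (base R):

```r
set.seed(4)
mu <- 4; theta <- 2
draws <- rnbinom(1e5, mu = mu, size = theta)
sqrt(mu + mu^2 / theta)   # theoretical sd, sqrt(12)
sd(draws)                 # sample sd should be close
```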
```{r}
library(MASS)
m.nb<-glm.nb(y~x, data=dat3basic)
summary(m.nb)
### optional supplementary diagnostics ###
```
#### Linear regression
```{r}
m3<-lm(y~x, data=dat3basic)
summary(m3)
resid.standard<-stdres(m3)
### supplementary calculations ###
```
## Question 4, survival (10 points total)
The data below represent survival data with the time to death or conclusion of the study in the variable "death". The "status" variable records a "1" for death and a "0" for censoring. A categorical explanatory variable "group" is included.
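The time/status coding described above, shown in a minimal `Surv` object (toy values, not "dat4.RData"):

```r
library(survival)
# event = 1 for death, 0 for censoring, as in the "status" variable
s <- Surv(time = c(2, 5, 5, 8), event = c(1, 0, 1, 1))
s   # the censored observation prints as "5+"
```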
```{r}
load("dat4.RData")
fit <- survfit(Surv(death,status) ~ group,
data = dat.surv)
ggsurvplot(fit)
```
### Q4, part 1 (5 points)
Exponential and Weibull AFT models are fit below, and some visualization is provided. Which model(s), if any, is (are) a good fit for the data?
Fitted models:
```{r}
model.weib<-survreg(Surv(death,status)~group, dist="weibull",data=dat.surv)
summary(model.weib)
model.exp<-survreg(Surv(death,status)~group, dist="exponential",data=dat.surv)
summary(model.exp)
```
Kaplan-Meier plot with predictions:
```{r}
plot(survfit(Surv(death,status) ~ group,
data = dat.surv),lty=c(1,3,5),xlab="Time",ylab="Survival Probability")
legend("bottomleft", c("a", "b","c"), lty = c(1,3,5))
points(seq(0,10,by=.2),
1-psurvreg(seq(0,10,by=.2),mean=model.weib$coefficients[1],
scale=model.weib$scale),type="l",lty=1)
points(seq(0,10,by=.2),
1-pexp(seq(0,10,by=.2),rate=1/exp(model.exp$coefficients[1])),
type="l", lty=1,col="gray")
points(seq(0,10,by=.2),
1-psurvreg(seq(0,10,by=.2),mean=sum(model.weib$coefficients[c(1,2)]),
scale=model.weib$scale),type="l",lty=3)
points(seq(0,10,by=.2),
1-psurvreg(seq(0,10,by=.2),mean=sum(model.exp$coefficients[c(1,2)]),
distribution = "exponential"),type="l",lty=3,col="gray")
points(seq(0,10,by=.2),
1-psurvreg(seq(0,10,by=.2),mean=sum(model.weib$coefficients[c(1,3)]),scale=model.weib$scale),
type="l",lty=5)
points(seq(0,10,by=.2),
1-psurvreg(seq(0,10,by=.2),mean=sum(model.exp$coefficients[c(1,3)]),
distribution = "exponential"),type="l",lty=5,col="gray")
```
### Q4, part 2 (5 points)
Does this analysis provide evidence that the survival probability over time differs by group for these data?