Problem Set 5
This problem uses data from the Prevention of REnal and Vascular END-stage Disease (PREVEND)
study. Clinical and demographic data for 4,095 individuals are stored in the prevend dataset in
the oibiostat package.
Body mass index (BMI) is a measure of body fat that is based on both height and weight. The
World Health Organization and National Institutes of Health define a BMI of over 25.0 as
overweight; this guideline is typically applied to adults in all age groups. However, a recent
study has reported that individuals of ages 65 or older with the greatest mortality risk were
those with BMI lower than 23.0, while those with BMI between 24.0 and 30.9 were at lower risk
of mortality. These findings suggest that the ideal weight-for-height in older adults may not be
the same as in younger adults.
Explore the relationship between BMI (BMI) and age (age), using the same sample of 500
individuals from the prevend data as used in the Unit 6 Labs. The code to create prevend.sample
is provided in the problem set template.
a) Create a plot that shows the association between BMI and age. Based on the plot, comment
briefly on the nature of the association.
b) Fit a linear regression model to relate BMI and age.
i. Write the equation of the linear model.
ii. Interpret the slope and intercept values in the context of the data. Comment on
whether the intercept value has any interpretive meaning in this setting.
iii. Is it valid to use the linear model to estimate BMI for an individual who is 30 years old?
Explain your answer.
iv. According to the linear model, estimate the average BMI for an individual who is 60
years old.
v. Based on the linear model, how much does BMI differ, on average, between an individual
who is 70 years old versus an individual who is 50 years old?
c) Create residual plots to assess the model assumptions of linearity, constant variability, and
normally distributed residuals. In your assessment of whether an assumption is reasonable,
be sure to clearly reference and interpret relevant features of the appropriate plot.
i. Assess linearity.
ii. Assess constant variance.
iii. Assess normality of residuals.
iv. Suppose that a point is located in the uppermost right corner on a Q-Q plot of residuals
(from a linear model). In one sentence, describe where that point would necessarily be
located on a scatterplot of the data.
d) Conduct a formal hypothesis test of no association between BMI and age, at the α = 0.05
significance level. Summarize your conclusions.
e) Report the R² of the linear model relating BMI and age. Based on the R² value, briefly
comment on whether you think the estimated average BMI values calculated in part b) are
accurate.
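The workflow in parts b) through e) can be sketched in base R. The following is a minimal illustration only: it uses simulated stand-in data rather than prevend.sample, and the names sim, bmi, fit, pred.60, and diff.70.50, as well as the numbers used to generate the data, are assumptions made for this sketch.

```r
# Simulated stand-in data (an assumption for illustration; NOT the PREVEND sample)
set.seed(1)
n   <- 500
age <- round(runif(n, 36, 75))
bmi <- 24 + 0.05 * age + rnorm(n, sd = 4)
sim <- data.frame(age = age, bmi = bmi)

# Scatterplot of the response against the predictor
plot(bmi ~ age, data = sim, xlab = "Age (years)", ylab = "BMI")

# Fit the simple linear model and extract intercept and slope
fit <- lm(bmi ~ age, data = sim)
b   <- coef(fit)   # b["(Intercept)"] and b["age"]

# Predicted mean response at age 60
pred.60 <- predict(fit, newdata = data.frame(age = 60))

# Average difference between ages 70 and 50 is the slope times 20
diff.70.50 <- unname(b["age"]) * (70 - 50)

# Residual plot (linearity, constant variance) and Q-Q plot (normality)
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit))
qqline(resid(fit))

# t-test for the slope and R-squared of the model
slope.row <- summary(fit)$coefficients["age", ]
r.sq      <- summary(fit)$r.squared
```

With the real data, the same calls would be applied to prevend.sample with BMI as the response and age as the predictor; whether predictions at a given age are valid still depends on the range of ages observed in the sample.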
This problem uses data from the National Health and Nutrition Examination Survey (NHANES),
a survey conducted annually by the US Centers for Disease Control (CDC). The data can be
treated as if it were a simple random sample from the American population. The dataset
nhanes.samp.adult.500 in the oibiostat package contains data for 500 participants ages 21 years
or older that were randomly sampled from the complete NHANES dataset that contains 10,000
observations.
Regular physical activity is important for maintaining a healthy weight, boosting mood, and
reducing risk for diabetes, heart attack, and stroke. In this problem, you will be exploring
the relationship between weight (Weight) and physical activity (PhysActive) using the data in
nhanes.samp.adult.500. Weight is measured in kilograms. The variable PhysActive is coded Yes
if the participant does moderate or vigorous-intensity sports, fitness, or recreational activities,
and No if otherwise.
a) Explore the data.
i. Identify how many individuals are physically active.
ii. Create a plot that shows the association between weight and physical activity. Describe
what you see.
b) Fit a linear regression model to relate weight and physical activity. Report the estimated
coefficients from the model and interpret them in the context of the data.
c) Report a 95% confidence interval for the slope parameter and interpret the interval in the
context of the data. Based on the interval, is there sufficient evidence at α = 0.05 to reject
the null hypothesis of no association between weight and physical activity?
d) Suppose that upon seeing the results from part c), your friend claims that these data rep-
resent evidence that being physically active promotes weight loss. Do you agree with your
friend? Explain your answer.
e) In the context of these data, would you prefer to conduct inference using the linear regres-
sion approach or the two-sample t-test approach? Explain your answer.
f) Suppose that the estimated slope coefficient from the model were positive (and statistically
significant). Propose at least two possible explanations for such a trend.
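Parts b), c), and e) rest on the fact that a regression with a two-level predictor and a pooled-variance two-sample t-test give the same inference. The sketch below shows this on simulated data; the data frame sim, the group sizes, and the group means are assumptions for illustration and do not come from NHANES.

```r
# Simulated stand-in data (an assumption for illustration; NOT the NHANES sample)
set.seed(2)
active <- factor(rep(c("No", "Yes"), times = c(220, 280)))
weight <- ifelse(active == "Yes", 80, 84) + rnorm(500, sd = 18)
sim    <- data.frame(weight, active)

# Counts per group and a side-by-side boxplot
table(sim$active)
boxplot(weight ~ active, data = sim, ylab = "Weight (kg)")

# Regression with a binary predictor: the intercept is the mean of the
# reference group ("No"); the slope is mean("Yes") minus mean("No")
fit <- lm(weight ~ active, data = sim)
est <- coef(fit)

# 95% confidence interval for the slope
ci <- confint(fit, "activeYes", level = 0.95)

# Two-sample t-test with pooled variance; this matches the regression
# slope test up to the sign of the t-statistic
tt    <- t.test(weight ~ active, data = sim, var.equal = TRUE)
fit.t <- summary(fit)$coefficients["activeYes", "t value"]
fit.p <- summary(fit)$coefficients["activeYes", "Pr(>|t|)"]
```

Note that t.test() without var.equal = TRUE runs the Welch test, which does not assume equal variances and therefore will not match the regression output exactly.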
The file low_bwt.Rdata contains information for a random sample of 100 low birth weight infants
born in two teaching hospitals in Boston, Massachusetts. The data appear in Table B7 in Principles
of Biostatistics, 2nd ed., by Pagano and Gauvreau. (Mother's age is present in the dataset but not
documented in the table.)
The dataset contains the following variables:
- birthwt: the weight of the infant at birth, measured in grams
- gestage: the gestational age of the infant at birth, measured in weeks
- momage: the mother's age at the birth of the child, measured in years
- toxemia: recorded as Yes if the mother was diagnosed with toxemia during pregnancy, and
No otherwise
- length: length of the infant at birth, measured in centimeters
- headcirc: head circumference of the infant at birth, measured in centimeters
The condition toxemia, also known as preeclampsia, is characterized by high blood pressure and
protein in urine by the 20th week of pregnancy; left untreated, toxemia can be life-threatening.
a) Fit a linear model estimating the association between birth weight and toxemia status.
i. Write the model equation.
ii. Report a 95% confidence interval for the slope and interpret the interval.
b) Using graphical summaries, explore the relationship between birth weight and toxemia status,
birth weight and gestational age, and gestational age and toxemia. Summarize your findings.
c) Fit a multiple regression model with toxemia and gestational age as predictors of birth
weight.
i. Evaluate whether the assumptions for linear regression are reasonably satisfied.
ii. Interpret the coefficients of the model, and comment on whether the intercept has a
meaningful interpretation.
iii. Write the model equation and predict the average birth weight for an infant born to a
mother diagnosed with toxemia with gestational age 31 weeks.
iv. The simple regression model and multiple regression model disagree regarding the
nature of the association between birth weight and toxemia. Briefly explain the reason
behind the discrepancy. Which model do you prefer for understanding the relationship
between birth weight and toxemia, and why?
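The way a simple and a multiple regression can disagree about the direction of an association may be easier to see on simulated data. In the sketch below, the variables y, z, and group and all numeric settings are assumptions chosen so that the second predictor is related to both the group and the response; none of it is taken from low_bwt.Rdata.

```r
# Simulated stand-in data (an assumption for illustration; NOT the low_bwt data)
set.seed(3)
n     <- 1000
group <- factor(rep(c("No", "Yes"), each = n / 2))

# z is higher on average in the "Yes" group, and y increases with z
z <- ifelse(group == "Yes", 31, 28) + rnorm(n, sd = 2)
y <- 100 + 80 * z - 150 * (group == "Yes") + rnorm(n, sd = 100)
sim <- data.frame(y, z, group)

# Unadjusted model: the group coefficient absorbs the effect of z
simple <- lm(y ~ group, data = sim)

# Adjusted model: the group coefficient compares groups at the same z
multiple <- lm(y ~ group + z, data = sim)

b.simple   <- coef(simple)["groupYes"]
b.multiple <- coef(multiple)["groupYes"]

# Predicted mean response for a new observation with group = "Yes", z = 31
new.obs <- data.frame(group = factor("Yes", levels = levels(sim$group)), z = 31)
pred    <- predict(multiple, newdata = new.obs)
```

In this simulation the unadjusted group coefficient is positive while the adjusted one is negative, because the groups differ in z. Which model is preferable for a real question depends on whether the comparison of interest is between groups overall or between groups at the same value of the second predictor.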
The National Health and Nutrition Examination Survey (NHANES) is a yearly survey conducted
by the US Centers for Disease Control. This question uses the nhanes.samp.adult.500 dataset in
the oibiostat package, which consists of information on a subset of 500 individuals ages 21 years
and older from the larger NHANES dataset.
Poverty (Poverty) is measured as a ratio of family income to poverty guidelines. Smaller num-
bers indicate more poverty, and ratios of 5 or larger were recorded as 5. Education (Education)
is reported for individuals ages 20 years or older and indicates the highest level of education
achieved: either 8th Grade, 9 - 11th Grade, High School, Some College, or College Grad. The
variable HomeOwn records whether a participant rents or owns their home; the levels
are Own, Rent, and Other.
a) Create a plot showing the association between poverty and educational level. Describe
what you see.
b) Fit a linear model to predict poverty from educational level.
i. Interpret the model coefficients and associated p-values.
ii. Assess whether educational level, overall, is associated with poverty. Be sure to include
any relevant numerical evidence as part of your answer.
c) Create a plot showing the association between poverty and home ownership. Based on what
you see, speculate briefly about the home ownership status of individuals
d) Fit a linear model to predict poverty from educational level and home ownership. Comment
on whether this model is an improvement from the model in part b).
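For a categorical predictor with more than two levels, the individual coefficient p-values compare each level with the reference level, while the overall F-test assesses the predictor as a whole. The sketch below illustrates both, along with one common way to compare nested models. It uses simulated data; the names ratio, educ, home, and sim and the generating values are assumptions, not NHANES results.

```r
# Simulated stand-in data (an assumption for illustration; NOT the NHANES sample)
set.seed(4)
n <- 500
educ.levels <- c("8th Grade", "9 - 11th Grade", "High School",
                 "Some College", "College Grad")
educ <- factor(sample(educ.levels, n, replace = TRUE), levels = educ.levels)
home <- factor(sample(c("Own", "Rent", "Other"), n, replace = TRUE,
                      prob = c(0.60, 0.35, 0.05)),
               levels = c("Own", "Rent", "Other"))

# Response truncated to the interval [0, 5], mimicking a ratio capped at 5
ratio <- pmin(5, pmax(0, 1.5 + 0.4 * (as.integer(educ) - 1) +
                           0.6 * (home == "Own") + rnorm(n)))
sim <- data.frame(ratio, educ, home)

boxplot(ratio ~ educ, data = sim, las = 2, xlab = "")

# Each non-intercept row compares one level with the reference level
fit.educ  <- lm(ratio ~ educ, data = sim)
coef.tab  <- summary(fit.educ)$coefficients

# Overall F-test: is the predictor, taken as a whole, associated with the response?
f         <- summary(fit.educ)$fstatistic
p.overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Add a second categorical predictor and compare the nested models
fit.both <- lm(ratio ~ educ + home, data = sim)
adj.r2   <- c(educ.only = summary(fit.educ)$adj.r.squared,
              educ.home = summary(fit.both)$adj.r.squared)
nested   <- anova(fit.educ, fit.both)
```

Adjusted R² and the nested-model F-test from anova() are two of several reasonable ways to judge whether the larger model is an improvement.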
Do men and women think differently about their body weight? To address this question, you will
be using data from the Behavioral Risk Factor Surveillance System (BRFSS).
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000
people in the United States collected by the Centers for Disease Control and Prevention (CDC). As
its name implies, the BRFSS is designed to identify risk factors in the adult population and report
emerging health trends. For example, respondents are asked about diet and weekly physical
activity, HIV/AIDS status, possible tobacco use, and level of healthcare coverage.
The cdc.sample dataset contains data on 500 individuals from a random sample of 20,000
respondents to the BRFSS survey conducted in 2000, on the following nine variables:
- genhlth: general health status, with categories excellent, very good, good, fair, and poor
- exerany: recorded as 1 if the respondent exercised in the past month and 0 otherwise
- hlthplan: recorded as 1 if the respondent has some form of health coverage and 0 otherwise
- smoke100: recorded as 1 if the respondent has smoked at least 100 cigarettes in their entire
life and 0 otherwise
- height: height in inches
- weight: weight in pounds
- wtdesire: desired weight in pounds
- age: age in years
- gender: gender, recorded as m for male and f for female
a) Create a variable called wt.discr that is a measure of the discrepancy between an
individual's desired weight and their actual weight, expressed as a proportion of their actual
weight:
weight discrepancy = (actual weight - desired weight) / actual weight
b) Fit a linear model to predict weight discrepancy from age and gender. Interpret the slope
coefficients in the model.
c) Investigate whether the association between weight discrepancy and age is different for
males versus females.
i. Fit a linear model to predict weight discrepancy from age, gender, and the interaction
between age and gender. Write the model equation.
ii. Write the prediction equation for males and the prediction equation for females.
iii. Is there statistically significant evidence of an interaction between age and gender?
Explain your answer.
d) Comment on whether the results from part c) suggest that men and women think differently
about their body weight. Do you find the results surprising; why or why not? Limit your
response to at most five sentences.
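The steps in parts a) and c) can be sketched as follows. This is a hedged illustration on simulated data that borrows the variable names from the list above (weight, wtdesire, age, gender); the simulated values and the helper names eq.f and eq.m are assumptions and say nothing about the actual cdc.sample results.

```r
# Simulated stand-in data (an assumption for illustration; NOT cdc.sample)
set.seed(5)
n <- 500
sim <- data.frame(
  age    = sample(18:85, n, replace = TRUE),
  gender = factor(sample(c("f", "m"), n, replace = TRUE)),
  weight = pmax(90, round(rnorm(n, 170, 35)))
)
sim$wtdesire <- round(sim$weight * runif(n, 0.80, 1.05))

# Discrepancy as a proportion of actual weight
sim$wt.discr <- (sim$weight - sim$wtdesire) / sim$weight

# Model with main effects and an age-by-gender interaction
fit.int <- lm(wt.discr ~ age * gender, data = sim)
b <- coef(fit.int)   # (Intercept), age, genderm, age:genderm

# Prediction equation for the reference group (f)
eq.f <- c(intercept = b[["(Intercept)"]],
          slope     = b[["age"]])

# Prediction equation for the other group (m): add the gender terms
eq.m <- c(intercept = b[["(Intercept)"]] + b[["genderm"]],
          slope     = b[["age"]] + b[["age:genderm"]])

# The interaction row gives the test of whether the two slopes differ
int.row <- summary(fit.int)$coefficients["age:genderm", ]
```

Because gender is stored as a factor with levels f and m, R treats f as the reference level, which is why the male equation adds the genderm and age:genderm terms.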
The American Psychological Association defines resilience as "the process of adapting well in the
face of adversity, trauma, tragedy, threats or even significant sources of stress". Resilience refers
to a person's capacity to resist adversity and is closely related to qualities such as self-confidence
and persistence. Studies have suggested that resilience is an important factor in contributing to
how medical students perceive their quality of life and educational environment.
Survey data were collected from 1,350 students across 25 medical schools in the United States as
part of a study examining the life of students and residents in healthcare professions. At each
school, 54 students were randomly selected to participate in the study. Participants completed
assessments measuring resilience, quality of life, perception of educational environment, depres-
sion symptoms, and anxiety symptoms.
- Resilience. Higher scores on the resilience assessment are indicative of greater resilience;
possible scores range from 14 to 98. The scores are reported according to a standardized
scale: very low (14 to 56 points), low (57 to 64 points), moderately low (65 to 73 points),
moderately high (74 to 81 points), high (82 to 90 points), and very high (91 to 98 points).
- Quality of Life. Quality of life was assessed via three measures: overall quality of life (overall
QoL), medical school quality of life (MSQoL), and a questionnaire from the World Health
Organization (WHOQOL). For the overall QoL and MSQoL, students were asked to rate, on
a scale from 0 to 10 with a higher score indicating better QoL, their overall quality of life
and their quality of life in medical school. The WHOQOL is a 26-question survey measuring
quality of life in four domains (environment, psychological health, social relationships, and
physical health); participant responses to questions such as "Do you have enough energy for
everyday life?" and "How well are you able to concentrate?" are converted to a 0 to 100
point score for each domain, with higher scores representing better quality of life. 1
- Educational Environment. Perception of educational environment was assessed via the
DREEM questionnaire; possible scores range from 0 to 200, with higher scores representing
a more positive perception about educational environment. Questions include "I feel I am
being well prepared for my profession" and "The atmosphere motivates me as a learner". 2
- Depression Symptoms. The BDI questionnaire was used to assess depressive symptoms. Pos-
sible scores vary from 0 to 63, with higher scores indicating either more numerous or more
severe depressive symptoms: no depressive symptoms (0 to 9 points), mild depressive symp-
toms (10 to 17 points), moderate depressive symptoms (18 to 29 points), severe depressive
symptoms (30 to 63 points).
- Anxiety Symptoms. Anxiety symptoms were assessed based on two dimensions of anxiety:
state anxiety (feelings of anxiety arising specifically when faced with a stressful event) and
trait anxiety (feelings of anxiety on a daily basis). Possible scores range from 20 to 80 points.
A score of 50 or higher for either state anxiety or trait anxiety (or both) is considered indica-
tive of an anxiety disorder.
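The score bands described above translate directly into cut points. The sketch below applies them to a handful of made-up scores; the data frame scores and its column names are hypothetical and are not the variable names used in the study dataset.

```r
# Made-up scores for illustration only (hypothetical column names)
scores <- data.frame(
  resilience.score = c(14, 56, 57, 73, 74, 90, 98),
  bdi              = c(0, 9, 10, 17, 18, 29, 63),
  state            = c(20, 49, 50, 35, 62, 49, 80),
  trait            = c(20, 49, 30, 50, 55, 49, 80)
)

# cut() uses right-closed intervals by default, so breaks at 56, 64, ...
# place a score of 56 in "very low" and a score of 57 in "low"
scores$resilience.level <- cut(
  scores$resilience.score,
  breaks = c(13, 56, 64, 73, 81, 90, 98),
  labels = c("very low", "low", "moderately low",
             "moderately high", "high", "very high")
)

# Depressive symptom severity from the BDI bands (0-9, 10-17, 18-29, 30-63)
scores$bdi.level <- cut(
  scores$bdi,
  breaks = c(-1, 9, 17, 29, 63),
  labels = c("none", "mild", "moderate", "severe")
)

# Flag a score of 50 or higher on either anxiety dimension (or both)
scores$either.50 <- scores$state >= 50 | scores$trait >= 50

n.flagged <- sum(scores$either.50)
```

The single vertical bar is the element-wise "or" operator, so a row is flagged when at least one of the two conditions holds.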
1 Participants chose from: "Not at all", "A little", "A moderate amount", "Very much", "An extreme amount".
2 Participants chose from: "Strongly agree", "Agree", "Neutral", "Disagree", "Strongly disagree".
Information was also collected on participant age and sex. Year in medical school was recorded
as current level of training. The first two years of medical school are focused on basic science
education (pre-clinical curriculum) and the last two years consist of rotations in clinical settings
(clinical curriculum). After medical school, students undergo residency training in which they
work as practicing physicians under the supervision of a senior clinician.
Data from the study are in the file resilience.Rdata. The following table provides a list of the
variables in the dataset and their descriptions.
- age: age in years
- sex: sex, coded female for female and male for male
- train: level of training, either pre-clinical, clinical, or residency
- resilience level, either VeryHigh, High, ModHigh, ModLow, Low, or VeryLow
- overall quality of life score, on 0-10 point scale
- quality of life in medical school score, on 0-10 point scale
- WHOQOL score for physical health domain, on 0-100 point scale
- WHOQOL score for psychological health domain, on 0-100 point scale
- WHOQOL score for social relationships domain, on 0-100 point scale
- WHOQOL score for environmental domain, on 0-100 point scale
- DREEM score, on 0-200 point scale
- BDI score, on 0-63 point scale
- anxiety.state: state anxiety score, on 20-80 point scale
- anxiety.trait: trait anxiety score, on 20-80 point scale
Use the data to answer the following questions.
a) Briefly summarize features of the study participants with respect to the variables age, sex,
and train. Reference appropriate graphical and numerical summaries as needed.
b) Participants were asked to rate their overall quality of life and their medical school quality
of life, each on a 0-10 point scale.
i. Create a plot illustrating the difference between perception of overall QoL and percep-
tion of MSQoL. Describe what you see.
ii. Conduct a formal statistical comparison of overall QoL score and MSQoL score. Sum-
marize your findings.
c) Investigate the relationship between resilience and level of training.
i. Prior to conducting any analysis, comment briefly on whether you think there may or
may not be an association between resilience and level of training, and explain your
reasoning. Limit your answer to at most five sentences.
ii. Formally assess whether there is evidence of an association between resilience and level
of training. Summarize your findings.
d) Investigate the relationship between resilience and severity of depressive symptoms.
i. Create a plot illustrating the relationship between resilience and depressive symptoms.
Describe what you see.
ii. Conduct a formal analysis of the relationship between resilience and depressive symp-
toms. Summarize your findings. You may proceed with the analysis method you choose
even if the assumptions do not seem to be reasonably satisfied; i.e., it is not necessary
to check assumptions for this sub-question.
e) Investigate the association between resilience and quality of life as measured by the psycho-
logical health domain of the WHOQOL.
i. Without adjusting for any potential confounders, fit a model estimating the association
between resilience and WHOQOL score in the psychological health domain. Describe
the nature of the association.
ii. Is there evidence that resilience overall is a useful variable for predicting WHOQOL
score in the psychological health domain? Explain your answer.
iii. Report and interpret the model R² for the model fit in part i.
iv. After adjusting for the potential confounders of age, sex, training level, and BDI score,
would you describe the apparent association from part i. any differently? Explain your
answer.
v. Evaluate the assumptions behind the analysis from part iv.
vi. Calculate the predicted mean psychological health WHOQOL score for a 21-year-old
female with moderately high resilience who scored 4.00 points on the BDI and is in her
third year of medical school.
f) Anxiety scores were reported separately for state anxiety and trait anxiety.
i. Create a new variable, anxiety.disorder, that records whether an individual
qualifies as having an anxiety disorder, based on the values of anxiety.state
and anxiety.trait. Briefly explain the logic behind the code you use to create
the variable.
ii. Report the number of individuals that qualify as having an anxiety disorder.
g) Fit a model estimating the association between resilience and perception of educational en-
vironment, after adjusting for age, sex, training level, BDI score, and presence of an anxiety
disorder. In no more than three sentences, summarize the main finding(s).
h) A New York Times reporter is potentially interested in writing a piece about the research
you have conducted. They have requested that you prepare a short statement, no more
than ten sentences long. Selectively drawing from the results of your analyses, summarize
the main conclusions about the relationship between resilience and perceptions of quality
of life and educational environment for medical students and residents. Be sure to use
language accessible to a general audience. You do not need to reference specific numerical
results/models from the analysis, but you may choose to do so if you like.
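Two of the formal comparisons requested in this problem involve (1) two ratings given by the same participants and (2) two categorical variables. One common choice for each is a paired t-test and a chi-squared test of independence, sketched below on simulated data. The names overall, school, level, and stage and all simulated values are assumptions for illustration; other methods may also be defensible for the real data.

```r
# Simulated stand-in data (an assumption for illustration; NOT the study data)
set.seed(7)
n <- 300

# Two 0-10 ratings from the same participants
overall <- pmin(10, pmax(0, round(rnorm(n, 7, 1.5))))
school  <- pmin(10, pmax(0, round(overall - 0.5 + rnorm(n, 0, 1.5))))

# Paired data are analyzed through the within-person differences
d <- overall - school
hist(d, main = "", xlab = "Overall rating minus school rating")

paired     <- t.test(overall, school, paired = TRUE)
one.sample <- t.test(d)   # same test, written as a one-sample test on d

# Two categorical variables: contingency table and chi-squared test
level <- factor(sample(c("Low", "Moderate", "High"), n, replace = TRUE))
stage <- factor(sample(c("pre-clinical", "clinical", "residency"),
                       n, replace = TRUE))
tab <- table(level, stage)
chi <- chisq.test(tab)

# Expected counts should be checked before relying on the chi-squared result
expected <- chi$expected
```

The paired test and the one-sample test on the differences return the same statistic, which is a useful check that the data were treated as paired rather than as two independent groups.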