STAT 408: Week 11

4/5/2022

Motivation

Data

Data from the openintro package: elmhurst.

Family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois, USA
Gift aid is financial aid that does not need to be paid back, as opposed to a loan

Linear model

Interpreting the slope

lm(gift_aid ~ family_income, data = elmhurst)

##   (Intercept) family_income 
##       24.3193       -0.0431

For each additional $1,000 of family income, we would expect students to receive, on average, 1,000 * (-0.0431) = -$43.10 in gift aid, i.e., $43.10 less. (The factor of 1,000 converts the slope from thousands of dollars to dollars.)
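The arithmetic behind this interpretation can be checked with a minimal sketch. It assumes, as the interpretation above implies, that both variables are recorded in thousands of dollars. The helper `predict_aid()` is a hypothetical name introduced here, and the coefficients are copied from the output above rather than re-estimated.

```r
# Coefficients copied from the fitted model output above
# (assumption: both variables are recorded in thousands of dollars)
intercept <- 24.3193
slope <- -0.0431

# Hypothetical helper: predicted gift aid (in $1000s) for a family income (in $1000s)
predict_aid <- function(income_k) intercept + slope * income_k

# Two students whose family incomes differ by $1,000 (one unit of the predictor)
diff_k <- predict_aid(51) - predict_aid(50)

# Convert the difference from thousands of dollars to dollars
diff_dollars <- diff_k * 1000
round(diff_dollars, 2)
```

The difference does not depend on the starting income (50 here), because the model is linear.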

…exactly $43.10 for all students at this school?!

Inference

Statistical inference

… is the process of using sample data to make conclusions about the underlying population the sample came from.

Estimation

Use data from samples to calculate sample statistics, which can then be used as estimates for population parameters.

If you want to catch a fish, do you prefer a spear or a net?…

If you want to estimate a population parameter, do you prefer to report a range of values the parameter might be in, or a single value?

If we report a point estimate, we probably won’t hit the exact population parameter.
If we report a range of plausible values we have a good shot at capturing the parameter.

Confidence intervals

A plausible range of values for the population parameter is a confidence interval.

In order to construct a confidence interval we need to quantify the variability of our sample statistic.

For example, if we want to construct a confidence interval for a population slope, we need to come up with a plausible range of values around our observed sample slope.

This range will depend on how precise and how accurate our sample statistic is as an estimate of the population parameter.

Quantifying this requires a measurement of how much we would expect the sample statistic to vary from sample to sample, which is called sampling variability.
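To make sampling variability concrete, here is a minimal sketch that repeatedly samples from a simulated population and refits the slope each time. The population, its size, and the coefficients used to generate it are all assumptions made for illustration. They are not the Elmhurst data, and in practice we never get to see the whole population.

```r
set.seed(408)

# Hypothetical population of 10,000 students (simulated, not real data)
N <- 10000
pop_income <- runif(N, min = 0, max = 250)              # family income, $1000s
pop_aid <- 24 - 0.04 * pop_income + rnorm(N, sd = 5)    # gift aid, $1000s
population <- data.frame(income = pop_income, aid = pop_aid)

# Draw one random sample of size n (without replacement) and return its fitted slope
sample_slope <- function(n) {
  rows <- sample(N, size = n)
  coef(lm(aid ~ income, data = population[rows, ]))[["income"]]
}

# Repeat for 1,000 samples of 50 students each
slopes <- replicate(1000, sample_slope(50))

# The spread of the slopes across samples is the sampling variability
sd(slopes)
range(slopes)
```

Each sample gives a somewhat different slope, and `sd(slopes)` summarizes how much they differ.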

Sampling variability

Quantifying the variability of sample statistics

We can quantify the variability of sample statistics using

theory: via the Central Limit Theorem

summary(lm(gift_aid ~ family_income, data = elmhurst))$coef

##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)   24.31932901 1.29145027 18.831022 8.281020e-24
## family_income -0.04307165 0.01080947 -3.984621 2.288734e-04
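As a sketch of how the theory-based output can be used, the standard error above leads to a t-based 95% confidence interval for the slope. The numbers are copied from the output. The degrees of freedom (n - 2 = 48) assume that all fifty students in the sample were used in the fit.

```r
# Values copied from the summary() output above
estimate <- -0.04307165
std_error <- 0.01080947

# Assumption: all 50 sampled students were used, so residual df = n - 2
n <- 50
df <- n - 2

# Test statistic and two-sided p-value for H0: slope = 0
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df)

# 95% confidence interval: estimate +/- t* x SE
t_star <- qt(0.975, df = df)
ci <- estimate + c(-1, 1) * t_star * std_error

t_value
p_value
ci
```

With a fitted model object, `confint()` gives the same kind of interval directly.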

simulation: via bootstrapping…

Bootstrapping

“pulling oneself up by one’s bootstraps”: accomplishing an impossible task without any outside help
Impossible task: estimating a population parameter using data from only the given sample
Note: The notion of saying something about a population parameter using only information from an observed sample is the crux of statistical inference

Observed sample

Bootstrapped “population”

Generated assuming there are more students like the ones in the observed sample…

Bootstrapping scheme

1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample
3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution
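The scheme above can be sketched in base R. This is a minimal illustration, not the only way to implement it. The observed sample is simulated for the example, and the median is used as the statistic simply because it is easy to compute.

```r
set.seed(408)

# Hypothetical observed sample (simulated for illustration only)
observed <- rexp(30, rate = 1 / 20)

n_boot <- 2000
boot_stats <- numeric(n_boot)

for (i in seq_len(n_boot)) {
  # Step 1: resample with replacement, same size as the original sample
  boot_sample <- sample(observed, size = length(observed), replace = TRUE)
  # Step 2: compute the bootstrap statistic (here, the median)
  boot_stats[i] <- median(boot_sample)
}
# Step 3: after the loop, boot_stats holds the bootstrap distribution

# Step 4: the middle 95% of the bootstrap distribution
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

Swapping `median()` for another function changes the statistic without changing the scheme.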

Bootstrap samples 1–4 (figures omitted), shown one at a time and then together.

We could keep going, for many, many samples…

Slopes of bootstrap samples

95% confidence interval

Interpreting the slope, take two

## # A tibble: 2 × 6
##   term           .lower .estimate  .upper .alpha .method   
##   <chr>           <dbl>     <dbl>   <dbl>  <dbl> <chr>     
## 1 (Intercept)   21.8      24.4    26.8      0.05 percentile
## 2 family_income -0.0695   -0.0445 -0.0219   0.05 percentile

We are 95% confident that for each additional $1,000 of family income, students receive between $21.90 and $69.50 less in gift aid, on average.

Code using tidymodels package

# load packages: tidymodels (rsample, purrr, broom, tidyr) and openintro (elmhurst data)
library(tidymodels)
library(openintro)

# set a seed
set.seed(1234)

# take 1000 bootstrap samples
elmhurst_boot <- bootstraps(elmhurst, times = 1000)

# for each sample
# fit a model to the resampled data and save output in model column
# tidy model output and save in coef_info column
elmhurst_models <- elmhurst_boot %>%
  mutate(
    model = map(splits, ~ lm(gift_aid ~ family_income, data = analysis(.x))),
    coef_info = map(model, tidy)
  )

Code using tidymodels package (continued)

# unnest coef_info (for intercept and slope)
elmhurst_coefs <- elmhurst_models %>%
  unnest(coef_info)

# calculate 95% (default) percentile interval
int_pctl(elmhurst_models, coef_info)

Demo/Exercise

Write a function to create a 95% confidence interval for a linear regression slope using bootstrapping from scratch.
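One possible sketch of such a function is shown below, as a starting point rather than a reference solution. The name `boot_slope_ci()` and its arguments are hypothetical, and the data are simulated stand-ins so that the block runs without any packages. The function resamples whole rows (cases), which is one of several ways to bootstrap a regression.

```r
# Hypothetical helper: percentile bootstrap CI for a simple regression slope.
# `formula` is y ~ x with a single numeric predictor; `data` is a data frame.
boot_slope_ci <- function(formula, data, n_boot = 1000, level = 0.95) {
  n <- nrow(data)
  slopes <- replicate(n_boot, {
    # Resample rows with replacement, keeping each (x, y) pair together
    rows <- sample(n, size = n, replace = TRUE)
    # The second coefficient is the slope
    coef(lm(formula, data = data[rows, , drop = FALSE]))[[2]]
  })
  alpha <- 1 - level
  quantile(slopes, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(408)

# Simulated stand-in data (not the Elmhurst sample)
sim <- data.frame(x = runif(50, min = 0, max = 250))
sim$y <- 24 - 0.04 * sim$x + rnorm(50, sd = 5)

ci <- boot_slope_ci(y ~ x, data = sim)
ci
```

With the openintro package loaded, the analogous call would be `boot_slope_ci(gift_aid ~ family_income, data = elmhurst)`.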

Discussion

Discuss and visualize how you could generate a bootstrapped estimate of sampling variability for…

one mean
one proportion
difference in means
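As one way to start the discussion, the sketch below shows how the resampling step differs across the three cases. All data are simulated for illustration, and `resample()` is a hypothetical helper. For a difference in means, each group is resampled separately so that the group sizes stay fixed.

```r
set.seed(408)
n_boot <- 2000

# Simulated stand-in data (hypothetical, for illustration only)
x <- rnorm(40, mean = 10, sd = 3)             # one numeric variable
success <- rbinom(40, size = 1, prob = 0.3)   # one binary variable (1 = success)
g1 <- rnorm(25, mean = 10, sd = 3)            # group 1
g2 <- rnorm(30, mean = 12, sd = 3)            # group 2

# Hypothetical helper: resample a vector with replacement, same size
resample <- function(v) sample(v, size = length(v), replace = TRUE)

# One mean and one proportion: resample the single sample
boot_mean <- replicate(n_boot, mean(resample(x)))
boot_prop <- replicate(n_boot, mean(resample(success)))

# Difference in means: resample each group separately, then subtract
boot_diff <- replicate(n_boot, mean(resample(g1)) - mean(resample(g2)))

# Bootstrap estimates of the standard errors
c(mean = sd(boot_mean), prop = sd(boot_prop), diff = sd(boot_diff))
```

A histogram of any of these vectors, for example `hist(boot_diff)`, is a simple way to visualize the bootstrap distribution.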

Demo/Exercise

“Random” sample of ten 2-bedroom apartments for rent in Bozeman, MT in April 2022:

apts <- data.frame(price = c(2349, 3500, 2650, 1250, 1700,
                             2075, 2000, 1900, 2849, 1275))

(Data from apartments.com. Can we assume this is a random sample? From what population?)

Use bootstrapping to estimate the sampling variability of the sample mean.
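A minimal base R sketch of one approach follows. The number of bootstrap samples (5,000) and the seed are arbitrary choices, and the data frame is repeated so that the block runs on its own.

```r
set.seed(408)

apts <- data.frame(price = c(2349, 3500, 2650, 1250, 1700,
                             2075, 2000, 1900, 2849, 1275))
n <- nrow(apts)

# Bootstrap distribution of the sample mean
boot_means <- replicate(5000, mean(sample(apts$price, size = n, replace = TRUE)))

# Observed sample mean
obs_mean <- mean(apts$price)

# Bootstrap estimate of the standard error of the sample mean
boot_se <- sd(boot_means)

# 95% percentile interval for the mean rent
ci <- quantile(boot_means, probs = c(0.025, 0.975))

obs_mean
boot_se
ci
```

The standard deviation of the bootstrap means is the bootstrap estimate of sampling variability. It can be compared with the theory-based standard error, `sd(apts$price) / sqrt(n)`.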

Extension

How could we use bootstrapping to calculate a p-value for a test of

$H_0: \mu = 2000$

$H_a: \mu > 2000$,

where $\mu$ is the “true” mean rent for 2-bedroom apartments in Bozeman.

What if we wanted to test the median rather than the mean?
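One common approach, sketched here under stated assumptions, is to shift the sample so that the null hypothesis holds, bootstrap from the shifted data, and count how often the bootstrap statistic is at least as large as the observed one. The helper `boot_p_value()` is a hypothetical name. Passing `median` instead of `mean` gives the analogous test for the median, with the null value then read as a hypothesized median.

```r
set.seed(408)

price <- c(2349, 3500, 2650, 1250, 1700, 2075, 2000, 1900, 2849, 1275)
mu0 <- 2000

# Hypothetical helper: one-sided ("greater") bootstrap p-value for a location statistic
boot_p_value <- function(x, null_value, stat = mean, n_boot = 5000) {
  observed <- stat(x)
  # Shift the data so that the statistic of the shifted sample equals the null value
  shifted <- x - observed + null_value
  # Bootstrap distribution of the statistic when the null hypothesis is true
  null_stats <- replicate(
    n_boot,
    stat(sample(shifted, size = length(shifted), replace = TRUE))
  )
  # Proportion of null statistics at least as large as the observed statistic
  mean(null_stats >= observed)
}

p_mean <- boot_p_value(price, mu0, stat = mean)
p_median <- boot_p_value(price, mu0, stat = median)

p_mean
p_median
```

Shifting keeps the shape and spread of the observed sample while moving its center to the hypothesized value.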

Interpreting (frequentist) confidence intervals

A 95% confidence interval for the mean rent of two-bedroom apartments in Bozeman was calculated as ($1650.70, $2658.90). Which of the following is a correct interpretation of this interval?

95% of the time, the mean rent of two-bedroom apartments in a sample of 10 rentals is between $1650.70 and $2658.90.
95% of all two-bedroom apartments in Bozeman have rents between $1650.70 and $2658.90.
We are 95% confident that the mean rent of all two-bedroom apartments in Bozeman is between $1650.70 and $2658.90.
We are 95% confident that the mean rent of two-bedroom apartments in this sample is between $1650.70 and $2658.90.
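A short simulation can help when weighing these options. The "95%" describes how often the interval procedure captures the parameter over repeated sampling. The sketch below assumes a hypothetical normal population with a known mean, which we would never have in practice, and uses the t-interval from `t.test()`.

```r
set.seed(408)

# Hypothetical population (assumed for this demo): normal rents with a known mean
true_mean <- 2000
true_sd <- 700
n <- 10
n_sims <- 5000

# For each simulated sample, record whether its 95% t-interval captures the true mean
covers <- replicate(n_sims, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that capture the true mean
coverage <- mean(covers)
coverage
```

Any single interval either contains the true mean or does not. The confidence level is a property of the method across many samples.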

Reflection

What is the difference between a bootstrap distribution and a sampling distribution?
Where would you expect a sampling distribution to be centered?
Where would you expect a bootstrap distribution to be centered?
How should the variability in a sampling distribution of, say, a sample mean, change as the sample size increases? Why?
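For the last question, a small simulation can be used to check an answer. The sketch assumes a hypothetical normal population with standard deviation 700 and compares the simulated standard deviation of the sample mean with the theoretical value, $\sigma / \sqrt{n}$.

```r
set.seed(408)

# Hypothetical population standard deviation (assumed for this demo)
sigma <- 700
sample_sizes <- c(10, 40, 160)

# For each sample size, simulate 4,000 sample means and take their standard deviation
sim_se <- sapply(sample_sizes, function(n) {
  sd(replicate(4000, mean(rnorm(n, mean = 2000, sd = sigma))))
})

# Theoretical standard error of the sample mean
theory_se <- sigma / sqrt(sample_sizes)

data.frame(n = sample_sizes, simulated_se = sim_se, theory_se = theory_se)
```

The sample sizes are chosen so that each is four times the previous one, which makes the pattern in the standard errors easy to see.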