4/5/2022
Data from the openintro package: elmhurst
lm(gift_aid ~ family_income, data = elmhurst)
##   (Intercept)  family_income
##       24.3193        -0.0431
For each additional $1,000 of family income, we would expect gift aid to change by 1,000 * (-0.0431) = -$43.10 on average, i.e., students receive $43.10 less in gift aid, on average.
…exactly $43.10 for all students at this school?!
Statistical inference is the process of using sample data to draw conclusions about the underlying population the sample came from.
Use data from samples to calculate sample statistics, which can then be used as estimates for population parameters.
A plausible range of values for the population parameter is a confidence interval.
We can quantify the variability of sample statistics using mathematical theory, e.g., the standard errors in the model summary:
summary(lm(gift_aid ~ family_income, data = elmhurst))$coef
##                  Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)   24.31932901  1.29145027 18.831022 8.281020e-24
## family_income -0.04307165  0.01080947 -3.984621 2.288734e-04
or using simulation (bootstrapping):
Generated assuming there are more students like the ones in the observed sample…
Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap samples
Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution
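The four steps above can be sketched from scratch in base R. This is a minimal illustration, not the slides' code; the function name boot_slope_ci and the toy data frame dat are made up for the example.

```r
# A from-scratch sketch of steps (1)-(4) for a regression slope (base R).
# boot_slope_ci() and `dat` are illustrative names, not from the slides.
boot_slope_ci <- function(data, reps = 1000, level = 0.95) {
  n <- nrow(data)
  boot_slopes <- replicate(reps, {
    # (1) take a bootstrap sample: resample rows with replacement,
    #     same size as the original sample
    idx <- sample(n, size = n, replace = TRUE)
    # (2) calculate the bootstrap statistic: here, the fitted slope
    coef(lm(y ~ x, data = data[idx, ]))[2]
  })
  # (3) `boot_slopes` is the bootstrap distribution;
  # (4) its middle `level` proportion is the percentile interval
  alpha <- 1 - level
  quantile(boot_slopes, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(1234)
# toy data with a true slope of 2, just to exercise the function
dat <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))
boot_slope_ci(dat, reps = 1000)
```

With real data you would replace dat, y, and x with your own data frame and variables (e.g., the gift_aid ~ family_income model above).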
we could keep going…
## # A tibble: 2 × 6
##   term           .lower .estimate  .upper .alpha .method
##   <chr>           <dbl>     <dbl>   <dbl>  <dbl> <chr>
## 1 (Intercept)   21.8      24.4    26.8      0.05 percentile
## 2 family_income -0.0695   -0.0445 -0.0219   0.05 percentile
We are 95% confident that for each additional $1,000 of family income, students receive between $21.90 and $69.50 less in gift aid, on average.
# packages used: tidymodels (loads rsample, broom, purrr, dplyr)
# and openintro (elmhurst data)
library(tidymodels)
library(openintro)

# set a seed
set.seed(1234)

# take 1000 bootstrap samples
elmhurst_boot <- bootstraps(elmhurst, times = 1000)

# for each sample:
#   fit a model and save output in model column
#   tidy model output and save in coef_info column
elmhurst_models <- elmhurst_boot %>%
  mutate(
    model = map(splits, ~ lm(gift_aid ~ family_income, data = .)),
    coef_info = map(model, tidy)
  )
# unnest coef_info (for intercept and slope)
elmhurst_coefs <- elmhurst_models %>%
  unnest(coef_info)

# calculate 95% (default) percentile interval
int_pctl(elmhurst_models, coef_info)
Write a function to create a 95% confidence interval for a linear regression slope using bootstrapping from scratch.
Discuss and visualize how you could generate a bootstrapped estimate of sampling variability for…
“Random” sample of 10 2-bedroom apartments for rent in Bozeman, MT in April 2022:
apts <- data.frame(price = c(2349, 3500, 2650, 1250, 1700, 2075, 2000, 1900, 2849, 1275))
(Data from apartments.com… Can we assume this is a random sample? From what population?)
Use bootstrapping to estimate the sampling variability of the sample mean.
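One way to approach this, sketched in base R (variable names boot_means etc. are illustrative): resample the ten prices with replacement many times, and summarize the spread of the resulting bootstrap means.

```r
# Bootstrap the sample mean for the apts data above (a sketch in base R).
set.seed(1234)
apts <- data.frame(price = c(2349, 3500, 2650, 1250, 1700, 2075,
                             2000, 1900, 2849, 1275))

boot_means <- replicate(5000, {
  # resample the 10 prices with replacement and record the mean
  mean(sample(apts$price, size = nrow(apts), replace = TRUE))
})

sd(boot_means)                        # bootstrap estimate of the SE of the mean
quantile(boot_means, c(0.025, 0.975)) # 95% percentile interval
```

A histogram of boot_means (e.g., hist(boot_means)) visualizes the bootstrap distribution.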
How could we use bootstrapping to calculate a p-value for a test of
\(H_0: \mu = 2000\)
\(H_a: \mu > 2000\),
where \(\mu\) is the “true” mean rent for 2-bedroom apartments in Bozeman.
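One common approach, sketched below as an assumption rather than the slides' intended solution: shift the data so that \(H_0\) holds exactly, bootstrap under that null, and compute the proportion of bootstrap means at least as large as the observed mean.

```r
# Sketch of a bootstrap p-value via a null-shifted resample (base R).
# This is one of several valid choices, not necessarily the slides' method.
set.seed(1234)
prices <- c(2349, 3500, 2650, 1250, 1700, 2075, 2000, 1900, 2849, 1275)
obs_mean <- mean(prices)

# shift the sample so its mean is exactly the null value, 2000
null_prices <- prices - obs_mean + 2000

null_means <- replicate(10000,
  mean(sample(null_prices, length(null_prices), replace = TRUE))
)

# one-sided p-value for H_a: mu > 2000
p_value <- mean(null_means >= obs_mean)
```

For the median, the same shift-and-resample idea applies with median() in place of mean(), though the shift is usually done so the null median holds.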
What if we wanted to test the median rather than the mean?
A 95% confidence interval for the mean rent of two bedroom apartments in Bozeman was calculated as ($1650.70, $2658.90). Which of the following is a correct interpretation of this interval?
What is the difference between a bootstrap distribution and a sampling distribution?
Where would you expect a sampling distribution to be centered?
Where would you expect a bootstrap distribution to be centered?
How should the variability in a sampling distribution of, say, a sample mean, change as the sample size increases? Why?