Lab 09 - Predictive modeling

Why do 30% of patients miss their scheduled appointments? What variables could help us predict whether a patient will miss their appointment? In this lab, you will build a model to predict whether a patient will miss a scheduled doctor’s appointment. Extra credit to the team with the lowest classification error!

Learning goals

Getting started

Since your lab submission will involve the use of LaTeX code for mathematical equations, this lab’s .Rmd file should be knit to a .pdf file.

Each member of the team should:

Warm up

Let’s warm up with some simple exercises.

Packages

To fit our logistic regression models, we’ll use the glm (generalized linear model) function in base R. You may want to use the tidyverse package for data wrangling and visualization.

library(tidyverse) 

Data

For this lab we will be using a subset of a dataset collected in Brazil relating to whether patients miss a scheduled medical appointment. More information about the dataset is available at: https://www.kaggle.com/joniarroba/noshowappointments.

Exercises

  1. Look at the Data Dictionary and dataset description at https://www.kaggle.com/joniarroba/noshowappointments. Before fitting a model, discuss which variables you think might be related to missing a medical appointment. Specifically, what relationship do you expect between these variables and missing a medical appointment, and why?

Why shouldn’t you evaluate a predictive model using the same data set on which it was fit?

The overall goal is to find a model that accurately predicts whether a patient will miss a scheduled medical appointment based on the variables in the data set. Since our goal is prediction, we will fit the model on a training set, then evaluate the model on a testing set. These have been created for you, and can be read in using the following code. Note that we recode our outcome variable from Yes/No to 1/0 in order to fit the predictive models below.

# Read in Data
train.appts <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTrain.csv')
test.appts <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/MedApptsTest.csv')

# Recode outcome variable to 
# 1 = Missed appointment, 0 = Did not miss appointment
train.appts$No.show <- as.numeric(train.appts$No.show == "Yes")
test.appts$No.show <- as.numeric(test.appts$No.show == "Yes")
  1. In predictive modeling, a fair amount of data wrangling is necessary to create relevant variables to use for prediction. Examples in this setting include: appointment delay (time between scheduling having appointment), whether the user has missed an appointment before (the user ID has been removed from the data base), the day of week of the appointment. You don’t need to do this, but explain how you would create a variable that had the day of week for the appointment.

  2. What proportion of the training data set were no-shows? That is, what proportion of the observations in the train.appts data frame have a value of 1 for the No.show variable?

For our models, we will use classification error and another criteria called log-loss. Classification error is the proportion of observations that are mis-classified by the model. Log-loss is useful to evaluate a case when a prediction is made as a probability (prob missed appointment = .3) and the result is a binary outcome (missed appointment or not). The goal is to have a lower value for log-loss. Mathematically log-loss can be written as,

\[ \mbox{log-loss}=−y \times log(p)−(1−y)\times log(1−p) \]

where \(y\) is the observed outcome (0 = “failure”, 1 = “success”), and \(p\) is the predicted probability of a “success” (which means that \(1-p\) is the predicted probability of a “failure”). In a dataset, we calculate this loss value for each observation, and then take the mean:

\[ \frac{1}{n}\sum_{i=1}^n \left[-y_i \log(\hat{y}_i) - (1-y_i)\log(1 - \hat{y}_i)\right], \] where \(y_i\) and \(\hat{y}_i\) are the observed outcome and predicted probability for the \(i\)th observation, respectively.

Your final model should show an improved classification error and log-loss over the baseline model below. The baseline model uses the observed proportion of no-shows in the data set as the predicted probability of a no-show for every individual.

# Fit baseline logistic regression
glm.baseline <- glm(No.show ~ 1, family = binomial, 
                    data = train.appts)
  1. The predict function will report predicted probabilities of a no-show when we use the argument type = "response". Run the code below to find the predicted probabilities of a no-show for the first few observations in the training data set. Do these values match the value you found in Exercise 3? (Hint: They should!)
head(predict(glm.baseline, type = "response"))

In order to calculate classification error, we need to convert the predicted probability to a predicted Yes/No outcome by choosing a cutoff value \(c\): the model will predict the patient misses their appointment if their predicted probability is greater than \(c\). A logical choice for \(c\) is 0.5, but this cutoff won’t work well if the sample proportion is far from 0.5. For example, if only 10% of people miss appointments, a good model should predict missed appointments 10% of the time. Thus, often the sample proportion is used for \(c\).

Now, let’s calculate the classification error and log-loss for the baseline model.

# Observed outcome in test dataset
y <- test.appts$No.show

# Calculate vector of predicted probabilities for test dataset
pred_probs <- predict(glm.baseline, test.appts, type = "response")

# Calculate classification error
CE <- mean(
  (y == 1) != (pred_probs > 0.215)
  )

# Calculate log-loss
log.loss <- mean(
  -as.numeric(y == 1) * log(pred_probs) - 
  as.numeric(y == 0) * log(1 - pred_probs)
  )
  1. Examine the code above. Explain how this code calculates classification error and log-loss. That is, explain what mean((y == 1) != (pred_probs > 0.215)) and mean(-as.numeric(y == 1) * log(pred_probs) - as.numeric(y == 0) * log(1 - pred_probs)) are doing.

  2. What are the values of the classification error and log-loss for the baseline model?

Mathematical background

Logistic regression models are appropriate for modeling a binary response variable. Generically, we refer to the two outcomes of a binary variable as “success” and “failure”, where a “success” is coded as 1, and a “failure” as 0. A logistic regression model is a special case of a generalized linear model.

Technically, the term “linear model” comes from the fact that the linear predictor is linear in the coefficients \(\beta_0, \beta_1,\ldots, \beta_k\), not the predictors \(x_1,\ldots, x_k\). That is why a model such as \(\mu = \beta_0 + \beta_1 x + \beta_2 x^2\) is still considered a “linear model”.

In a linear model, we assume the response variable \(Y\), has a normal distribution with mean \(\mu = E(Y)\) and variance \(\sigma^2\). We then model the mean response as a linear function of \(k\) predictor variables \(x_1, \ldots, x_k\):

\[ \mu = \beta_0 + \beta_1 x_1 + \cdots \beta_k x_k. \]

In a generalized linear model, \(Y\) can take on any probability distribution (Binomial or Poisson are common), and instead of modeling the mean response directly, we model a function of the mean response, called a link function:

\[ g(\mu) = \beta_0 + \beta_1 x_1 + \cdots \beta_k x_k. \] When \(Y\) is binary, taking on values 0 or 1, \(E(Y) = Pr(Y = 1) = p\), and the link function is the logit link, or “log odds”:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots \beta_k x_k. \] Solving for \(p\) yields a logistic curve: \[ p = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots \beta_k x_k)} \] In the case of a single quantitative predictor variable, \(x\), this reduces to \[ p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1x)}. \]

See Figure 10.1 in Modern Data Science with R for a more typical example of a fitted logistic curve.

For example, the code below fits a logistic regression model using Age as a predictor.

glm.age <- glm(No.show ~ Age, family = binomial, 
               data = train.appts)
train.appts %>% ggplot(aes(x = Age, y = No.show)) +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "glm", 
              method.args = list(family = "binomial")) +
  labs(x = "Age (years)", y = "Probability of No Show") +
  ggtitle("Probability of No Show by Age")

summary(glm.age)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0199683 0.1040575 -9.801972 0.0000000
Age -0.0065882 0.0022343 -2.948641 0.0031917
  1. Using the model summary above, write out the fitted model equations by filling in the blanks in the LaTeX code included in your .Rmd file.

  2. Calculate the classification error and log-loss for the logistic regression model using Age as the only predictor (glm.age). Compare these values to the classification error and log-loss of the baseline model. Which model give better predictions?

  3. Use the training dataset to fit another predictive model for determining whether a patient will miss the medical appointment. Then evaluate your model by making predictions on the test dataset. Repeat this process several times, tracking the classification error and log-loss for each model, until you arrive at a “final model”. Summarize your findings and discuss the accuracy of your predictive approach. Why did you choose this model over alternatives?

Modern Data Science with R uses the tidymodels R package for fitting, evaluating, and visualizing predictive models.

  1. Extra Credit. Generate a graphical representation of the prediction accuracy of your chosen model. Write a few sentences describing what this visualization tells us about the model. Hint: See Figures 10.2 and 10.3 in Modern Data Science with R. Or come up with your own!