5/3/2022

Coming Up

This week:

  • Wednesday (May 4): Lab 9 due in GitHub (by 5pm); Homework 9 due in D2L (by 11pm)
  • Friday (May 6) by 11pm:
    • Take-home final exam (optional) released at 8:00am in D2L
    • Post comments on at least two of your classmates’ RShiny app D2L discussion posts
    • Complete D2L team evaluation survey

Coming Up

Finals week:

  • Take-home final exam (optional) due in Gradescope by 11:00pm Monday May 9
  • In-class final exam Tuesday May 10 at 12:00-1:50pm in our usual classroom

Your final exam grade will be the higher of the following scores:

  1. In-class final exam score.
  2. Weighted average of your take-home and in-class final exam scores, weighted as 40% in-class, 60% take-home.

Statistical Learning Overview

Statistical Learning

Here are a few questions to consider:

  • What does statistical learning mean to you?
  • Is statistical learning different from statistics as a whole?
  • What about terms like data science, data mining, data analytics, machine learning, and predictive analytics? How are these different from statistics and statistical learning?

Statistical Learning Definition

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

Courtesy of An Introduction to Statistical Learning: with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Note: a free e-version of this textbook can be obtained through the MSU Library.

Some Vocabulary

  • Features - what statisticians typically call “variables” or “predictors”
  • Training vs. testing data - Use training data to fit the model; use testing data to evaluate the model (e.g., goodness of fit, predictive accuracy)
  • Supervised learning - Outcome variable values (“labels”) are known for your training data, e.g., linear regression, logistic regression, classification
  • Unsupervised learning - No clear outcome variable or outcome variable values unknown for training data, e.g., clustering, dimension reduction

Predictive Modeling

Predictive Modeling

“The basic goal of predictive modeling is to find a function that accurately describes how different measured explanatory variables can be combined to make a prediction about a response variable.”

Predictive Modeling

Recall the Seattle housing data set. How would you:

  • Build a model to predict housing prices in King County?
  • Determine whether your model is good or useful?

  price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  zipcode
1350000         3       2.50         2753     65005     1.0           1    98070
 228000         3       1.00         1190      9199     1.0           0    98148
 289000         3       1.75         1260      8400     1.0           0    98148
 720000         4       2.50         3450     39683     2.0           0    98010
 247500         3       1.75         1960     15681     1.0           0    98032
 850830         3       2.50         2070     13241     1.5           0    98102
 890000         4       1.00         2550      4000     2.0           0    98109
 258000         5       2.00         2260     12500     1.0           0    98032
 440000         3       2.50         1910     66211     2.0           0    98024
 213000         2       1.00         1000     10200     1.0           0    98024

Loss functions

A loss function is a principled way to compare a set of predictive models.

Squared Error: \[ (Price_{pred} - Price_{actual}) ^ 2\]

Zero-One Loss (binary setting): \[ f(x)= \begin{cases} 1, & \text{if } y_{pred} \neq y_{actual}\\ 0, & \text{if } y_{pred} = y_{actual} \end{cases} \]
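Both loss functions are one-liners in R. A small sketch on toy vectors (not the Seattle data; the function names are our own):

```r
# Squared-error loss for a numeric prediction
squared_error <- function(pred, actual) (pred - actual)^2

# Zero-one loss for a categorical prediction
zero_one <- function(pred, actual) as.numeric(pred != actual)

price_pred   <- c(250000, 410000)
price_actual <- c(260000, 400000)
squared_error(price_pred, price_actual)   # 1e+08 1e+08

zero_one(c("sold", "sold"), c("sold", "unsold"))   # 0 1
```

Averaging a loss over many predictions (e.g., mean squared error or misclassification rate) gives a single number for comparing models.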

Model Evaluation

Suppose we fit a model using all of the Seattle housing data — can that model be used to predict prices for homes in that data set?

Model Evaluation

We cannot assess the predictive performance by fitting a model to data and then evaluating the model using the same data.
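A quick illustration of why, sketched on simulated data rather than the housing data: an overly flexible model fits its own training data closely, so in-sample error understates the error we see on new data.

```r
set.seed(1)
x <- runif(30)
train <- data.frame(x = x, y = sin(2 * pi * x) + rnorm(30, sd = 0.3))
x.new <- runif(30)
test  <- data.frame(x = x.new, y = sin(2 * pi * x.new) + rnorm(30, sd = 0.3))

# A very flexible model (degree-10 polynomial) fit to the training data
fit <- lm(y ~ poly(x, 10), data = train)

mad.train <- mean(abs(train$y - predict(fit, train)))  # error on the data used to fit
mad.test  <- mean(abs(test$y  - predict(fit, test)))   # error on fresh data
# mad.train is optimistic; mad.test reflects true predictive performance
```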

Test / Training and Cross-Validation

There are two common options to give valid estimates of model performance:

  • Test / Training approach. Generally 70% of the data is used to fit the model and the other 30% is held out for prediction.

  • Cross-Validation. Cross-validation breaks your data into k groups, or folds. A model is fit using the data in k-1 of the folds and then used to make predictions for the held-out k\(^{th}\) fold. This process repeats until each fold has been held out once.
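The k-fold procedure takes only a few lines of R. A sketch with k = 5 and a linear model, using the built-in mtcars data so it runs standalone (the variable names are our own):

```r
set.seed(1)
k <- 5
n <- nrow(mtcars)
fold <- sample(rep(1:k, length.out = n))   # randomly assign each row to a fold

cv.errors <- numeric(k)
for (i in 1:k) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[fold != i, ])    # fit on k-1 folds
  pred <- predict(fit, mtcars[fold == i, ])                # predict held-out fold
  cv.errors[i] <- mean(abs(mtcars$mpg[fold == i] - pred))  # fold-level MAD
}
mean(cv.errors)   # cross-validated estimate of prediction error
</```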

Constructing a test and training set

set.seed(11142017)
num.houses <- nrow(Seattle)
Seattle$zipcode <- as.factor(Seattle$zipcode)
# randomly select 30% of the rows for the test set
test.ids <- base::sample(1:num.houses, size = round(num.houses * 0.3))
test.set <- Seattle[test.ids, ]
# the remaining 70% form the training set
train.set <- Seattle[-test.ids, ]
dim(Seattle)
## [1] 869  14
dim(test.set)
## [1] 261  14
dim(train.set)
## [1] 608  14

Linear Regression

lm.1 <- lm(price ~ bedrooms + bathrooms + sqft_living + 
                   zipcode + waterfront, 
           data=train.set)
summary(lm.1)

Linear Regression

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + zipcode + 
##     waterfront, data = train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1155488  -127712     1666   114011  3137051 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -169817.43   53163.86  -3.194  0.00148 ** 
## bedrooms      -69444.04   15330.75  -4.530 7.14e-06 ***
## bathrooms     -19022.84   20421.92  -0.931  0.35198    
## sqft_living      399.69      16.65  24.000  < 2e-16 ***
## zipcode98014   50266.49   43462.60   1.157  0.24792    
## zipcode98024   54464.83   49042.85   1.111  0.26721    
## zipcode98032    3425.56   43273.05   0.079  0.93693    
## zipcode98039 1074229.67   58144.69  18.475  < 2e-16 ***
## zipcode98070  115702.36   45832.19   2.524  0.01185 *  
## zipcode98102  485359.79   45681.29  10.625  < 2e-16 ***
## zipcode98109  492156.91   45342.21  10.854  < 2e-16 ***
## zipcode98148   69899.53   50447.96   1.386  0.16640    
## waterfront    165102.07   68547.48   2.409  0.01632 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264000 on 595 degrees of freedom
## Multiple R-squared:  0.8224, Adjusted R-squared:  0.8188 
## F-statistic: 229.6 on 12 and 595 DF,  p-value: < 2.2e-16

Linear Regression

mad.lm1 <- mean(abs(test.set$price - predict(lm.1, test.set)))

The mean absolute deviation in housing price predictions using the linear model is approximately $180,149.

Polynomial Regression

Now include a squared term for square footage of living space, too.

train.set$sqft_living2 <- train.set$sqft_living^2
test.set$sqft_living2 <- test.set$sqft_living^2

lm.2 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_living2 + 
                   zipcode + waterfront, 
           data=train.set)
summary(lm.2)

Polynomial Regression

## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_living2 + 
##     zipcode + waterfront, data = train.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1023919   -87495     -575    72849  1223170 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.413e+05  4.228e+04   3.343 0.000882 ***
## bedrooms     -1.268e+04  1.176e+04  -1.078 0.281402    
## bathrooms     5.290e+03  1.532e+04   0.345 0.729936    
## sqft_living   2.262e+01  2.140e+01   1.057 0.290964    
## sqft_living2  4.711e-02  2.174e-03  21.669  < 2e-16 ***
## zipcode98014  5.378e+04  3.251e+04   1.654 0.098580 .  
## zipcode98024  5.298e+04  3.668e+04   1.444 0.149183    
## zipcode98032 -6.072e+04  3.250e+04  -1.868 0.062218 .  
## zipcode98039  1.079e+06  4.349e+04  24.817  < 2e-16 ***
## zipcode98070  1.151e+05  3.428e+04   3.356 0.000840 ***
## zipcode98102  4.350e+05  3.425e+04  12.701  < 2e-16 ***
## zipcode98109  4.772e+05  3.392e+04  14.069  < 2e-16 ***
## zipcode98148  1.528e+03  3.786e+04   0.040 0.967819    
## waterfront    1.612e+05  5.127e+04   3.144 0.001751 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 197400 on 594 degrees of freedom
## Multiple R-squared:  0.9008, Adjusted R-squared:  0.8986 
## F-statistic: 414.9 on 13 and 594 DF,  p-value: < 2.2e-16

Polynomial Regression

mad.lm2 <- mean(abs(test.set$price - predict(lm.2,test.set)))

Including this squared term lowers our predictive error from approximately $180,149 in the first case to approximately $138,583.

Decision Trees

Decision Trees
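The object tree1 evaluated below is not fit in these slides; presumably it is a regression tree fit on train.set with the rpart package (the classification slides later use rpart with method = 'class'), roughly `tree1 <- rpart(price ~ ., data = train.set)`. A self-contained stand-in on the built-in mtcars data, with a hypothetical name tree.demo:

```r
library(rpart)   # recursive partitioning trees; ships with R

# Regression tree (stands in for the price ~ . fit on train.set)
tree.demo <- rpart(mpg ~ ., data = mtcars)
mad.demo <- mean(abs(mtcars$mpg - predict(tree.demo, mtcars)))   # in-sample MAD
```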

rmse.tree1 <- sqrt(mean((test.set$price - predict(tree1,test.set))^2)) 
mad.tree1 <- mean(abs(test.set$price - predict(tree1,test.set)))
mad.tree1
## [1] 205934.8
mad.lm1
## [1] 180149.3
mad.lm2
## [1] 138583

The predictive error for this tree, approximately $205,935, is similar to the first linear model ($180,149) and not quite as good as our second linear model ($138,583).

Ensemble Methods - Random Forest

Ensemble methods combine a large set of predictive models into a single framework. One example is a random forest, which combines a large number of trees.

While these methods are very effective in a predictive setting, it is often difficult to directly assess the impact of particular variables in the model.
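A full random forest requires the randomForest package, but the core ensemble idea, averaging the predictions of many trees each fit to a bootstrap resample of the data (bagging), can be sketched with rpart alone. This uses the built-in mtcars data and our own variable names:

```r
library(rpart)   # tree fitting; ships with R

set.seed(1)
B <- 50                       # number of bootstrap trees
n <- nrow(mtcars)
preds <- matrix(NA, n, B)
for (b in 1:B) {
  boot.rows <- sample(1:n, n, replace = TRUE)            # bootstrap resample
  tree.b <- rpart(mpg ~ ., data = mtcars[boot.rows, ])   # tree on the resample
  preds[, b] <- predict(tree.b, mtcars)                  # each tree predicts all rows
}
bagged.pred <- rowMeans(preds)   # ensemble prediction: average across trees
```

A true random forest additionally considers only a random subset of the predictors at each candidate split, which decorrelates the trees and usually improves the ensemble further.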

Random Forest

One specific kind of ensemble method is known as a random forest, which combines several decision trees.

library(randomForest)
# Note: fit to the full Seattle data set, so the test rows are seen during
# fitting; using data = train.set would give a fairer comparison with the
# earlier models.
rf1 <- randomForest(price ~ ., data = Seattle)

mad.rf <- mean(abs(test.set$price - predict(rf1, test.set)))

The prediction error for the random forest, approximately $51,452, is substantially better than that of the other models we have identified.

Exercise: Prediction for Capital Bike Share

bikes <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/Bike.csv')
set.seed(11142017)
num.obs <- nrow(bikes)
# hold out 30% of the observations as a test set
test.ids <- base::sample(1:num.obs, size = round(num.obs * 0.3))
test.bikes <- bikes[test.ids, ]
train.bikes <- bikes[-test.ids, ]
dim(bikes)
## [1] 10886    12
dim(test.bikes)
## [1] 3266   12
dim(train.bikes)
## [1] 7620   12

Exercise: Prediction for Capital Bike Share

lm.bikes <- lm(count ~ holiday + atemp,
               data=train.bikes)
lm.mad <- mean(abs(test.bikes$count - predict(lm.bikes,test.bikes)))

Create another predictive model and compare its results to the MAD of the linear model above (127). However, don’t use both casual and registered in your model, as those two sum to the total count.

A Solution – Prediction for Capital Bike Share

rf.bikes <- randomForest(count ~ holiday + atemp +
     humidity + season + workingday + weather,
                data = train.bikes)
rf.mad <- mean(abs(test.bikes$count - predict(rf.bikes, test.bikes)))

The random forest has a prediction error (MAD) of 108.

Classification Methods

Classification - Given new points (*), how do we classify them?

Logistic Regression

## 
## Call:
## glm(formula = labels ~ x + y, family = "binomial", data = supervised)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93887  -0.00046   0.00000   0.00285   1.44209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   23.898     11.813   2.023   0.0431 *
## x            -47.482     23.973  -1.981   0.0476 *
## y             -6.214      3.456  -1.798   0.0722 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.629  on 99  degrees of freedom
## Residual deviance:  13.024  on 97  degrees of freedom
## AIC: 19.024
## 
## Number of Fisher Scoring iterations: 10

Logistic Regression

    x    y  Prob[Val = 1]
 0.20 0.70          1.000
 0.45 0.35          0.588
 0.48 0.30          0.319
 0.90 0.40          0.000
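Probabilities like those in the table above come from `predict(..., type = 'response')` on the fitted glm. A self-contained sketch on simulated data, since the `supervised` data set from the slides is not shown here (all names are our own):

```r
set.seed(1)
x <- runif(100); y <- runif(100)
# noisy threshold on x + y gives two overlapping classes
labels <- as.numeric(x + y + rnorm(100, sd = 0.2) > 1)
supervised.toy <- data.frame(x, y, labels)

glm.toy <- glm(labels ~ x + y, data = supervised.toy, family = 'binomial')

new.pts <- data.frame(x = c(0.20, 0.90), y = c(0.70, 0.40))
predict(glm.toy, new.pts, type = 'response')   # estimated Prob[label = 1]
```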

Decision Trees

Decision Trees - code

tree1 <- rpart(labels ~ ., data = supervised, method = 'class')
plot(tree1, margin = 1)
text(tree1, fancy = TRUE)

# tree1 must be fit before its predictions can be tabulated
kable(cbind(pred.points,
    round(predict(tree1, as.data.frame(pred.points))[,2], 3)),
    col.names = c('x', 'y', 'Prob[Val = 1]'))

Decision Trees - Boundary

Exercise: Predict Titanic Survival

titanic <- read.csv(
  'http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/titanic.csv')
set.seed(11142017)
titanic <- titanic %>% filter(!is.na(Age))   # drop passengers with missing Age
num.pass <- nrow(titanic)
test.ids <- base::sample(1:num.pass, size = round(num.pass * 0.3))
test.titanic <- titanic[test.ids, ]
train.titanic <- titanic[-test.ids, ]
dim(titanic)
## [1] 714  12
dim(test.titanic)
## [1] 214  12
dim(train.titanic)
## [1] 500  12

Exercise: Predict Titanic Survival

See if you can improve the classification error from the model below.

glm.titanic <- glm(Survived ~ Age, data=train.titanic, family='binomial')
Class.Error <- mean(test.titanic$Survived !=
    round(predict(glm.titanic, test.titanic, type = 'response')))

The logistic regression model using only age is wrong 44% of the time.

Unsupervised Learning

Supervised vs. Unsupervised Learning

Supervised

Unsupervised - How many clusters?

Unsupervised

k-means clustering

k-means clustering

## K-means clustering with 3 clusters of sizes 26, 30, 44
## 
## Cluster means:
##        [,1]      [,2]
## 1 0.1329025 0.8018608
## 2 0.2844651 0.1830810
## 3 0.7693055 0.6026478
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 1 3 3
##  [75] 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3
## 
## Within cluster sum of squares by cluster:
## [1] 1.175133 1.417926 1.915896
##  (between_SS / total_SS =  75.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

k-means clustering - code

km <- kmeans(combined, 3)
plot(combined, type = 'n', axes = F, xlab = '', ylab = '')
box()
# plot each point as its assigned cluster number, colored by true group
points(combined, pch = as.character(km$cluster),
       col = c(rep('dodgerblue', 25),
             rep('forestgreen', 25),
             rep('firebrick4', 50)))
# draw.circle() comes from the plotrix package
draw.circle(.31, -0.1, .335, border = 'dodgerblue')
draw.circle(.79, .65, .3, border = 'firebrick4')
draw.circle(.14, 1.05, .3, border = 'forestgreen')

Hierarchical clustering

Hierarchical clustering - with 3 clusters

Hierarchical clustering - with 4 clusters

Hierarchical clustering - code

hc <- hclust(dist(combined))   # hierarchical clustering on pairwise distances
plot(hc, hang = -1)            # dendrogram
plot(combined, type = 'n', axes = F, xlab = '', ylab = '')
box()
points(combined, pch = as.character(cutree(hc, 4)),   # cut the tree into 4 clusters
       col = c(rep('dodgerblue', 25),
             rep('forestgreen', 25),
             rep('firebrick4', 50)))

How to choose the number of clusters?

Given these plots that we have seen, how do we choose the number of clusters?

How to choose the number of clusters? - Scree plot

Scree plot - code

# total within-cluster sum of squares for k = 1, ..., 15
wss <- rep(0, 15)
for (i in 1:15) {
  wss[i] <- sum(kmeans(combined, centers = i)$withinss)
}
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

Data with more than 2 dimensions

     warm-blooded  can fly  vertebrate  endangered  have hair
ant  No            No       No          No          No
bee  No            Yes      No          No          Yes
cat  Yes           No       Yes         No          Yes
cow  Yes           No       Yes         No          Yes
duc  Yes           Yes      Yes         No          No
eag  Yes           Yes      Yes         Yes         No
ele  Yes           No       Yes         Yes         No
fly  No            Yes      No          No          No
fro  No            No       Yes         Yes         No
lio  Yes           No       Yes         Yes         Yes
liz  No            No       Yes         No          No
lob  No            No       No          No          No
rab  Yes           No       Yes         No          Yes
spi  No            No       No          No          Yes
wha  Yes           No       Yes         Yes         No

Multidimensional Scaling

MDS - Code

animals <- cluster::animals

colnames(animals) <- c("warm-blooded", 
                       "can fly",
                       "vertebrate",
                       "endangered",
                       "live in groups",
                       "have hair")
animals.cluster <- animals[, -5]   # drop the 'live in groups' column
# remove five animals and fill in two entries (handling missing values)
animals.cluster <- animals.cluster[-c(4, 5, 12, 16, 18), ]
animals.cluster[10, 4] <- 2
animals.cluster[14, 4] <- 1

d <- dist(animals.cluster)   # pairwise distances between animals
fit <- cmdscale(d, k = 2)    # classical multidimensional scaling to 2 dimensions
# jitter the coordinates slightly so overlapping labels remain readable
# (note: nrow(fit) * 2, not nrow(fit * 2), to get one draw per coordinate)
fit.jitter <- fit + runif(nrow(fit) * 2, -.15, .15)
plot(fit.jitter[,1], fit.jitter[,2], xlab = "", ylab = "", main = "", type = "n", axes = F)
box()
text(fit.jitter[,1], fit.jitter[,2], labels = row.names(animals.cluster), cex=1.3)

Hierarchical Clustering of Animals

Exercise: Clustering Zoo Animals

Use the dataset created below for the following questions.

zoo.data <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/ZooClean.csv')
rownames(zoo.data) <- zoo.data[,1]
zoo.data <- zoo.data[,-1]

  • Use multidimensional scaling to visualize the data in two dimensions. What are two animals that are very similar and two that are very different?

  • Create a hierarchical clustering object for this dataset. Why are a leopard and raccoon clustered together for any cluster size?

  • Now add colors corresponding to four different clusters to your MDS plot. Interpret what each of the four clusters corresponds to.