This example demonstrates predictive modeling of a quantitative outcome variable (home sale price) using the following methods: linear regression, polynomial regression, decision trees, and random forests.

Load data

Seattle Housing Data (King County)
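
The code below relies on several add-on packages. A minimal setup, assuming skimr, ggplot2, rpart, randomForest, tibble, and knitr are installed:

library(skimr)         # skim() data summaries
library(ggplot2)       # scatterplots with smoothers
library(rpart)         # decision trees
library(randomForest)  # random forests
library(tibble)        # tibble() for the results table
library(knitr)         # kable() for printing the results table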

Seattle <- read.csv('https://math.montana.edu/shancock/data/SeattleHousing.csv')
head(Seattle)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
## 1 1350000        3      2.50        2753    65005    1.0          1       2165
## 2  228000        3      1.00        1190     9199    1.0          0       1190
## 3  289000        3      1.75        1260     8400    1.0          0       1260
## 4  720000        4      2.50        3450    39683    2.0          0       3450
## 5  247500        3      1.75        1960    15681    1.0          0       1960
## 6  850830        3      2.50        2070    13241    1.5          0       1270
##   sqft_basement zipcode     lat     long yr_sold mn_sold
## 1           588   98070 47.4041 -122.451    2015       3
## 2             0   98148 47.4258 -122.322    2014       9
## 3             0   98148 47.4366 -122.335    2014       8
## 4             0   98010 47.3420 -122.025    2015       3
## 5             0   98032 47.3576 -122.277    2015       3
## 6           800   98102 47.6415 -122.315    2014       6
skim(Seattle)
Data summary

Name                     Seattle
Number of rows           869
Number of columns        14
Column type frequency:
  numeric                14
Group variables          None

Variable type: numeric

skim_variable n_missing  complete_rate       mean         sd        p0        p25        p50        p75        p100  hist
price                 0              1  632591.46  634826.07  80000.00  289950.00  464950.00  710000.00  7700000.00  ▇▁▁▁▁
bedrooms              0              1       3.20       0.92      0.00       3.00       3.00       4.00        9.00  ▁▇▃▁▁
bathrooms             0              1       2.08       0.91      0.00       1.50       2.00       2.50        8.00  ▅▇▂▁▁
sqft_living           0              1    2114.43    1152.25    290.00    1350.00    1850.00    2600.00    12050.00  ▇▂▁▁▁
sqft_lot              0              1   46872.30  106328.12    713.00    6000.00   10350.00   32340.00  1164794.00  ▇▁▁▁▁
floors                0              1       1.51       0.54      1.00       1.00       1.50       2.00        3.50  ▇▅▁▁▁
waterfront            0              1       0.03       0.17      0.00       0.00       0.00       0.00        1.00  ▇▁▁▁▁
sqft_above            0              1    1855.57    1018.84    290.00    1180.00    1590.00    2180.00     8860.00  ▇▃▁▁▁
sqft_basement         0              1     258.86     440.10      0.00       0.00       0.00     460.00     3480.00  ▇▂▁▁▁
zipcode               0              1   98057.44      43.36  98010.00   98014.00   98039.00   98102.00    98148.00  ▇▁▂▃▁
lat                   0              1      47.52       0.13     47.29      47.38      47.56      47.64       47.72  ▆▅▁▆▇
long                  0              1    -122.19       0.23   -122.52    -122.34    -122.29    -121.96     -121.32  ▇▂▅▁▁
yr_sold               0              1    2014.29       0.46   2014.00    2014.00    2014.00    2015.00     2015.00  ▇▁▁▁▃
mn_sold               0              1       6.76       3.10      1.00       4.00       7.00       9.00       12.00  ▆▇▇▆▇

Create training and testing data sets

set.seed(05032022)
num.houses <- nrow(Seattle)
# need zipcode to be factor for modeling and clustering
Seattle$zipcode <- as.factor(Seattle$zipcode)
test.ids <- base::sample(1:num.houses, size=round(num.houses*.3))
test.set <- Seattle[test.ids,]
train.set <- Seattle[-test.ids,]  # all rows not sampled into the test set

Fit linear regression model on training data set

Predict price using all available predictors in the data set.

lm.1 <- lm(price ~ ., data=train.set)
summary(lm.1)
## 
## Call:
## lm(formula = price ~ ., data = train.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -785259 -113474    1493  101330 2594189 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.592e+07  7.730e+07  -0.723 0.469751    
## bedrooms      -6.018e+04  1.352e+04  -4.452 1.02e-05 ***
## bathrooms      1.103e+04  2.012e+04   0.548 0.583599    
## sqft_living    2.284e+02  2.810e+01   8.127 2.60e-15 ***
## sqft_lot      -8.747e-02  1.034e-01  -0.846 0.398129    
## floors        -5.778e+04  2.514e+04  -2.298 0.021920 *  
## waterfront     2.098e+05  6.303e+04   3.329 0.000927 ***
## sqft_above     1.468e+02  2.986e+01   4.915 1.15e-06 ***
## sqft_basement         NA         NA      NA       NA    
## zipcode98014  -1.181e+05  1.500e+05  -0.787 0.431343    
## zipcode98024  -4.584e+04  1.065e+05  -0.430 0.667125    
## zipcode98032  -7.243e+04  6.688e+04  -1.083 0.279307    
## zipcode98039   1.036e+06  1.535e+05   6.748 3.61e-11 ***
## zipcode98070   2.049e+03  1.019e+05   0.020 0.983967    
## zipcode98102   2.855e+05  1.618e+05   1.765 0.078112 .  
## zipcode98109   3.388e+05  1.639e+05   2.067 0.039185 *  
## zipcode98148  -3.438e+04  9.242e+04  -0.372 0.710022    
## lat            4.904e+05  4.408e+05   1.112 0.266381    
## long          -7.261e+04  1.658e+05  -0.438 0.661591    
## yr_sold        1.178e+04  3.442e+04   0.342 0.732168    
## mn_sold        4.947e+03  4.976e+03   0.994 0.320530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 242200 on 588 degrees of freedom
## Multiple R-squared:  0.8334, Adjusted R-squared:  0.828 
## F-statistic: 154.8 on 19 and 588 DF,  p-value: < 2.2e-16

Why would the model give us NA for the sqft_basement coefficient? Let’s investigate…

alias(lm.1)
## Model :
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + sqft_above + sqft_basement + zipcode + lat + 
##     long + yr_sold + mn_sold
## 
## Complete :
##               (Intercept) bedrooms bathrooms sqft_living sqft_lot floors
## sqft_basement  0           0        0         1           0        0    
##               waterfront sqft_above zipcode98014 zipcode98024 zipcode98032
## sqft_basement  0         -1          0            0            0          
##               zipcode98039 zipcode98070 zipcode98102 zipcode98109 zipcode98148
## sqft_basement  0            0            0            0            0          
##               lat long yr_sold mn_sold
## sqft_basement  0   0    0       0

The variable sqft_basement is linearly dependent on the other predictors: the alias output shows that sqft_basement = sqft_living - sqft_above. We can confirm this directly:

lin.comb <- Seattle$sqft_living - Seattle$sqft_above  # implied basement square footage
cor(Seattle$sqft_basement, lin.comb)
## [1] 1

Let’s remove the redundant sqft_basement variable:

lm.2 <- update(lm.1, . ~ . - sqft_basement)
summary(lm.2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + sqft_above + zipcode + lat + long + 
##     yr_sold + mn_sold, data = train.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -785259 -113474    1493  101330 2594189 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.592e+07  7.730e+07  -0.723 0.469751    
## bedrooms     -6.018e+04  1.352e+04  -4.452 1.02e-05 ***
## bathrooms     1.103e+04  2.012e+04   0.548 0.583599    
## sqft_living   2.284e+02  2.810e+01   8.127 2.60e-15 ***
## sqft_lot     -8.747e-02  1.034e-01  -0.846 0.398129    
## floors       -5.778e+04  2.514e+04  -2.298 0.021920 *  
## waterfront    2.098e+05  6.303e+04   3.329 0.000927 ***
## sqft_above    1.468e+02  2.986e+01   4.915 1.15e-06 ***
## zipcode98014 -1.181e+05  1.500e+05  -0.787 0.431343    
## zipcode98024 -4.584e+04  1.065e+05  -0.430 0.667125    
## zipcode98032 -7.243e+04  6.688e+04  -1.083 0.279307    
## zipcode98039  1.036e+06  1.535e+05   6.748 3.61e-11 ***
## zipcode98070  2.049e+03  1.019e+05   0.020 0.983967    
## zipcode98102  2.855e+05  1.618e+05   1.765 0.078112 .  
## zipcode98109  3.388e+05  1.639e+05   2.067 0.039185 *  
## zipcode98148 -3.438e+04  9.242e+04  -0.372 0.710022    
## lat           4.904e+05  4.408e+05   1.112 0.266381    
## long         -7.261e+04  1.658e+05  -0.438 0.661591    
## yr_sold       1.178e+04  3.442e+04   0.342 0.732168    
## mn_sold       4.947e+03  4.976e+03   0.994 0.320530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 242200 on 588 degrees of freedom
## Multiple R-squared:  0.8334, Adjusted R-squared:  0.828 
## F-statistic: 154.8 on 19 and 588 DF,  p-value: < 2.2e-16

There are a few predictors that are highly correlated with each other. Collinearity mainly muddies the interpretation of individual coefficients; since our goal is prediction rather than interpretation, it doesn’t hurt to keep these predictors in the model as long as the coefficient estimates are stable.

cor(Seattle[,c(2:8)])
##                bedrooms   bathrooms sqft_living    sqft_lot      floors
## bedrooms     1.00000000  0.52232308  0.57374226 -0.01271588  0.21313254
## bathrooms    0.52232308  1.00000000  0.81334491  0.15817494  0.49353402
## sqft_living  0.57374226  0.81334491  1.00000000  0.21536597  0.39275312
## sqft_lot    -0.01271588  0.15817494  0.21536597  1.00000000  0.08409779
## floors       0.21313254  0.49353402  0.39275312  0.08409779  1.00000000
## waterfront  -0.14269036 -0.04460811 -0.03553199  0.01064276 -0.05450023
## sqft_above   0.53382408  0.76489684  0.92508713  0.26758259  0.46804074
##              waterfront  sqft_above
## bedrooms    -0.14269036  0.53382408
## bathrooms   -0.04460811  0.76489684
## sqft_living -0.03553199  0.92508713
## sqft_lot     0.01064276  0.26758259
## floors      -0.05450023  0.46804074
## waterfront   1.00000000 -0.04185858
## sqft_above  -0.04185858  1.00000000

For example, the correlation between sqft_living and sqft_above is 0.9251.

Evaluate predictive power of regression model on test set

There are many ways to measure prediction error. One common measure is the “root mean squared error” (RMSE): the square root of the mean of the squared differences between the actual and predicted outcome values. Alternatively, one could use the “mean absolute error” (abbreviated MAD in the code below): the mean of the absolute differences between observed and predicted prices. The key is that we measure prediction error on the testing data set. (Why? Because the model was chosen to fit the training data, error computed on the training set is optimistically biased; the held-out test set stands in for genuinely new data.)
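
Both metrics are simple enough to wrap in one-line helpers. This is only a convenience sketch (the names rmse and mae are illustrative, not from any package); the code below computes the same quantities inline.

# Hypothetical helpers for the two error metrics
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))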

rmse.lm2 <- sqrt(mean((test.set$price - predict(lm.2, test.set))^2))
rmse.lm2
## [1] 326654.2
mad.lm2 <- mean(abs(test.set$price - predict(lm.2, test.set)))
mad.lm2
## [1] 167670.9

Since the outcome variable is not standardized, there is no absolute benchmark for these values (they are in dollars, the units of the outcome). Instead, we use them to compare models.

Note that some of the predicted values are negative. This is one limitation of linear regression: there is no restriction on the range of predicted values, even though a selling price can never be negative.

summary(predict(lm.2, test.set))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -138906  252590  542236  649028  852676 3981180
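
To count how many test-set predictions actually fall below zero (output not shown; the exact count depends on the random split above):

sum(predict(lm.2, test.set) < 0)  # number of negative predicted prices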

Polynomial regression model

Had we done some exploratory data analysis with the data set prior to fitting prediction models (which is always recommended!), we would have noticed that not all relationships between quantitative predictors and the outcome price appear linear. In particular, the relationship between sqft_living and price looks quadratic:

sqft.plot <- ggplot(data=Seattle, aes(x = sqft_living, y = price)) + 
  geom_point(alpha=.5) + 
  geom_smooth(method='loess') +   # loess smoother suggests curvature
  labs(title='King County Housing Sales', 
       x='Living Space (sqft)',   # labs() takes x/y, not xlab/ylab
       y='Closing Price ($)')
sqft.plot
## `geom_smooth()` using formula 'y ~ x'

Let’s add a squared sqft_living term to the data set and use that as one of our predictors:

train.set$sqft_living2 <- train.set$sqft_living^2
test.set$sqft_living2 <- test.set$sqft_living^2
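
As an aside, the same quadratic model can be fit without creating a new column by wrapping the squared term in I() inside the formula. A sketch (lm.3.alt is an illustrative name), as it would apply to the data sets before sqft_living2 was added:

# Hypothetical equivalent of lm.3 below, using I() instead of a new column
lm.3.alt <- lm(price ~ . - sqft_basement + I(sqft_living^2), data = train.set)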

lm.3 <- lm(price ~ . - sqft_basement, data=train.set)
summary(lm.3)
## 
## Call:
## lm(formula = price ~ . - sqft_basement, data = train.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -982749  -79310    -645   73588 1256862 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.878e+07  6.520e+07  -0.902  0.36769    
## bedrooms     -1.653e+04  1.174e+04  -1.408  0.15977    
## bathrooms     2.280e+03  1.698e+04   0.134  0.89323    
## sqft_living  -6.637e+01  3.040e+01  -2.183  0.02943 *  
## sqft_lot      3.031e-02  8.758e-02   0.346  0.72944    
## floors        1.238e+04  2.169e+04   0.571  0.56827    
## waterfront    2.091e+05  5.316e+04   3.934 9.36e-05 ***
## sqft_above    6.673e+01  2.571e+01   2.596  0.00968 ** 
## zipcode98014 -1.109e+05  1.265e+05  -0.877  0.38093    
## zipcode98024 -3.988e+04  8.986e+04  -0.444  0.65732    
## zipcode98032 -1.374e+05  5.657e+04  -2.428  0.01547 *  
## zipcode98039  9.457e+05  1.296e+05   7.297 9.59e-13 ***
## zipcode98070 -5.961e+04  8.604e+04  -0.693  0.48868    
## zipcode98102  1.934e+05  1.366e+05   1.416  0.15738    
## zipcode98109  2.636e+05  1.383e+05   1.906  0.05718 .  
## zipcode98148 -1.090e+05  7.810e+04  -1.395  0.16341    
## lat           4.936e+05  3.718e+05   1.328  0.18475    
## long         -1.789e+05  1.400e+05  -1.278  0.20189    
## yr_sold       6.826e+03  2.903e+04   0.235  0.81419    
## mn_sold       3.326e+03  4.198e+03   0.792  0.42856    
## sqft_living2  5.301e-02  3.425e-03  15.477  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 204300 on 587 degrees of freedom
## Multiple R-squared:  0.8817, Adjusted R-squared:  0.8777 
## F-statistic: 218.8 on 20 and 587 DF,  p-value: < 2.2e-16
rmse.lm3 <- sqrt(mean((test.set$price - predict(lm.3, test.set))^2)) 
rmse.lm2
## [1] 326654.2
rmse.lm3
## [1] 197585
mad.lm3 <- mean(abs(test.set$price - predict(lm.3, test.set)))
mad.lm2
## [1] 167670.9
mad.lm3
## [1] 124758.2

Decision tree

tree1 <- rpart(price ~ ., data = train.set, method = 'anova')  # 'anova' fits a regression tree
plot(tree1)
text(tree1)
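
The base plot()/text() rendering is quite spare. As an optional aside (not part of the original example), the rpart.plot package, if installed, draws a more readable version of the same tree:

library(rpart.plot)
rpart.plot(tree1)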

print(tree1)
## n= 608 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 608 2.071105e+14  624046.4  
##    2) sqft_living< 3985 572 6.091421e+13  523777.6  
##      4) zipcode=98010,98014,98024,98032,98070,98148 406 1.270083e+13  389711.6  
##        8) sqft_above< 1815 265 3.832833e+12  311606.0 *
##        9) sqft_above>=1815 141 4.213020e+12  536505.8 *
##      5) zipcode=98039,98102,98109 166 2.306838e+13  851674.1  
##       10) sqft_living< 2585 126 5.463168e+12  699777.8 *
##       11) sqft_living>=2585 40 5.540607e+12 1330148.0 *
##    3) sqft_living>=3985 36 4.907191e+13 2217206.0  
##      6) zipcode=98010,98014,98024,98032,98102 17 5.258377e+12 1460553.0 *
##      7) zipcode=98039,98109 19 2.537226e+13 2894211.0 *

To view what output is displayed when printing a tree object in R, type ?print.rpart in the console:

“A semi-graphical layout of the contents of x$frame is printed. Indentation is used to convey the tree topology. Information for each node includes the node number, split, size, deviance, and fitted value. For the "class" method, the class probabilities are also printed.”

rmse.tree1 <- sqrt(mean((test.set$price - predict(tree1, test.set))^2)) 
rmse.tree1
## [1] 487873.3
rmse.lm3
## [1] 197585
mad.tree1 <- mean(abs(test.set$price - predict(tree1, test.set)))
mad.tree1
## [1] 200319.5
mad.lm3
## [1] 124758.2

The decision tree does quite a bit worse than the polynomial regression. The polynomial regression gave an RMSE of about 197,585 and a MAD of about 124,758, whereas the decision tree produced an RMSE of about 487,873 and a MAD of about 200,320.

Random forest

A random forest fits many decision trees. Each time it grows a tree, it uses a random subset (a bootstrap sample) of the observations, and at each split it considers only a random subset of the predictor variables. Thus, it uses randomness to produce a different tree each time, then averages predictions across all trees.
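
The amount of randomness is controlled by arguments to randomForest(). A sketch with the two most common ones (rf.tuned is an illustrative name; the values shown match the package defaults for this regression problem):

# ntree: number of trees grown (default 500)
# mtry:  number of predictors considered at each split
#        (for regression, the default is about one third of the predictors)
rf.tuned <- randomForest(price ~ ., data = train.set, ntree = 500, mtry = 4)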

rf1 <- randomForest(price ~ ., data = Seattle)
plot(rf1)

Note that rf1 is fit to the full Seattle data set rather than to train.set, so the test-set houses were part of its training data and the error measures below are somewhat optimistic; refitting with data = train.set would give a fairer comparison with the earlier models.

rmse.rf1 <- sqrt(mean((test.set$price - predict(rf1, test.set))^2)) 
rmse.rf1
## [1] 128589.3
rmse.lm2
## [1] 326654.2
mad.rf1 <- mean(abs(test.set$price - predict(rf1, test.set)))
mad.rf1
## [1] 51416.4
mad.lm2
## [1] 167670.9

Comparison across all three methods

rslts <- tibble(
  Method = c("Polynomial regression", "Decision tree", "Random forest"),
  RMSE = c(rmse.lm3, rmse.tree1, rmse.rf1),
  MAD = c(mad.lm3, mad.tree1, mad.rf1)
)
kable(rslts)
Method                       RMSE        MAD
---------------------   ---------  ---------
Polynomial regression    197585.0   124758.2
Decision tree            487873.3   200319.5
Random forest            128589.3    51416.4

Attribution

Example adapted from Dr. Hoegh’s STAT 408 notes.