This example will demonstrate predictive modeling of a quantitative outcome variable using three methods: multiple linear regression (including a polynomial term), regression trees, and random forests.
Seattle Housing Data (King County)
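The code in this example relies on functions from several packages. A minimal setup chunk (assuming these packages are installed) would be:

library(skimr)         # skim() data summaries
library(ggplot2)       # scatterplots with smoothers
library(rpart)         # regression trees
library(randomForest)  # random forests
library(tibble)        # tibble() for the results table
library(knitr)         # kable() for printing tables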
Seattle <- read.csv('https://math.montana.edu/shancock/data/SeattleHousing.csv')
head(Seattle)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront sqft_above
## 1 1350000 3 2.50 2753 65005 1.0 1 2165
## 2 228000 3 1.00 1190 9199 1.0 0 1190
## 3 289000 3 1.75 1260 8400 1.0 0 1260
## 4 720000 4 2.50 3450 39683 2.0 0 3450
## 5 247500 3 1.75 1960 15681 1.0 0 1960
## 6 850830 3 2.50 2070 13241 1.5 0 1270
## sqft_basement zipcode lat long yr_sold mn_sold
## 1 588 98070 47.4041 -122.451 2015 3
## 2 0 98148 47.4258 -122.322 2014 9
## 3 0 98148 47.4366 -122.335 2014 8
## 4 0 98010 47.3420 -122.025 2015 3
## 5 0 98032 47.3576 -122.277 2015 3
## 6 800 98102 47.6415 -122.315 2014 6
skim(Seattle)
Name | Seattle |
Number of rows | 869 |
Number of columns | 14 |
Column type frequency: numeric | 14 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
price | 0 | 1 | 632591.46 | 634826.07 | 80000.00 | 289950.00 | 464950.00 | 710000.00 | 7700000.00 | ▇▁▁▁▁ |
bedrooms | 0 | 1 | 3.20 | 0.92 | 0.00 | 3.00 | 3.00 | 4.00 | 9.00 | ▁▇▃▁▁ |
bathrooms | 0 | 1 | 2.08 | 0.91 | 0.00 | 1.50 | 2.00 | 2.50 | 8.00 | ▅▇▂▁▁ |
sqft_living | 0 | 1 | 2114.43 | 1152.25 | 290.00 | 1350.00 | 1850.00 | 2600.00 | 12050.00 | ▇▂▁▁▁ |
sqft_lot | 0 | 1 | 46872.30 | 106328.12 | 713.00 | 6000.00 | 10350.00 | 32340.00 | 1164794.00 | ▇▁▁▁▁ |
floors | 0 | 1 | 1.51 | 0.54 | 1.00 | 1.00 | 1.50 | 2.00 | 3.50 | ▇▅▁▁▁ |
waterfront | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
sqft_above | 0 | 1 | 1855.57 | 1018.84 | 290.00 | 1180.00 | 1590.00 | 2180.00 | 8860.00 | ▇▃▁▁▁ |
sqft_basement | 0 | 1 | 258.86 | 440.10 | 0.00 | 0.00 | 0.00 | 460.00 | 3480.00 | ▇▂▁▁▁ |
zipcode | 0 | 1 | 98057.44 | 43.36 | 98010.00 | 98014.00 | 98039.00 | 98102.00 | 98148.00 | ▇▁▂▃▁ |
lat | 0 | 1 | 47.52 | 0.13 | 47.29 | 47.38 | 47.56 | 47.64 | 47.72 | ▆▅▁▆▇ |
long | 0 | 1 | -122.19 | 0.23 | -122.52 | -122.34 | -122.29 | -121.96 | -121.32 | ▇▂▅▁▁ |
yr_sold | 0 | 1 | 2014.29 | 0.46 | 2014.00 | 2014.00 | 2014.00 | 2015.00 | 2015.00 | ▇▁▁▁▃ |
mn_sold | 0 | 1 | 6.76 | 3.10 | 1.00 | 4.00 | 7.00 | 9.00 | 12.00 | ▆▇▇▆▇ |
Before fitting models, split the houses into a training set (70%) used to fit each model and a test set (30%) held out for estimating prediction error.

set.seed(05032022)
num.houses <- nrow(Seattle)
# zipcode needs to be a factor for modeling and clustering
Seattle$zipcode <- as.factor(Seattle$zipcode)
test.ids <- base::sample(1:num.houses, size = round(num.houses * 0.3))
test.set <- Seattle[test.ids, ]
train.set <- Seattle[-test.ids, ]
Predict price using all available predictors in the data set.
lm.1 <- lm(price ~ ., data=train.set)
summary(lm.1)
##
## Call:
## lm(formula = price ~ ., data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -785259 -113474 1493 101330 2594189
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.592e+07 7.730e+07 -0.723 0.469751
## bedrooms -6.018e+04 1.352e+04 -4.452 1.02e-05 ***
## bathrooms 1.103e+04 2.012e+04 0.548 0.583599
## sqft_living 2.284e+02 2.810e+01 8.127 2.60e-15 ***
## sqft_lot -8.747e-02 1.034e-01 -0.846 0.398129
## floors -5.778e+04 2.514e+04 -2.298 0.021920 *
## waterfront 2.098e+05 6.303e+04 3.329 0.000927 ***
## sqft_above 1.468e+02 2.986e+01 4.915 1.15e-06 ***
## sqft_basement NA NA NA NA
## zipcode98014 -1.181e+05 1.500e+05 -0.787 0.431343
## zipcode98024 -4.584e+04 1.065e+05 -0.430 0.667125
## zipcode98032 -7.243e+04 6.688e+04 -1.083 0.279307
## zipcode98039 1.036e+06 1.535e+05 6.748 3.61e-11 ***
## zipcode98070 2.049e+03 1.019e+05 0.020 0.983967
## zipcode98102 2.855e+05 1.618e+05 1.765 0.078112 .
## zipcode98109 3.388e+05 1.639e+05 2.067 0.039185 *
## zipcode98148 -3.438e+04 9.242e+04 -0.372 0.710022
## lat 4.904e+05 4.408e+05 1.112 0.266381
## long -7.261e+04 1.658e+05 -0.438 0.661591
## yr_sold 1.178e+04 3.442e+04 0.342 0.732168
## mn_sold 4.947e+03 4.976e+03 0.994 0.320530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 242200 on 588 degrees of freedom
## Multiple R-squared: 0.8334, Adjusted R-squared: 0.828
## F-statistic: 154.8 on 19 and 588 DF, p-value: < 2.2e-16
Why would the model be giving us NAs for the sqft_basement coefficient? Let’s investigate…
alias(lm.1)
## Model :
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + sqft_above + sqft_basement + zipcode + lat +
## long + yr_sold + mn_sold
##
## Complete :
## (Intercept) bedrooms bathrooms sqft_living sqft_lot floors
## sqft_basement 0 0 0 1 0 0
## waterfront sqft_above zipcode98014 zipcode98024 zipcode98032
## sqft_basement 0 -1 0 0 0
## zipcode98039 zipcode98070 zipcode98102 zipcode98109 zipcode98148
## sqft_basement 0 0 0 0 0
## lat long yr_sold mn_sold
## sqft_basement 0 0 0 0
The variable sqft_basement is linearly dependent on the other variables: the alias output shows coefficients of 1 on sqft_living and -1 on sqft_above, i.e., sqft_basement = sqft_living - sqft_above.
lin.comb <- Seattle$sqft_living - Seattle$sqft_above
cor(Seattle$sqft_basement, lin.comb)
## [1] 1
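In fact, given the alias coefficients of 1 and -1 and the correlation of exactly 1 above, the relationship should hold exactly for every house. A quick check (a sketch; output not shown, but it should return TRUE):

# sqft_living should equal sqft_above + sqft_basement row by row
all(Seattle$sqft_living == Seattle$sqft_above + Seattle$sqft_basement)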
Let’s remove the redundant sqft_basement variable:
lm.2 <- update(lm.1, . ~ . - sqft_basement)
summary(lm.2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot +
## floors + waterfront + sqft_above + zipcode + lat + long +
## yr_sold + mn_sold, data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -785259 -113474 1493 101330 2594189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.592e+07 7.730e+07 -0.723 0.469751
## bedrooms -6.018e+04 1.352e+04 -4.452 1.02e-05 ***
## bathrooms 1.103e+04 2.012e+04 0.548 0.583599
## sqft_living 2.284e+02 2.810e+01 8.127 2.60e-15 ***
## sqft_lot -8.747e-02 1.034e-01 -0.846 0.398129
## floors -5.778e+04 2.514e+04 -2.298 0.021920 *
## waterfront 2.098e+05 6.303e+04 3.329 0.000927 ***
## sqft_above 1.468e+02 2.986e+01 4.915 1.15e-06 ***
## zipcode98014 -1.181e+05 1.500e+05 -0.787 0.431343
## zipcode98024 -4.584e+04 1.065e+05 -0.430 0.667125
## zipcode98032 -7.243e+04 6.688e+04 -1.083 0.279307
## zipcode98039 1.036e+06 1.535e+05 6.748 3.61e-11 ***
## zipcode98070 2.049e+03 1.019e+05 0.020 0.983967
## zipcode98102 2.855e+05 1.618e+05 1.765 0.078112 .
## zipcode98109 3.388e+05 1.639e+05 2.067 0.039185 *
## zipcode98148 -3.438e+04 9.242e+04 -0.372 0.710022
## lat 4.904e+05 4.408e+05 1.112 0.266381
## long -7.261e+04 1.658e+05 -0.438 0.661591
## yr_sold 1.178e+04 3.442e+04 0.342 0.732168
## mn_sold 4.947e+03 4.976e+03 0.994 0.320530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 242200 on 588 degrees of freedom
## Multiple R-squared: 0.8334, Adjusted R-squared: 0.828
## F-statistic: 154.8 on 19 and 588 DF, p-value: < 2.2e-16
There are a few predictors that are highly correlated, but since our goal is prediction, it doesn’t hurt to keep them in the model as long as the coefficient estimates are stable.
cor(Seattle[,c(2:8)])
## bedrooms bathrooms sqft_living sqft_lot floors
## bedrooms 1.00000000 0.52232308 0.57374226 -0.01271588 0.21313254
## bathrooms 0.52232308 1.00000000 0.81334491 0.15817494 0.49353402
## sqft_living 0.57374226 0.81334491 1.00000000 0.21536597 0.39275312
## sqft_lot -0.01271588 0.15817494 0.21536597 1.00000000 0.08409779
## floors 0.21313254 0.49353402 0.39275312 0.08409779 1.00000000
## waterfront -0.14269036 -0.04460811 -0.03553199 0.01064276 -0.05450023
## sqft_above 0.53382408 0.76489684 0.92508713 0.26758259 0.46804074
## waterfront sqft_above
## bedrooms -0.14269036 0.53382408
## bathrooms -0.04460811 0.76489684
## sqft_living -0.03553199 0.92508713
## sqft_lot 0.01064276 0.26758259
## floors -0.05450023 0.46804074
## waterfront 1.00000000 -0.04185858
## sqft_above -0.04185858 1.00000000
For example, the correlation between sqft_living and sqft_above is 0.9251.
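Variance inflation factors are one way to check whether this collinearity is destabilizing the coefficient estimates (a sketch, assuming the car package is installed; zipcode, being a factor, receives a generalized VIF):

library(car)
# VIFs much larger than ~10 suggest unstable coefficient estimates
vif(lm.2)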
There are many ways to measure prediction error. One common way is the “root mean squared error” — the square root of the mean of the squared differences between the actual outcome value and the predicted value. Alternatively, one could use the “mean absolute error”, which is the mean of the absolute value of the differences between observed and predicted prices. The key is that we are measuring prediction error using the testing data set (why?).
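Written as small helper functions (a sketch; the chunks below compute the same quantities inline):

# root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
# mean absolute error (called MAD in the object names below)
mae <- function(actual, predicted) mean(abs(actual - predicted))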
rmse.lm2 <- sqrt(mean((test.set$price - predict(lm.2, test.set))^2))
rmse.lm2
## [1] 326654.2
mad.lm2 <- mean(abs(test.set$price - predict(lm.2, test.set)))
mad.lm2
## [1] 167670.9
Since the outcome variable is not standardized, there is no inherent meaning to these values. Instead, we use them to compare models.
Note that some of the predicted values are negative. This is one limitation to linear regression — there is no restriction on the range of predicted values, even though selling price can never be negative.
summary(predict(lm.2, test.set))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -138906 252590 542236 649028 852676 3981180
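For instance, we can count how many of the test-set predictions fall below zero (a quick check; output not shown):

sum(predict(lm.2, test.set) < 0)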
Had we done some exploratory data analysis with the data set prior to fitting prediction models (which is always recommended!), we would have noticed that not all relationships between quantitative predictors and the outcome price appear linear. In particular, the relationship between sqft_living and price looks quadratic:
sqft.plot <- ggplot(data = Seattle, aes(x = sqft_living, y = price)) +
  geom_point(alpha = .5) +
  geom_smooth(method = 'loess') +
  labs(title = 'King County Housing Sales',
       x = 'Living Space (sqft)',     # labs() uses x/y, not xlab/ylab
       y = 'Closing Price ($)')
sqft.plot
## `geom_smooth()` using formula 'y ~ x'
Let’s add a squared sqft_living term to the data set and use that as one of our predictors:
train.set$sqft_living2 <- train.set$sqft_living^2
test.set$sqft_living2 <- test.set$sqft_living^2
lm.3 <- lm(price ~ . - sqft_basement, data=train.set)
summary(lm.3)
##
## Call:
## lm(formula = price ~ . - sqft_basement, data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -982749 -79310 -645 73588 1256862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.878e+07 6.520e+07 -0.902 0.36769
## bedrooms -1.653e+04 1.174e+04 -1.408 0.15977
## bathrooms 2.280e+03 1.698e+04 0.134 0.89323
## sqft_living -6.637e+01 3.040e+01 -2.183 0.02943 *
## sqft_lot 3.031e-02 8.758e-02 0.346 0.72944
## floors 1.238e+04 2.169e+04 0.571 0.56827
## waterfront 2.091e+05 5.316e+04 3.934 9.36e-05 ***
## sqft_above 6.673e+01 2.571e+01 2.596 0.00968 **
## zipcode98014 -1.109e+05 1.265e+05 -0.877 0.38093
## zipcode98024 -3.988e+04 8.986e+04 -0.444 0.65732
## zipcode98032 -1.374e+05 5.657e+04 -2.428 0.01547 *
## zipcode98039 9.457e+05 1.296e+05 7.297 9.59e-13 ***
## zipcode98070 -5.961e+04 8.604e+04 -0.693 0.48868
## zipcode98102 1.934e+05 1.366e+05 1.416 0.15738
## zipcode98109 2.636e+05 1.383e+05 1.906 0.05718 .
## zipcode98148 -1.090e+05 7.810e+04 -1.395 0.16341
## lat 4.936e+05 3.718e+05 1.328 0.18475
## long -1.789e+05 1.400e+05 -1.278 0.20189
## yr_sold 6.826e+03 2.903e+04 0.235 0.81419
## mn_sold 3.326e+03 4.198e+03 0.792 0.42856
## sqft_living2 5.301e-02 3.425e-03 15.477 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204300 on 587 degrees of freedom
## Multiple R-squared: 0.8817, Adjusted R-squared: 0.8777
## F-statistic: 218.8 on 20 and 587 DF, p-value: < 2.2e-16
rmse.lm3 <- sqrt(mean((test.set$price - predict(lm.3, test.set))^2))
rmse.lm2
## [1] 326654.2
rmse.lm3
## [1] 197585
mad.lm3 <- mean(abs(test.set$price - predict(lm.3, test.set)))
mad.lm2
## [1] 167670.9
mad.lm3
## [1] 124758.2
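As an aside, the same quadratic model can be fit without creating new columns by wrapping the squared term in I() inside the formula (a sketch; this should reproduce lm.3):

lm.3b <- update(lm.2, . ~ . + I(sqft_living^2))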
Next, fit a regression tree to the training data using rpart:

tree1 <- rpart(price ~ ., data = train.set, method = 'anova')
plot(tree1)
text(tree1)
print(tree1)
## n= 608
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 608 2.071105e+14 624046.4
## 2) sqft_living< 3985 572 6.091421e+13 523777.6
## 4) zipcode=98010,98014,98024,98032,98070,98148 406 1.270083e+13 389711.6
## 8) sqft_above< 1815 265 3.832833e+12 311606.0 *
## 9) sqft_above>=1815 141 4.213020e+12 536505.8 *
## 5) zipcode=98039,98102,98109 166 2.306838e+13 851674.1
## 10) sqft_living< 2585 126 5.463168e+12 699777.8 *
## 11) sqft_living>=2585 40 5.540607e+12 1330148.0 *
## 3) sqft_living>=3985 36 4.907191e+13 2217206.0
## 6) zipcode=98010,98014,98024,98032,98102 17 5.258377e+12 1460553.0 *
## 7) zipcode=98039,98109 19 2.537226e+13 2894211.0 *
To view what output is displayed when printing a tree object in R, type ?print.rpart in the console:

“A semi-graphical layout of the contents of x$frame is printed. Indentation is used to convey the tree topology. Information for each node includes the node number, split, size, deviance, and fitted value. For the ‘class’ method, the class probabilities are also printed.”
rmse.tree1 <- sqrt(mean((test.set$price - predict(tree1, test.set))^2))
rmse.tree1
## [1] 487873.3
rmse.lm3
## [1] 197585
mad.tree1 <- mean(abs(test.set$price - predict(tree1, test.set)))
mad.tree1
## [1] 200319.5
mad.lm3
## [1] 124758.2
The decision tree does quite a bit worse than the polynomial regression. The polynomial regression gave an RMSE of 197,585 and a MAD of 124,758.2, whereas the decision tree produced an RMSE of 487,873.3 and a MAD of 200,319.5.
A random forest improves on a single decision tree by fitting many trees and averaging their predictions. Each time the algorithm grows a tree, it selects a random subset of the predictor variables and a bootstrap sample of the observations to use. It thus uses randomness to produce a different tree each time, then averages the predictions across all trees.
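Note that rf1 below is fit to the full Seattle data set rather than to train.set, so the test houses were already seen during training and the test-set errors that follow are likely optimistic. A fit that respects the train/test split would look like this (a sketch; its error estimates would differ from the output shown below):

rf.train <- randomForest(price ~ ., data = train.set)
# evaluate on the held-out houses, as with the other models:
# sqrt(mean((test.set$price - predict(rf.train, test.set))^2))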
rf1 <- randomForest(price ~ ., data = Seattle)
plot(rf1)
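The randomForest package also reports variable-importance measures, which show which predictors the forest relies on most (a quick sketch; output not shown):

# total decrease in node impurity from splits on each predictor
importance(rf1)
varImpPlot(rf1)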
rmse.rf1 <- sqrt(mean((test.set$price - predict(rf1, test.set))^2))
rmse.rf1
## [1] 128589.3
rmse.lm2
## [1] 326654.2
mad.rf1 <- mean(abs(test.set$price - predict(rf1, test.set)))
mad.rf1
## [1] 51416.4
mad.lm2
## [1] 167670.9
rslts <- tibble(
Method = c("Polynomial regression", "Decision tree", "Random forest"),
RMSE = c(rmse.lm3, rmse.tree1, rmse.rf1),
MAD = c(mad.lm3, mad.tree1, mad.rf1)
)
kable(rslts)
Method | RMSE | MAD |
---|---|---|
Polynomial regression | 197585.0 | 124758.2 |
Decision tree | 487873.3 | 200319.5 |
Random forest | 128589.3 | 51416.4 |
Example adapted from Dr. Hoegh’s STAT 408 notes.