Goal: Find the linear combination of the original variables that provides the best possible separation between groups.
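Formally, LDA finds a weight vector $w$ that maximizes the ratio of between-group to within-group scatter; as a sketch of the standard (Fisher) criterion, with $S_B$ and $S_W$ denoting the between- and within-group scatter matrices:

$$\underset{w}{\arg\max}\; \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w}$$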
library(tidyverse)
library(skimr) # skim
library(psych) # pairs.panels
library(MASS) # lda
data(iris) # data set built into base R
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
skim(iris)
| Data Summary | Values |
|---|---|
| Name | iris |
| Number of rows | 150 |
| Number of columns | 5 |
| Column type frequency: | |
| factor | 1 |
| numeric | 4 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
| Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
| Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
| Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
Type ?iris in the console to view a description of the data set.
pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species], # point fill colour by species
             pch = 21) # filled circles so the bg colours are visible
set.seed(123) # for reproducibility
# Randomly assign each row to group 1 (training, p = 0.7) or group 2 (testing, p = 0.3)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.7, 0.3))
training <- iris[ind == 1, ]
testing <- iris[ind == 2, ]
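Because the assignment is random, the split is only approximately 70/30; with this seed it works out to 106 training and 44 testing rows, which a quick check confirms:
table(ind) # rows assigned to each group
nrow(training) # 106
nrow(testing) # 44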
Using . for the right-hand side of the model formula tells R to use all available explanatory variables in the data set.
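For this data set, the shorthand expands to the following equivalent call (written out only for illustration):
lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data = training)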
linear <- lda(Species ~ ., training)
linear
## Call:
## lda(Species ~ ., data = training)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3301887 0.3396226 0.3301887
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.940000 3.405714 1.445714 0.2428571
## versicolor 5.908333 2.766667 4.202778 1.3027778
## virginica 6.634286 2.925714 5.565714 2.0428571
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.5329061 0.2942187
## Sepal.Width 2.1256776 1.9325511
## Petal.Length -1.9624296 -1.1434715
## Petal.Width -3.5611214 3.0035717
##
## Proportion of trace:
## LD1 LD2
## 0.993 0.007
The “prior probabilities of groups” are just the sample proportions:
training %>% group_by(Species) %>% summarize(sample_proportion = n() / nrow(training))
## # A tibble: 3 × 2
## Species sample_proportion
## <fct> <dbl>
## 1 setosa 0.330
## 2 versicolor 0.340
## 3 virginica 0.330
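If the sample proportions are not appropriate for the application, they can be overridden with lda's prior argument; for example, equal priors (a sketch, not part of the original analysis):
# Refit with equal prior probabilities instead of the sample proportions
lda(Species ~ ., data = training, prior = c(1, 1, 1) / 3)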
The entries in the LD1 and LD2 columns are the coefficients of the linear combinations of explanatory variables that achieve the maximum separation of the data between groups. The discriminant scores are scaled to have mean 0 and unit pooled within-group variance in the training data set.
The “proportion of trace” reports the share of the between-group separation achieved by each discriminant function: 99.3% for the first and only 0.7% for the second.
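These proportions can be recovered from the singular values stored in the fitted object, which give the ratios of between- to within-group standard deviations on the discriminant variables:
prop.table(linear$svd^2) # approximately 0.993 and 0.007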
We can plot stacked histograms of the values of the first discriminant function for the training data set and visualize the degree of separation.
p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)
The second discriminant function does not separate the groups well.
ldahist(data = p$x[,2], g = training$Species)
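The scaling convention noted above can be checked directly on the training scores (a quick sketch; the within-group variance is pooled across the three species):
colMeans(p$x) # overall means: approximately 0
# Pooled within-group variance of each discriminant: approximately 1
apply(p$x, 2, function(s) {
  dev <- s - ave(s, training$Species) # deviations from the group means
  sum(dev^2) / (length(s) - nlevels(training$Species))
})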
To view the separation in two dimensions:
# Create data frame with species and the LD1, LD2 scores in the training data set
newdata <- data.frame(Species = training[, 5], lda = p$x)
ggplot(newdata) +
  geom_point(aes(x = lda.LD1, y = lda.LD2, colour = Species), size = 2.5) +
  labs(x = "LD1", y = "LD2")
Compare actual group to predicted group on training data:
p1 <- predict(linear, training)$class
tab1 <- table(Predicted = p1, Actual = training$Species)
tab1
## Actual
## Predicted setosa versicolor virginica
## setosa 35 0 0
## versicolor 0 36 0
## virginica 0 0 35
All 106 training observations (35 + 36 + 35) are correctly classified, giving a training classification error rate of 0.
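The diagonal of the table counts the correct classifications, so the accuracy and error rate can be computed directly:
sum(diag(tab1)) / sum(tab1) # training accuracy: 106/106 = 1
1 - sum(diag(tab1)) / sum(tab1) # training error rate: 0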
Now for the testing data:
p2 <- predict(linear, testing)$class
tab2 <- table(Predicted = p2, Actual = testing$Species)
In the testing data set, 42 of the 44 observations are correctly classified, which gives a classification error rate of 2/44 ≈ 0.0454545.
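The same computation on tab2 reproduces these figures:
sum(diag(tab2)) / sum(tab2) # testing accuracy: 42/44
1 - sum(diag(tab2)) / sum(tab2) # testing error rate: approximately 0.0454545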
Example taken from “Linear Discriminant Analysis in R” by finnstats in R bloggers: https://www.r-bloggers.com/2021/05/linear-discriminant-analysis-in-r/.