Background

Goal: find the linear combinations of the original variables that provide the best possible separation between the groups.

Load libraries

library(tidyverse)
library(skimr) # skim
library(psych) # pairs.panels
library(MASS)  # lda

Data

data(iris)  # data set built into base R
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃

Type ?iris in the console to view description.

pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)

Create training (70%) and testing (30%) data

set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.7, 0.3))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
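Because each row is assigned by an independent draw with probabilities 0.7 and 0.3, the realized split is only approximately 70/30. It is worth checking the actual sizes (with this seed, 106 rows land in training and 44 in testing):

```r
table(ind)      # rows assigned to each partition
nrow(training)  # 106 with this seed
nrow(testing)   # 44
```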

Linear discriminant analysis

Using . on the right-hand side of the model formula tells R to use all available explanatory variables in the data set.

linear <- lda(Species ~ ., training)
linear
## Call:
## lda(Species ~ ., data = training)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3301887  0.3396226  0.3301887 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.940000    3.405714     1.445714   0.2428571
## versicolor     5.908333    2.766667     4.202778   1.3027778
## virginica      6.634286    2.925714     5.565714   2.0428571
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.5329061  0.2942187
## Sepal.Width   2.1256776  1.9325511
## Petal.Length -1.9624296 -1.1434715
## Petal.Width  -3.5611214  3.0035717
## 
## Proportion of trace:
##   LD1   LD2 
## 0.993 0.007

The “prior probabilities of groups” are just the sample proportions:

training %>% group_by(Species) %>% summarize(sample_proportion = n() / nrow(training))
## # A tibble: 3 × 2
##   Species    sample_proportion
##   <fct>                  <dbl>
## 1 setosa                 0.330
## 2 versicolor             0.340
## 3 virginica              0.330

Group separation

The entries in LD1 and LD2 are the coefficients of the linear combinations of explanatory variables that achieve maximum separation between the groups. The discriminant functions are centered to have mean 0 over the training data and scaled so that the within-group variance is 1.

The proportion of between-group variance (the "proportion of trace" in the output above) captured by the first discriminant function is 99.3%; the second captures the remaining 0.7%.
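These values can be recomputed from the singular values that MASS's lda() stores in the svd component of the fitted object (a quick check using the linear object from above):

```r
# Proportion of trace: squared singular values, normalized to sum to 1
prop_trace <- linear$svd^2 / sum(linear$svd^2)
round(prop_trace, 3)  # 0.993 0.007
```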

We can plot stacked histograms of the values of the first discriminant function for the training data set and visualize the degree of separation.

p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)

The second discriminant function does not separate the groups well.

ldahist(data = p$x[,2], g = training$Species)

To view the separation in two dimensions:

# Create data frame with species and the LD1, LD2 scores for the training data
newdata <- data.frame(Species = training[,5], lda = p$x)
ggplot(newdata) + 
  geom_point(aes(x = lda.LD1, y = lda.LD2, colour = Species), size = 2.5) +
  labs(x = "LD1", y = "LD2")
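Besides the discriminant scores, predict() also returns posterior class probabilities, which can help flag borderline observations (a sketch using the fitted model and testing data from above):

```r
pred <- predict(linear, testing)
head(round(pred$posterior, 3))  # posterior probability of each species
head(pred$class)                # predicted labels (highest posterior wins)
```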

Prediction accuracy

Compare actual group to predicted group on training data:

p1 <- predict(linear, training)$class
tab1 <- table(Predicted = p1, Actual = training$Species)
tab1
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         35          0         0
##   versicolor      0         36         0
##   virginica       0          0        35

All 106 training observations are correctly classified (35 + 36 + 35 on the diagonal), giving a classification error rate of 0.
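Accuracy and error rate can be read directly off the confusion matrix, since correct classifications sit on the diagonal:

```r
# Training accuracy and error rate from the confusion matrix
accuracy <- sum(diag(tab1)) / sum(tab1)
accuracy      # 1 on the training data
1 - accuracy  # error rate: 0
```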

Now for the testing data:

p2 <- predict(linear, testing)$class
tab2 <- table(Predicted = p2, Actual = testing$Species)
tab2

The number correctly classified in the testing data set is 42 out of 44, giving a classification error rate of 2/44 ≈ 0.0454545.
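The same diagonal calculation gives the test-set figures:

```r
# Testing accuracy and error rate from the confusion matrix
sum(diag(tab2)) / sum(tab2)      # 42/44 correct
1 - sum(diag(tab2)) / sum(tab2)  # 2/44, about 0.045
```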

References

Example taken from “Linear Discriminant Analysis in R” by finnstats in R bloggers: https://www.r-bloggers.com/2021/05/linear-discriminant-analysis-in-r/.