Background

Goal: find the linear combinations of the original variables that provide the best possible separation between the groups.

Load libraries

library(tidyverse)
library(skimr) # skim
library(psych) # pairs.panels
library(MASS)  # lda

Data

data(iris)  # data set built into base R
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
skim(iris)
Data summary
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal.Length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
Sepal.Width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
Petal.Length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃

Type ?iris in the console to view description.

pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)

Create training (70%) and testing (30%) data

set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.7, 0.3))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
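Because each row is assigned by an independent draw with probabilities 0.7 and 0.3, the realized split is only approximately 70/30. It is worth checking the actual sizes (with this seed, 106 rows land in training and 44 in testing):

```r
table(ind)      # rows assigned to each partition
nrow(training)  # 106 with this seed
nrow(testing)   # 44
```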

Linear discriminant analysis

Using . on the right-hand side of the model formula tells R to use all available explanatory variables in the data set.

linear <- lda(Species ~ ., training)
linear
## Call:
## lda(Species ~ ., data = training)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3301887  0.3396226  0.3301887 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.940000    3.405714     1.445714   0.2428571
## versicolor     5.908333    2.766667     4.202778   1.3027778
## virginica      6.634286    2.925714     5.565714   2.0428571
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal.Length  0.5329061  0.2942187
## Sepal.Width   2.1256776  1.9325511
## Petal.Length -1.9624296 -1.1434715
## Petal.Width  -3.5611214  3.0035717
## 
## Proportion of trace:
##   LD1   LD2 
## 0.993 0.007

The “prior probabilities of groups” are just the sample proportions:

training %>% group_by(Species) %>% summarize(sample_proportion = n() / nrow(training))
## # A tibble: 3 × 2
##   Species    sample_proportion
##   <fct>                  <dbl>
## 1 setosa                 0.330
## 2 versicolor             0.340
## 3 virginica              0.330

Group separation

The entries in LD1 and LD2 are the coefficients of the linear combinations of explanatory variables that achieve maximum separation between the groups. The discriminant functions are centered to have mean 0 over the training data and scaled so that the within-group variance is 1.

The proportion of between-group variance (the "proportion of trace" in the output above) captured by the first discriminant function is 99.3%; the second captures the remaining 0.7%.
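These values can be recomputed from the singular values that MASS's lda() stores in the svd component of the fitted object (a quick check using the linear object from above):

```r
# Proportion of trace: squared singular values, normalized to sum to 1
prop_trace <- linear$svd^2 / sum(linear$svd^2)
round(prop_trace, 3)  # 0.993 0.007
```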

We can plot stacked histograms of the values of the first discriminant function for the training data set and visualize the degree of separation.

p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)

The second discriminant function does not separate the groups well.

ldahist(data = p$x[,2], g = training$Species)

To view the separation in two dimensions:

# Create data frame with species and the LD1, LD2 scores for the training data
newdata <- data.frame(Species = training[,5], lda = p$x)
ggplot(newdata) + 
  geom_point(aes(x = lda.LD1, y = lda.LD2, colour = Species), size = 2.5) +
  labs(x = "LD1", y = "LD2")
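Besides the discriminant scores, predict() also returns posterior class probabilities, which can help flag borderline observations (a sketch using the fitted model and testing data from above):

```r
pred <- predict(linear, testing)
head(round(pred$posterior, 3))  # posterior probability of each species
head(pred$class)                # predicted labels (highest posterior wins)
```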

Prediction accuracy

Compare actual group to predicted group on training data:

p1 <- predict(linear, training)$class
tab1 <- table(Predicted = p1, Actual = training$Species)
tab1
##             Actual
## Predicted    setosa versicolor virginica
##   setosa         35          0         0
##   versicolor      0         36         0
##   virginica       0          0        35

All 106 training observations are correctly classified (35 + 36 + 35 on the diagonal), giving a classification error rate of 0.
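Accuracy and error rate can be read directly off the confusion matrix, since correct classifications sit on the diagonal:

```r
# Training accuracy and error rate from the confusion matrix
accuracy <- sum(diag(tab1)) / sum(tab1)
accuracy      # 1 on the training data
1 - accuracy  # error rate: 0
```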

Now for the testing data:

p2 <- predict(linear, testing)$class
tab2 <- table(Predicted = p2, Actual = testing$Species)
tab2

The number correctly classified in the testing data set is 42 out of 44, giving a classification error rate of 2/44 ≈ 0.0454545.
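The same diagonal calculation gives the test-set figures:

```r
# Testing accuracy and error rate from the confusion matrix
sum(diag(tab2)) / sum(tab2)      # 42/44 correct
1 - sum(diag(tab2)) / sum(tab2)  # 2/44, about 0.045
```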

References

Example taken from “Linear Discriminant Analysis in R” by finnstats in R bloggers: https://www.r-bloggers.com/2021/05/linear-discriminant-analysis-in-r/.