Clustering methods are a type of unsupervised statistical learning and fall under the umbrella of “multivariate analysis”. In this setting there is no clear outcome variable; instead, we use a set of variables to “cluster” the data into similar groups.
library(mnormt)  # provides rmnorm() for simulating multivariate normal data

# Simulate three bivariate normal clusters
set.seed(10)
cluster1 <- rmnorm(n=25, mean=c(.3,.2), varcov=diag(2)*.025)
cluster2 <- rmnorm(n=25, mean=c(.15,.75), varcov=diag(2)*.025)
cluster3 <- rmnorm(n=50, mean=c(.75,.6), varcov=diag(2)*.03)
combined <- rbind(cluster1, cluster2, cluster3)

# Plot the simulated data, colored by true cluster
plot(combined, type='n', axes=F, xlab='', ylab='')
points(cluster1, pch=16, col='dodgerblue')
points(cluster2, pch=16, col='forestgreen')
points(cluster3, pch=16, col='firebrick4')
Imagine that we don’t know the colors, and we would like to use the x-variable and y-variable to “cluster” the data into a given number of groups. The purpose of simulating data in this case is to be able to see how well each method works by comparing the cluster assignments from the model to the actual clusters we know from the simulation.
K-means clustering is an algorithm that partitions the observations into a pre-specified number of clusters, k. It repeatedly assigns each observation to the cluster with the nearest centroid and then recomputes each centroid as the mean of the observations currently assigned to it, stopping when the assignments no longer change. Type ?kmeans into the console to view the help file.
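To make the idea concrete, here is a minimal sketch of the Lloyd iteration behind k-means. The helper simple_kmeans() is purely illustrative: it is not part of base R, it ignores empty clusters and convergence checks, and it is not a substitute for kmeans().

# Minimal sketch of the k-means (Lloyd) iteration -- illustration only
simple_kmeans <- function(x, k, iters = 10) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random starting centroids
  for (it in 1:iters) {
    # Assignment step: each observation joins the cluster with the nearest centroid
    cl <- apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
    # Update step: each centroid becomes the mean of its assigned observations
    centers <- t(sapply(1:k, function(j) colMeans(x[cl == j, , drop = FALSE])))
  }
  list(cluster = cl, centers = centers)
}

Running simple_kmeans(combined, 3) should recover roughly the same three groups as the kmeans() call below, up to label switching.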
km <- kmeans(combined, 3)
km
## K-means clustering with 3 clusters of sizes 26, 30, 44
##
## Cluster means:
## [,1] [,2]
## 1 0.1329025 0.8018608
## 2 0.2844651 0.1830810
## 3 0.7693055 0.6026478
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 1 3 3
## [75] 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3
##
## Within cluster sum of squares by cluster:
## [1] 1.175133 1.417926 1.915896
## (between_SS / total_SS = 75.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Create plot to compare clusters produced by k-means algorithm
plot(combined, type='n', axes=F, xlab='', ylab='')
box()
# Plot points: color is from simulated cluster (truth) and number
# is assigned cluster.
points(combined, pch = as.character(km$cluster),
       col = c(rep('dodgerblue',25),
               rep('forestgreen',25),
               rep('firebrick4',50)))
The extent of discrimination among the clusters is measured by comparing the “within cluster sum of squares” to the “total sum of squares”. The total sum of squares (\(SST\)) is the sum of squared distances from each observation to the overall mean (or overall centroid) of the points. \(SST\) can be partitioned into a sum of “within” and “between” cluster sums of squares:
\[ SST = SSW + SSB \] where \(SSW\) is the sum of squared distances of each observation to its respective cluster mean, and \(SSB\) is the sum of squared distances of the cluster means to the overall mean. The fraction \(SSB/SST\) (or \(1 - SSW/SST\)) then gives the proportion of variability in the data that is reduced by clustering.
This is analogous to the concept of \(R^2\) in regression, where the total variability in the response variable can be partitioned into the residual sum of squares plus the sum of squared distances from the predicted values to the mean response: \[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 \]
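All three pieces are returned by kmeans(), so the decomposition can be checked directly from the fitted object (a quick check using the km object created above):

# Verify SST = SSW + SSB from the kmeans components
km$totss                          # total sum of squares (SST)
km$tot.withinss + km$betweenss    # SSW + SSB, which should equal SST
km$betweenss / km$totss           # proportion of variability explained (the 75.2% reported above)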
Note that the labels are arbitrary. If you ran the algorithm again, you might get the same clusters with different labels.
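Because the labels are arbitrary, a cross-tabulation of the true (simulated) groups against the assigned clusters is a more reliable check than matching labels directly. The truth vector below simply reconstructs the simulated group sizes:

# Compare true simulated groups to k-means assignments
truth <- rep(c("cluster1", "cluster2", "cluster3"), times = c(25, 25, 50))
table(truth, assigned = km$cluster)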
With this algorithm, you need to specify “k”, the number of clusters. If this isn’t known in advance (or a reasonable guess is not available), researchers typically try several values of “k” and examine a scree plot to determine an appropriate number of clusters for the observed data. The scree plot shows the within-group sum of squares versus the number of clusters.
# Scree plot
## Evaluate within cluster sum of squares for 1 to 15 clusters
wss <- rep(0,15)
for (i in 1:15) {
  wss[i] <- sum(kmeans(combined, centers=i)$withinss)
}
plot(1:15, wss, type="b",
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")
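Because kmeans() starts from randomly chosen centroids, the curve can change slightly from run to run. One common remedy (not used in the code above, but available through the nstart argument of kmeans()) is to fit each value of k from several random starts and keep the best solution:

# A more stable scree plot: keep the best of 20 random starts for each k
wss <- sapply(1:15, function(k) kmeans(combined, centers=k, nstart=20)$tot.withinss)
plot(1:15, wss, type="b",
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")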
Clustering is much harder to visualize when we have more than two variables. In the case of multi-dimensional data, we can use a method such as multidimensional scaling, which is similar to principal components analysis. The goal is to take multi-dimensional data and visualize it in a 2-dimensional space where the distance between points in 2-dimensions represents a kind of “multivariate” distance between the points in the higher dimensional space.
The data set we’ll use in this section is a built-in data set called animals in the cluster library. It records six binary attributes for 20 animals. The original data set contained a few errors, and the code below corrects those errors.
# Load the animals data and give the attributes descriptive names
animals <- cluster::animals
colnames(animals) <- c("warm-blooded",
                       "can fly",
                       "vertebrate",
                       "endangered",
                       "live in groups",
                       "have hair")

# Drop the "live in groups" variable, remove a few animals,
# and fix the values noted as errors in the original data
animals.cluster <- animals[,-(5)]
animals.cluster <- animals.cluster[-c(4,5,12,16,18),]
animals.cluster[10,4] <- 2
animals.cluster[14,4] <- 1
We can use the dist function to calculate Euclidean distances between all pairs of animals in the data set.
d <- dist(animals.cluster)
d
## ant bee cat cow duc eag ele fly
## bee 1.414214
## cat 1.732051 1.732051
## cow 1.732051 1.732051 0.000000
## duc 1.732051 1.732051 1.414214 1.414214
## eag 2.000000 2.000000 1.732051 1.732051 1.000000
## ele 1.732051 2.236068 1.414214 1.414214 1.414214 1.000000
## fly 1.000000 1.000000 2.000000 2.000000 1.414214 1.732051 2.000000
## fro 1.414214 2.000000 1.732051 1.732051 1.732051 1.414214 1.000000 1.732051
## lio 2.000000 2.000000 1.000000 1.000000 1.732051 1.414214 1.000000 2.236068
## liz 1.000000 1.732051 1.414214 1.414214 1.414214 1.732051 1.414214 1.414214
## lob 0.000000 1.414214 1.732051 1.732051 1.732051 2.000000 1.732051 1.000000
## rab 1.732051 1.732051 0.000000 0.000000 1.414214 1.732051 1.414214 2.000000
## spi 1.000000 1.000000 1.414214 1.414214 2.000000 2.236068 2.000000 1.414214
## wha 1.732051 2.236068 1.414214 1.414214 1.414214 1.000000 0.000000 2.000000
## fro lio liz lob rab spi
## bee
## cat
## cow
## duc
## eag
## ele
## fly
## fro
## lio 1.414214
## liz 1.000000 1.732051
## lob 1.414214 2.000000 1.000000
## rab 1.732051 1.000000 1.414214 1.732051
## spi 1.732051 1.732051 1.414214 1.000000 1.414214
## wha 1.000000 1.000000 1.414214 1.732051 1.414214 2.000000
The cmdscale function uses multidimensional scaling (or principal coordinates analysis) on the distance matrix to reduce the dimensionality of the data and visualize the distances in a lower dimension (typically 2).
fit <- cmdscale(d, k = 2)
plot(fit[,1], fit[,2], xlab="", ylab="", main="", type="n",axes=F)
box()
text(fit[,1], fit[,2], labels = row.names(animals.cluster), cex=1)
Then, hclust will perform hierarchical clustering on the distance matrix. This is an agglomerative algorithm that starts with \(n\) clusters (each observation is its own cluster) and then iteratively joins the two most similar clusters until a single cluster remains. The result can be visualized in a dendrogram.
# Linkage determines the distance between clusters:
#   complete (the default): maximum distance between elements of the two clusters
#   single: minimum distance between elements of the two clusters
hc <- hclust(dist(animals.cluster))
plot(hc, hang=-1)
# obtaining cluster assignments
cutree(hc, 4)
## ant bee cat cow duc eag ele fly fro lio liz lob rab spi wha
## 1 1 2 2 3 3 4 1 4 2 4 1 2 1 4
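For comparison (this is an addition to the original analysis), the same distance matrix can be clustered with single linkage, which joins clusters based on their closest pair of points and can produce a noticeably different tree:

# Hierarchical clustering with single linkage, for comparison
hc.single <- hclust(d, method="single")
plot(hc.single, hang=-1)
cutree(hc.single, 4)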