require(clusterpval)
require(fastcluster)
Hierarchical Clustering Tutorial
In this tutorial, we demonstrate how to use the clusterpval
package to compute p-values for a difference in means between two clusters obtained by applying hierarchical clustering (with squared Euclidean distance) to a data set.
First, load the package:
We also load the fastcluster
package, which is a highly computationally efficient drop-in replacement for the hclust
function in the stats
package. This IS NOT optional if you are using complete linkage hierarchical clustering, but IS optional if you are using any other linkage (e.g. average).
Plotting and clustering data
We will illustrate the software on a subset of the Palmer penguins data, which contains data on three species of penguins: Adelie, Chinstrap, and Gentoo.
require(palmerpenguins)
require(ggplot2)
options(ggplot2.discrete.colour=list(RColorBrewer::brewer.pal(6, "Dark2")[c(6, 1, 5, 4, 3, 2)]))
<- penguins[complete.cases(penguins), ]
dat <- dat[dat$sex == "female", c(1, 3, 5)]
dat
ggplot(dat) + geom_point(aes(x=flipper_length_mm, y = bill_length_mm,
shape=as.factor(species)), size = 3, fill="grey", colour="black") +
scale_shape_manual(name="Species", values=c(21, 24, 22)) +
ylab("Bill length (mm)") + xlab("Flipper length (mm)") + coord_fixed() +
theme_bw(base_size=22) + ggtitle("Penguins") + theme(legend.position="right")
Let’s cluster the data using average linkage hierarchical clustering with squared Euclidean distance. We plot the dendrogram, and cut the dendrogram to get six clusters.
<- as.matrix(dat[, -c(1)]) # remove species and convert to matrix
X <- hclust(dist(X, method="euclidean")^2, method="average")
hcl plot(as.dendrogram(hcl), leaflab="none")
abline(h=(hcl$height[nrow(X) - 6] + 50), lty="dashed", col="darkgrey")
Now let’s pick pairs of clusters to test. We “name” the six clusters according to the output of the cutree
function, which is not always the same as left-to-right ordering in the dendrogram. To figure out what each cluster in the dendrogram is called, we can use the rect_hier_clusters
function to put coloured rectangles around the six clusters.
plot(as.dendrogram(hcl), leaflab="none")
abline(h=(hcl$height[nrow(X) - 6] + 50), lty="dashed", col="darkgrey")
rect_hier_clusters(hcl, k=6, which=1:6, border=RColorBrewer::brewer.pal(6, "Dark2")[c(6, 1, 5, 4, 3, 2)])
So visually, if we would like to test for a difference in means between the blue and green clusters, we would set k1
to be 4 and k2
to be 5. (You can see this from the order of the colors in the “border” argument above.)
table(dat$species, cutree(hcl,k=6))
1 2 3 4 5 6
Adelie 60 12 1 0 0 0
Chinstrap 4 2 0 0 27 1
Gentoo 0 0 0 57 1 0
ggplot(dat) + geom_point(aes(x=flipper_length_mm, y = bill_length_mm,
shape=as.factor(species), fill=as.factor(cutree(hcl, 6))), size = 3, colour="black") + scale_fill_discrete(name="Clusters", guide=guide_legend(ncol=2, override.aes=list(shape=21))) +
scale_shape_manual(name="Species", values=c(21, 24, 22), guide=guide_legend(override.aes=list(fill="black"))) +
ylab("Bill length (mm)") + xlab("Flipper length (mm)") + coord_fixed() +
theme_bw(base_size=22) + ggtitle("Penguins") + theme(legend.position="right")
You can see that Clusters 1 and 2 both mostly contain Adelie penguins, Cluster 4 mostly contains Gentoo Penguins, and Cluster 5 mostly contains Chinstrap penguins. Clusters 3 and 6 contain one penguin each.
Testing for a difference in means between clusters
We’ll test for a difference in means between Clusters 1 and 2 (both containing Adelie penguins), and between Clusters 4 and 5 (containing mostly Gentoo and Chinstrap penguins, respectively) using the test_hier_clusters_exact
function. By default, this function plugs in a simple estimate of \(\sigma^2\) given by \(\sum \limits_{i=1}^n \sum \limits_{j=1}^p (x_{ij} - \bar{x}_j)^2/(np - p)\), where \(\bar{x}_j\) is the mean of the \(j\)th feature. Note that if there really are clusters in the data, then this estimate will be larger than it should be, but if there really are no clusters in the data, then this estimate will be unbiased and consistent.
test_hier_clusters_exact(X, link="average", K=6, k1=1, k2=2, hcl=hcl)
$stat
[1] 10.41961
$pval
[1] 0.8667368
$trunc
Object of class Intervals
2 intervals over R:
(10.3122059802435, 16.3904453676166)
(131.758884280403, Inf)
test_hier_clusters_exact(X, link="average", K=6, k1=4, k2=5, hcl=hcl)
$stat
[1] 18.86523
$pval
[1] 0.0004528178
$trunc
Object of class Intervals
2 intervals over R:
(16.8300111004978, 21.6868651391918)
(63.4548051430845, Inf)
We now have the test statistic, exact p-value, and conditioning set \(\mathcal{S}\). We get a small p-value when testing for a difference in means between clusters containing different species, and a large p-value when testing for a difference in means between clusters containing the same species.
Now, let’s try complete linkage hierarchical clustering. We plot the dendrogram, and cut the dendrogram to get three clusters.
<- hclust(dist(X, method="euclidean")^2, method="complete")
hcl plot(as.dendrogram(hcl), leaflab="none")
abline(h=(hcl$height[nrow(X) - 3] + 200), lty="dashed", col="darkgrey")
rect_hier_clusters(hcl, k=3, which=1:3, border=c("orange", "blue", "green"))
table(dat$species, cutree(hcl, 3))
1 2 3
Adelie 67 6 0
Chinstrap 20 14 0
Gentoo 0 1 57
Observe that Clusters 1 and 2 (which are orange and blue in the dendrogram above) are both mixes of Adelie and Chinstrap penguins, and Cluster 3 (which is green in the dendrogram above) is largely Gentoo penguins. We’ll test for a difference in means between Clusters 1 and 3 using the test_complete_hier_clusters_approx
function. This approximately computes a p-value for the difference in means using Monte Carlo sampling. By default, this function also plugs in the simple estimate of \(\sigma^2\) given by \(\sum \limits_{i=1}^n \sum \limits_{j=1}^p (x_{ij} - \bar{x}_j)^2/(np-p)\).
set.seed(123)
test_complete_hier_clusters_approx(X, K=3, k1=2, k2=3, ndraws=10000, hcl=hcl)
$stat
[1] 15.36703
$pval
[1] 0.004396524
$stderr
[1] 0.0004490352
In the results above, the estimated p-value comes from Monte Carlo, which means that it is subject to Monte Carlo sampling error. Thus, we also report a standard error estimate for the p-value, that captures the uncertainty due to Monte Carlo sampling error. If more precision is desired, you could adjust the number of Monte Carlo samples using the ndraws
argument of test_complete_hier_clusters_approx
.
Observe that the estimated p-value for a difference in means between Cluster 1 and Cluster 3 is small - this is good, because the penguin species are different in the two clusters.
© 2020 Lucy L. Gao (lucylgao at uwaterloo dot ca)