Hierarchical Cluster Analysis (HCA)
Introduction
HCA is an unsupervised clustering approach based mainly on the distances between the measurements. It is an agglomerative approach: it starts with each individual measurement as its own cluster and then groups them step by step until a single cluster containing all the measurements is formed.
How?
The approach taken in HCA is very simple from a programming point of view. The algorithm starts by treating every individual measurement as its own cluster. It then calculates the pairwise distances between the measurements, and the two measurements with the smallest distance are grouped together to form the first agglomerative cluster. In the next iteration, the newly generated cluster is represented in the distance calculations by, for example, its mean (or by the minimum or maximum distance between its members and the other points). It should be noted that there are several ways to calculate the distance between two measurements (e.g. the Euclidean distance and the Mahalanobis distance). For simplicity, we are only going to look at the "Euclidean distance" here.
In a one-dimensional space, the distance between points $x_{n}$ and $x_{m}$ is calculated by subtracting the two points from each other (only the magnitude of this difference matters, as we will see below).
\[ d_{m,n} = x_{n} - x_{m} \]
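As a small numerical illustration (the two points below are arbitrary and not part of the dataset used later), this distance can be computed directly:
using ACS
x_m = 1.5 # a hypothetical point m
x_n = 4.0 # a hypothetical point n
d = abs(x_n - x_m) # the magnitude of the difference, here 2.5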
Assume the dataset below, with the vectors X and Y as the coordinates.
using ACS
# Generating the data
cx1 = 1 .+ (2-1) .* rand(5) # 5 random values between 1 and 2
c1 = 5 .* rand(5) # 5 random values between 0 and 5
cx2 = 4 .+ (6-4) .* rand(5) # 5 random values between 4 and 6
c2 = 10 .* rand(5) # 5 random values between 0 and 10
Y = vcat(c1[:],c2[:]) # building the vector Y
X = vcat(cx1[:],cx2[:]) # building the vector X
# Plotting the data
scatter(cx1,c1,label = "Cluster 1")
scatter!(cx2,c2,label = "Cluster 2")
xlabel!("X")
ylabel!("Y")
We first plot the data in one dimension. In other words, we set the y values to zero in our data matrix. Below you can see how this is done.
using ACS
# Plotting the data
scatter(X,zeros(length(X[:])),label = "Data")
xlabel!("X")
ylabel!("Y")
The next step is to calculate the distances in the x domain. For that we use the Euclidean distance. Here we need to calculate the distance between each point in X and all the other values in the same matrix. This implies that we will end up with a square distance matrix (i.e. dist). The dist matrix has a zero diagonal, given that the values on the diagonal represent the distance between each point and itself. It is also important to note that we are interested only in the magnitude of the distance and not its direction. Thus, you can use abs.(-) to convert all the distances to their absolute values. Below you can see an example of a function for these calculations.
using ACS
function dist_calc(data)
    dist = zeros(size(data,1),size(data,1)) # A square distance matrix is initiated
    for i = 1:size(data,1)-1 # The nested loops visit every unique pair of points (j > i)
        for j = i+1:size(data,1)
            dist[j,i] = data[j,1] - data[i,1] # The difference between the two points fills the lower triangle
        end
    end
    dist += dist' # The upper triangle is filled by symmetry
    return abs.(dist) # Make sure the order of subtraction does not affect the distances
end
dist = dist_calc(X)
10×10 Matrix{Float64}:
0.0 0.329446 0.158311 0.152547 … 2.57972 2.28923 2.52946
0.329446 0.0 0.487757 0.176899 2.90917 2.61868 2.8589
0.158311 0.487757 0.0 0.310859 2.42141 2.13092 2.37115
0.152547 0.176899 0.310859 0.0 2.73227 2.44178 2.68201
0.134164 0.46361 0.0241477 0.286711 2.44556 2.15507 2.3953
2.533 2.86244 2.37469 2.68554 … 0.0467274 0.243767 0.00353817
2.57521 2.90466 2.4169 2.72776 0.00451081 0.285984 0.0457547
2.57972 2.90917 2.42141 2.73227 0.0 0.290495 0.0502655
2.28923 2.61868 2.13092 2.44178 0.290495 0.0 0.240229
2.52946 2.8589 2.37115 2.68201 0.0502655 0.240229 0.0
In the next step we need to find the points in X that have the smallest distance, which should be grouped together as the first cluster. To do so we use the dist matrix. However, as you can see in the dist matrix, the smallest values are zero and are found on the diagonal. A way to deal with this is to set the diagonal to, for example, Inf.
using ACS
#dist = dist_calc(X)
dist[diagind(dist)] .= Inf # Set the diagonal to inf, which is very helpful when searching for minimum distance
dist
10×10 Matrix{Float64}:
Inf 0.329446 0.158311 … 2.57972 2.28923 2.52946
0.329446 Inf 0.487757 2.90917 2.61868 2.8589
0.158311 0.487757 Inf 2.42141 2.13092 2.37115
0.152547 0.176899 0.310859 2.73227 2.44178 2.68201
0.134164 0.46361 0.0241477 2.44556 2.15507 2.3953
2.533 2.86244 2.37469 … 0.0467274 0.243767 0.00353817
2.57521 2.90466 2.4169 0.00451081 0.285984 0.0457547
2.57972 2.90917 2.42141 Inf 0.290495 0.0502655
2.28923 2.61868 2.13092 0.290495 Inf 0.240229
2.52946 2.8589 2.37115 0.0502655 0.240229 Inf
Now that we have the complete distance matrix, we can use the function argmin(-) to find the coordinates of the points with the minimum distance in X.
using ACS
cl_temp = argmin(dist)
CartesianIndex(10, 6)
The selected points in X are the closest to each other, indicating that they should be grouped into one cluster. In the next step, we will treat this cluster as a single point and thus we can repeat the distance calculations. For simplicity, we assume that the average of the two points is representative of that cluster. This step is called linkage and can be done using different approaches; using the mean, in particular, is called centroid linkage. With the centroid method, we replace these points with their average. This process can be repeated until all data points are grouped into one single cluster.
using ACS
X1 = deepcopy(X)
X1[cl_temp[1]] = mean([X[cl_temp[1]],X[cl_temp[2]]])
X1[cl_temp[2]] = mean([X[cl_temp[1]],X[cl_temp[2]]])
[X,X1]
2-element Vector{Vector{Float64}}:
[1.7363278739645736, 1.4068819067912344, 1.8946393154449965, 1.5837807991166608, 1.8704916536347878, 4.26932490276195, 4.311541464330659, 4.31605227652931, 4.025557550187984, 4.265786733237093]
[1.7363278739645736, 1.4068819067912344, 1.8946393154449965, 1.5837807991166608, 1.8704916536347878, 4.267555817999521, 4.311541464330659, 4.31605227652931, 4.025557550187984, 4.267555817999521]
using ACS
scatter(X,zeros(length(X[:])),label = "Original data")
scatter!(X1,0.01 .* ones(length(X1[:])),label = "Data after clustering",fillalpha = 0.5)
ylims!(-0.01,0.1)
xlabel!("X")
ylabel!("Y")
So far, we have done all our calculations based on one-dimensional data. Now we can move towards two or more dimensions. One of the main things to consider when increasing the number of dimensions is the distance calculation. In the above examples, we have used the Euclidean distance, which is one of many distance metrics. In general terms the Euclidean distance can be expressed as below, where $d_{m,n}$ represents the distance between points m and n. This is based on the Pythagorean theorem.
\[d_{m,n} = \sqrt{\sum{(x_{m} - x_{n})^{2}}}. \]
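As a quick numerical check of this formula (again with two arbitrary, made-up points), the distance between two two-dimensional points can be computed directly:
using ACS
p_m = [1.0, 2.0] # a hypothetical point m with coordinates (x, y)
p_n = [4.0, 6.0] # a hypothetical point n with coordinates (x, y)
d = sqrt(sum((p_m .- p_n).^2)) # sqrt(3^2 + 4^2) = 5.0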
Let's try to move to a two-dimensional space rather than a one-dimensional one.
using ACS
# Plotting the data
scatter(cx1,c1,label = "Cluster 1")
scatter!(cx2,c2,label = "Cluster 2")
xlabel!("X")
ylabel!("Y")
The very first step is to convert our 1D distances to 2D ones, using the below equation. If we replace dist[j,i] = data[j,1] - data[i,1] with the below equation in our distance function, we will be able to generate the distance matrix for our two-dimensional dataset.
\[d_{m,n} = \sqrt{(x_{m} - x_{n})^{2} + (y_{m} - y_{n})^{2}}. \]
using ACS
function dist_calc(data)
    dist = zeros(size(data,1),size(data,1)) # A square distance matrix is initiated
    for i = 1:size(data,1)-1 # The nested loops visit every unique pair of points (j > i)
        for j = i+1:size(data,1)
            dist[j,i] = sqrt(sum((data[i,:] .- data[j,:]).^2)) # Euclidean distance over all dimensions
        end
    end
    dist += dist' # The upper triangle is filled by symmetry
    return abs.(dist) # abs.(-) is redundant here (sqrt is non-negative) but kept for consistency with the 1D version
end
data = hcat(X,Y) # To generate the DATA matrix
dist = dist_calc(data) # Calculating the distance matrix
dist[diagind(dist)] .= Inf # Set the diagonal to inf, which is very helpful when searching for minimum distance
dist
10×10 Matrix{Float64}:
Inf 3.36492 1.78709 0.193572 … 4.03866 2.6587 2.83583
3.36492 Inf 1.64277 3.47242 2.91917 5.381 3.52766
1.78709 1.64277 Inf 1.9245 2.76134 3.78828 2.42287
0.193572 3.47242 1.9245 Inf 4.22799 2.73539 3.026
1.17217 2.23294 0.616071 1.31526 3.12341 3.3132 2.39818
3.41795 3.05029 2.42984 3.61104 … 0.813877 3.65505 1.01275
3.14694 3.28766 2.41707 3.34028 1.29865 3.17372 0.528621
4.03866 2.91917 2.76134 4.22799 Inf 4.4689 1.82597
2.6587 5.381 3.78828 2.73539 4.4689 Inf 2.6451
2.83583 3.52766 2.42287 3.026 1.82597 2.6451 Inf
using ACS
cl_temp = argmin(dist)
data1 = deepcopy(data)
data1[cl_temp[1],1] = mean([data[cl_temp[1],1],data[cl_temp[2],1]])
data1[cl_temp[1],2] = mean([data[cl_temp[1],2],data[cl_temp[2],2]])
data1[cl_temp[2],1] = mean([data[cl_temp[1],1],data[cl_temp[2],1]])
data1[cl_temp[2],2] = mean([data[cl_temp[1],2],data[cl_temp[2],2]])
data1
10×2 Matrix{Float64}:
1.66005 4.64974
1.40688 1.24141
1.89464 2.81009
1.66005 4.64974
1.87049 3.42569
4.26932 2.29531
4.31154 2.78142
4.31605 1.48278
4.02556 5.94223
4.26579 3.30806
using ACS
scatter(data[:,1],data[:,2],label = "Original data")
scatter!(data1[:,1],data1[:,2],label = "Data after clustering")
xlabel!("X")
ylabel!("Y")
As can be seen in the figure, there are two blue points and one red point in the middle of those points. These blue dots represent the two closest data points, which are clustered together to form the centroid between them. If we repeat this process multiple times, we eventually end up with all data points in one large cluster. The HCA clustering generates an array of clustered data points that can be visualized via a dendrogram or a heatmap.
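To make the repetition explicit, below is a minimal sketch of this idea; the function name simple_hca is made up for this illustration (it is not part of ACS.jl) and it simply reuses our dist_calc function, merging the two closest points into their centroid until only one cluster is left.
using ACS
function simple_hca(data)
    data = deepcopy(data) # work on a copy so the original data stays untouched
    n_merges = 0
    while size(data,1) > 1
        d = dist_calc(data) # pairwise Euclidean distances
        d[diagind(d)] .= Inf # ignore the self-distances on the diagonal
        i, j = Tuple(argmin(d)) # indices of the two closest points/clusters
        data[i,:] = (data[i,:] .+ data[j,:]) ./ 2 # replace the pair by its centroid
        data = data[setdiff(1:size(data,1), j), :] # drop the merged point
        n_merges += 1
    end
    return n_merges # with 10 points this results in 9 merges
end
simple_hca(hcat(X,Y))
The built-in function hclust(-) performs the same agglomeration while keeping track of all the merges, so its output can be plotted directly as a dendrogram.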
using ACS
dist = dist_calc(data)
hc = hclust(dist, linkage=:average)
sp.plot(hc)
Practical Application
We can either use our home-developed function for HCA or use the built-in Julia functions. Here we will provide an easy tutorial on how to use the Julia functions that are built into the ACS.jl package.
For calculating the distances, the function pairwise(-) from the Julia package Distances.jl can be used. The function pairwise(-) has three inputs, namely: 1) the distance metric, 2) the data, and 3) the dimension along which the observations are stored (dims). This function outputs a square matrix similar to our distance matrix. As can be seen from the distance matrix, our function and pairwise(-) generate the same results, which is expected. The function pairwise(-) gives access to a wide variety of distance metrics that can be used for your projects.
using ACS
dist1 = pairwise(ACS.Euclidean(), data, dims=1) # Euclidean distance
# dist1 = pairwise(ACS.TotalVariation(), data, dims=1) # TotalVariation distance
10×10 Matrix{Float64}:
0.0 3.36492 1.78709 0.193572 … 4.03866 2.6587 2.83583
3.36492 0.0 1.64277 3.47242 2.91917 5.381 3.52766
1.78709 1.64277 0.0 1.9245 2.76134 3.78828 2.42287
0.193572 3.47242 1.9245 0.0 4.22799 2.73539 3.026
1.17217 2.23294 0.616071 1.31526 3.12341 3.3132 2.39818
3.41795 3.05029 2.42984 3.61104 … 0.813877 3.65505 1.01275
3.14694 3.28766 2.41707 3.34028 1.29865 3.17372 0.528621
4.03866 2.91917 2.76134 4.22799 0.0 4.4689 1.82597
2.6587 5.381 3.78828 2.73539 4.4689 0.0 2.6451
2.83583 3.52766 2.42287 3.026 1.82597 2.6451 0.0
For the HCA itself, you can use the function hclust(-), incorporated in the ACS.jl package and provided via the Clustering.jl package. This function takes two inputs: the distance matrix and the linkage method. The output of this function is a structure with four fields, of which the two most important are merges and order. The combination of all four fields can be plotted via the sp.plot(-) function.
using ACS
h = hclust(dist1,:average) # Average linkage
Clustering.Hclust{Float64}([-1 -4; -6 -7; … ; 5 7; -9 8], [0.19357234877558568, 0.487935092734684, 0.616070565256813, 0.770684902905687, 1.3128308887362572, 1.5497567121970626, 2.6782606580142105, 3.0579322451642765, 3.5354810610149645], [9, 8, 10, 6, 7, 2, 1, 4, 3, 5], :average)
To access the outputs, one can do the following:
using ACS
h.order
10-element Vector{Int64}:
9
8
10
6
7
2
1
4
3
5
and to plot the outputs, we can use the function below.
using ACS
sp.plot(h)
There are also Python implementations of HCA that you can explore for your analysis.
Additional Example
If you are interested in practicing more, you can use the mtcars dataset via RDatasets, also provided in the dataset folder of the ACS.jl GitHub repository. Please note that you must exclude the car origin column. The objective here is to see whether HCA is able to cluster cars with similar origins.
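A minimal sketch of how such an analysis could look is given below; it assumes the measurements have already been loaded into a numeric matrix called cars (rows are cars, columns are the measured variables, origin excluded), so the variable names are hypothetical.
using ACS
# cars is assumed to be a numeric matrix of the mtcars measurements (origin column excluded)
dist_cars = pairwise(ACS.Euclidean(), cars, dims=1) # pairwise Euclidean distances between the cars
h_cars = hclust(dist_cars, linkage=:average) # hierarchical clustering with average linkage
sp.plot(h_cars) # dendrogram of the clustered cars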
If you are interested in additional resources regarding HCA and would like to know more you can check this MIT course material.