Hierarchical Cluster Analysis (HCA)

Introduction

HCA is an unsupervised clustering approach based mainly on the distances between the measurements. It is an agglomerative approach: it starts with each individual measurement as its own cluster and then successively groups them until a single cluster containing all the measurements is formed.

How?

The approach taken in HCA is very simple from a programming point of view. The algorithm starts by treating every individual measurement as its own cluster. It then calculates the pairwise distances between the measurements, and the two measurements with the smallest distance are grouped together to form the first agglomerative cluster. In the following iterations, the newly generated cluster is represented by its mean, its minimum, or its maximum for the distance calculations. It should be noted that there are several ways to calculate the distance between two measurements (e.g. the Euclidean distance and the Mahalanobis distance). For simplicity, we will only look at the Euclidean distance here.

In a one-dimensional space, the distance between the points $x_{n}$ and $x_{m}$ is calculated by subtracting one point from the other.

\[ d_{m,n} = x_{n} - x_{m} \]
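As a quick check of this formula, the snippet below computes the 1D distance between two made-up points; taking the absolute value (as we will do later) makes the result independent of the order of subtraction.

using ACS

x_m = 1.2                    # a hypothetical point
x_n = 3.5                    # another hypothetical point

d = x_n - x_m                # signed distance: 2.3
abs(x_m - x_n) == abs(d)     # true: the magnitude does not depend on the subtraction order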

Let's assume the dataset below, with the vectors X and Y as the coordinates.

using ACS

# Generating the data

cx1 = 1 .+ (2-1) .* rand(5) # 5 random x values between 1 and 2
c1 = 5 .* rand(5)           # 5 random y values between 0 and 5
cx2 = 4 .+ (6-4) .* rand(5) # 5 random x values between 4 and 6
c2 = 10 .* rand(5)          # 5 random y values between 0 and 10

Y = vcat(c1[:],c2[:])       # building the vector Y (the y coordinates)
X = vcat(cx1[:],cx2[:])     # building the vector X (the x coordinates)

# Plotting the data
scatter(cx1,c1,label = "Cluster 1")
scatter!(cx2,c2,label = "Cluster 2")
xlabel!("X")
ylabel!("Y")
[Figure: scatter plot of the two simulated clusters in the X-Y plane]

We first plot the data in one dimension. In other words, we set the y values to zero in our data matrix. Below you can see how this is done.

using ACS

# Plotting the data

scatter(X,zeros(length(X[:])),label = "Data")

xlabel!("X")
ylabel!("Y")
[Figure: the data plotted along the X axis only (all y values set to zero)]

The next step is to calculate the distances in the x domain. For that we use the Euclidean distance. We need to calculate the distance between each point in X and all the other points in the same matrix, which implies that we end up with a square distance matrix (i.e. dist). The dist matrix has a zero diagonal, given that the values on the diagonal represent the distance between each point and itself. It is also important to note that we are interested only in the magnitude of the distance, not its direction. Thus, we can use abs.(-) to convert all the distances to their absolute values. Below you can see an example of a function for these calculations.

using ACS

function dist_calc(data)

    dist = zeros(size(data,1),size(data,1))     # Initiate a square distance matrix
    for i = 1:size(data,1)-1                    # Loop over all unique pairs of points (lower triangle)
        for j = i+1:size(data,1)
            dist[j,i] = data[j,1] - data[i,1]   # Signed 1D distance between points j and i
        end
    end

    dist += dist'                               # Fill the upper triangle by symmetry
    return abs.(dist)                           # Keep only the magnitude: the order of subtraction does not matter

end

dist = dist_calc(X)
10×10 Matrix{Float64}:
 0.0       0.451312  0.834647   0.875858   …  3.03222   4.17414    3.56899
 0.451312  0.0       0.383335   0.424546      2.58091   3.72283    3.11767
 0.834647  0.383335  0.0        0.0412103     2.19757   3.33949    2.73434
 0.875858  0.424546  0.0412103  0.0           2.15636   3.29828    2.69313
 0.156566  0.294746  0.678081   0.719291      2.87565   4.01757    3.41242
 4.85493   4.40362   4.02028    3.97907    …  1.82271   0.680792   1.28594
 4.26558   3.81426   3.43093    3.38972       1.23336   0.0914396  0.69659
 3.03222   2.58091   2.19757    2.15636       0.0       1.14192    0.536769
 4.17414   3.72283   3.33949    3.29828       1.14192   0.0        0.605151
 3.56899   3.11767   2.73434    2.69313       0.536769  0.605151   0.0
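Before moving on, we can quickly verify the two properties mentioned above, namely that the distance matrix is symmetric and that its diagonal is zero. This small sketch assumes that issymmetric(-) and diag(-) from the LinearAlgebra standard library are available (e.g. re-exported by ACS); otherwise load LinearAlgebra explicitly.

using ACS
# using LinearAlgebra            # uncomment if issymmetric/diag are not available via ACS

issymmetric(dist)                # true: the distance from i to j equals the distance from j to i
all(diag(dist) .== 0)            # true: each point has zero distance to itself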

In the next step we need to find the points in X that have the smallest distance, as these should be grouped together as the first cluster. To do so we need to use the dist matrix. However, as you can see, the smallest values in the dist matrix are the zeros on the diagonal. A way to deal with this is to set the diagonal to, for example, Inf.

using ACS

#dist = dist_calc(X)
dist[diagind(dist)] .= Inf                   # Set the diagonal to inf, which is very helpful when searching for minimum distance
dist
10×10 Matrix{Float64}:
 Inf         0.451312   0.834647   …   3.03222    4.17414     3.56899
  0.451312  Inf         0.383335       2.58091    3.72283     3.11767
  0.834647   0.383335  Inf             2.19757    3.33949     2.73434
  0.875858   0.424546   0.0412103      2.15636    3.29828     2.69313
  0.156566   0.294746   0.678081       2.87565    4.01757     3.41242
  4.85493    4.40362    4.02028    …   1.82271    0.680792    1.28594
  4.26558    3.81426    3.43093        1.23336    0.0914396   0.69659
  3.03222    2.58091    2.19757       Inf         1.14192     0.536769
  4.17414    3.72283    3.33949        1.14192   Inf          0.605151
  3.56899    3.11767    2.73434        0.536769   0.605151   Inf

Now that we have the complete distance matrix, we can use the function argmin(-) to find the coordinates of the two points in X with the minimum distance.

using ACS

cl_temp = argmin(dist)
CartesianIndex(4, 3)

The selected points in X are the closest to each other, indicating that they should be grouped into one cluster. In the next step, we treat this cluster as a single point so that we can repeat the distance calculations. For simplicity, we assume that the average of the two points is representative of that cluster. The way a merged cluster is represented for the next round of distance calculations is called linkage and can be done using different approaches; using the mean, in particular, is called centroid linkage. With the centroid method, we replace these points with their average. This process can be repeated until all data points are grouped into one single cluster.

using ACS

X1 = deepcopy(X)
X1[cl_temp[1]] = mean([X[cl_temp[1]],X[cl_temp[2]]])
X1[cl_temp[2]] = mean([X[cl_temp[1]],X[cl_temp[2]]])

[X,X1]
2-element Vector{Vector{Float64}}:
 [1.0837659886864623, 1.5350780489285039, 1.9184134671253052, 1.9596238106006887, 1.2403324880031188, 5.938694674120205, 5.349342716576448, 4.115983565782216, 5.257903104089652, 4.652752589357731]
 [1.0837659886864623, 1.5350780489285039, 1.939018638862997, 1.939018638862997, 1.2403324880031188, 5.938694674120205, 5.349342716576448, 4.115983565782216, 5.257903104089652, 4.652752589357731]
using ACS


scatter(X,zeros(length(X[:])),label = "Original data")
scatter!(X1,0.01 .* ones(length(X1[:])),label = "Data after clustering",fillalpha = 0.5)
ylims!(-0.01,0.1)
xlabel!("X")
ylabel!("Y")
[Figure: the original 1D data and the data after the first merge, plotted along the X axis]

So far, we have done all our calculations based on one-dimensional data. Now we can move towards two or more dimensions. One of the main things to consider when increasing the number of dimensions is the distance calculation. In the above examples, we have used the Euclidean distance, which is one of many distance metrics. In general terms the Euclidean distance can be expressed as below, where $d_{m,n}$ represents the distance between the points m and n. This is based on the Pythagorean theorem.

\[d_{m,n} = \sqrt{\sum{(x_{m} - x_{n})^{2}}}. \]
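For two measurement vectors of arbitrary dimension, this general form translates directly into Julia. The two vectors below are made up for illustration; the result should also match norm(m .- n) from the LinearAlgebra standard library, if that is available in your environment.

using ACS

m = [1.0, 2.0, 3.0]               # a hypothetical measurement
n = [2.0, 0.0, 3.5]               # another hypothetical measurement

d_mn = sqrt(sum((m .- n) .^ 2))   # general Euclidean distance between m and n
# d_mn ≈ norm(m .- n)             # equivalent, if LinearAlgebra's norm is available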

Let's try to move to a two-dimensional space rather than a one-dimensional one.

using ACS

# Plotting the data
scatter(cx1,c1,label = "Cluster 1")
scatter!(cx2,c2,label = "Cluster 2")
xlabel!("X")
ylabel!("Y")
[Figure: scatter plot of the two simulated clusters in the X-Y plane]

The very first step is to convert our 1D distances to 2D ones, using the equation below. If we replace dist[j,i] = data[j,1] - data[i,1] with this equation in our distance function, we will be able to generate the distance matrix for our two-dimensional dataset.

\[d_{m,n} = \sqrt{(x_{m} - x_{n})^{2} + (y_{m} - y_{n})^{2}}. \]

using ACS

function dist_calc(data)

    dist = zeros(size(data,1),size(data,1))                    # Initiate a square distance matrix
    for i = 1:size(data,1)-1                                   # Loop over all unique pairs of points (lower triangle)
        for j = i+1:size(data,1)
            dist[j,i] = sqrt(sum((data[i,:] .- data[j,:]).^2)) # Euclidean distance between measurements i and j
        end
    end

    dist += dist'                                              # Fill the upper triangle by symmetry
    return abs.(dist)                                          # Redundant here (distances are non-negative), kept for consistency

end

data = hcat(X,Y)								# Combine X and Y into the data matrix

dist = dist_calc(data) 							# Calculating the distance matrix
dist[diagind(dist)] .= Inf                   # Set the diagonal to inf, which is very helpful when searching for minimum distance
dist
10×10 Matrix{Float64}:
 Inf         0.621856   1.22398    0.980196  …   3.30324   4.47493   3.69579
  0.621856  Inf         0.604529   0.424723      3.11165   3.90691   3.41252
  1.22398    0.604529  Inf         0.457058      3.11354   3.41573   3.30421
  0.980196   0.424723   0.457058  Inf            2.77739   3.50061   3.03521
  0.413532   0.862485   1.44676    1.09289       3.02157   4.48593   3.46086
  5.34165    5.14236    5.09076    4.79064   …   2.04055   3.90055   1.80593
  5.02583    4.90607    4.93917    4.59204       1.82666   4.27169   1.83533
  3.30324    3.11165    3.11354    2.77739      Inf        3.13842   0.64112
  4.47493    3.90691    3.41573    3.50061       3.13842  Inf        2.64292
  3.69579    3.41252    3.30421    3.03521       0.64112   2.64292  Inf
using ACS

cl_temp = argmin(dist)

data1 = deepcopy(data)

data1[cl_temp[1],1] = mean([data[cl_temp[1],1],data[cl_temp[2],1]])
data1[cl_temp[1],2] = mean([data[cl_temp[1],2],data[cl_temp[2],2]])
data1[cl_temp[2],1] = mean([data[cl_temp[1],1],data[cl_temp[2],1]])
data1[cl_temp[2],2] = mean([data[cl_temp[1],2],data[cl_temp[2],2]])

data1
10×2 Matrix{Float64}:
 1.16205  2.49727
 1.53508  3.11645
 1.91841  3.5839
 1.95962  3.12871
 1.16205  2.49727
 5.93869  0.460892
 5.34934  0.0308567
 4.11598  1.37827
 5.2579   4.30157
 4.65275  1.72886
using ACS

scatter(data[:,1],data[:,2],label = "Original data")
scatter!(data1[:,1],data1[:,2],label = "Data after clustering")
xlabel!("X")
ylabel!("Y")
[Figure: the original 2D data and the data after the first merge in the X-Y plane]

As can be seen in the figure, there are two blue points with one red point in the middle between them. The blue dots represent the two closest data points, which are clustered together to form the centroid lying between them. If we repeat this process multiple times, we eventually end up with all data points grouped into one large cluster. HCA thus generates a hierarchy of merged clusters that can be visualized via a dendrogram or a heatmap.
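Before handing this over to the built-in hclust(-) function used below, the repetition described above can be sketched by hand. The following is only a minimal, illustrative sketch assuming the dist_calc(-) function defined earlier and centroid linkage (merged points are replaced by their mean); the helper name hca_sketch and the n_clusters keyword are made up for this example.

using ACS

function hca_sketch(data; n_clusters = 1)
    centroids = [data[i, :] for i in 1:size(data, 1)]         # every measurement starts as its own cluster
    members   = [[i] for i in 1:size(data, 1)]                # original point indices belonging to each cluster
    while length(centroids) > n_clusters
        d = dist_calc(permutedims(hcat(centroids...)))        # distances between the current centroids
        d[diagind(d)] .= Inf                                  # ignore self-distances
        i, j = Tuple(argmin(d))                               # the two closest clusters
        centroids[i] = (centroids[i] .+ centroids[j]) ./ 2    # replace them by their centroid
        members[i] = vcat(members[i], members[j])
        deleteat!(centroids, j)                               # drop the merged-away cluster
        deleteat!(members, j)
    end
    return members
end

hca_sketch(data, n_clusters = 2)   # should typically recover the two simulated clusters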

using ACS

dist = dist_calc(data)

hc = hclust(dist, linkage=:average)
sp.plot(hc)
[Figure: dendrogram of the data produced by hclust with average linkage]
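The same distance matrix can also be inspected as a heatmap. In the sketch below the rows and columns are reordered following hc.order, so the two simulated clusters appear as two blocks of small distances along the diagonal; this assumes that the heatmap(-) function from Plots.jl is available through ACS, like the other plotting functions used here.

using ACS

heatmap(dist[hc.order, hc.order],            # reorder the measurements following the dendrogram
        xlabel = "Measurement", ylabel = "Measurement")
# note: dist was recomputed above with dist_calc(data), so its diagonal is zero rather than Inf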

Practical Application

We can use either our home-developed function for HCA or the built-in Julia functions. Here we provide a short tutorial on how to use the functions that are built into the ACS.jl package.

For calculating the distances, the function pairwise(-) from the Julia package Distances.jl can be used. The function pairwise(-) has three inputs, namely: 1) the distance metric, 2) the data, and 3) the dimension along which to operate. This function outputs a square matrix similar to our distance matrix. As can be seen from the matrix below, our function and pairwise(-) generate the same results, which is expected. The function pairwise(-) gives access to a wide variety of distance metrics that can be used for your projects.

using ACS

dist1 = pairwise(ACS.Euclidean(), data, dims=1) # Euclidean distance

# dist1 = pairwise(ACS.TotalVariation(), data, dims=1) # TotalVariation distance
10×10 Matrix{Float64}:
 0.0       0.621856  1.22398   …  5.02583   3.30324  4.47493  3.69579
 0.621856  0.0       0.604529     4.90607   3.11165  3.90691  3.41252
 1.22398   0.604529  0.0          4.93917   3.11354  3.41573  3.30421
 0.980196  0.424723  0.457058     4.59204   2.77739  3.50061  3.03521
 0.413532  0.862485  1.44676      4.69678   3.02157  4.48593  3.46086
 5.34165   5.14236   5.09076   …  0.729566  2.04055  3.90055  1.80593
 5.02583   4.90607   4.93917      0.0       1.82666  4.27169  1.83533
 3.30324   3.11165   3.11354      1.82666   0.0      3.13842  0.64112
 4.47493   3.90691   3.41573      4.27169   3.13842  0.0      2.64292
 3.69579   3.41252   3.30421      1.83533   0.64112  2.64292  0.0
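As a quick check, the snippet below confirms that pairwise(-) and our own dist_calc(-) produce the same matrix (up to floating-point rounding).

using ACS

isapprox(dist1, dist_calc(data))   # true: both approaches give the same distance matrix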

For the HCA itself, you can use the function hclust(-) incorporated in the ACS.jl package and provided via the Clustering.jl package. This function takes two inputs: the distance matrix and the linkage method. The output of this function is a structure with four fields, the two most important being merges and order. The combination of all four fields can be plotted via the sp.plot(-) function.

using ACS

h = hclust(dist1, linkage=:average) # Average linkage
Clustering.Hclust{Float64}([-1 -5; -2 -4; … ; -9 7; 6 8], [0.4135316701781886, 0.4247225664087117, 0.5307938702463397, 0.6411201538574205, 0.7295657553454891, 1.0380267405091588, 1.8771190143009902, 3.488395180289137, 4.063721390532834], [1, 5, 3, 2, 4, 9, 8, 10, 6, 7], :average)

To access the outputs, one can do the following:

using ACS

h.order
10-element Vector{Int64}:
  1
  5
  3
  2
  4
  9
  8
 10
  6
  7
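The merging sequence and the corresponding merge distances can be accessed in the same way; in h.merges, negative numbers refer to the original observations and positive numbers to clusters formed in earlier merge steps.

using ACS

h.merges    # which observations/clusters are merged at each step
h.heights   # the distance at which each merge takes place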

and to plot the dendrogram, we can use the function below.

using ACS

sp.plot(h)
[Figure: dendrogram of the HCA results]

There are also Python implementations of HCA that you can explore for your analyses.

Additional Example

If you are interested in practicing more, you can use the mtcars dataset (from RDatasets), provided in the dataset folder of the ACS.jl GitHub repository. Please note that you must exclude the car origin column from the clustering. The objective here is to see whether HCA is able to group cars with similar origins together.
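A minimal sketch of this exercise is given below. The file name, the column names to drop, and the availability of CSV and DataFrames through ACS are assumptions that you may need to adapt to your local copy of the data.

using ACS
# using CSV, DataFrames                              # uncomment if these are not re-exported by ACS

cars = CSV.read("mtcars.csv", DataFrame)             # hypothetical path to the downloaded file
meas = Matrix(cars[:, Not(["Model", "Origin"])])     # keep only the numerical measurements (assumed column names)

# Depending on the variables, you may want to scale each column before computing the distances.
dist_cars = pairwise(ACS.Euclidean(), meas, dims = 1)
h_cars = hclust(dist_cars, linkage = :average)
sp.plot(h_cars)                                      # check whether cars with similar origins share the same branches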

If you are interested in additional resources regarding HCA and would like to know more, you can check this MIT course material.