𝐈𝐧𝐟𝐢𝐧𝐢𝐭𝐲 𝐂𝐒

Article 23: Hierarchical Clustering – Building the Tree of Data 🌳🌲

Hierarchical clustering finds groups by building a hierarchy. Unlike K-Means, we do not need to choose the number of groups (K) at the beginning. Instead, the algorithm creates a tree of the data called a Dendrogram.

1. The Core Logic: Agglomerative Clustering 🏗

Most people use the Agglomerative (Bottom-Up) method for this:

● Every data point starts as its own small cluster.
● The machine finds the two clusters that are closest together.
● The machine joins (merges) them into one new cluster.
● The machine updates the distance between the new cluster and all other clusters.
● It repeats this until all data is in one big cluster.
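The steps above can be sketched in a few lines of Python. This is a minimal, naive illustration and not an optimized implementation: it assumes Euclidean distance, uses the closest-pair (single linkage) rule to measure cluster distance, and simply recomputes distances on every pass instead of updating a stored matrix. The function names and sample points are made up for the example.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def cluster_distance(a, b, points):
    """Assumed rule: distance between the two closest members (single linkage)."""
    return min(dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points):
    """Naive bottom-up clustering. Returns the merges in the order they happen."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two clusters that are closest together.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        # Record the merge, then join the pair into one new cluster.
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        # Distances to the new cluster are recomputed on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 3))
```

With n points the loop always performs n - 1 merges, and the recorded distances are exactly the heights a dendrogram would draw.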

2. Linkage Criteria: The Math of Merging 📏🧮

Linkage is the rule hierarchical clustering uses to compute the distance between two clusters. It is based on the distances between the data points in those clusters. Instead of measuring the distance between individual points, linkage tells the algorithm how to measure the distance between groups of points (clusters).

I. Single Linkage (Minimum Distance)
It measures the distance between the two closest points in two clusters. It can create long and thin clusters. This is called the Chaining Effect.

II. Complete Linkage (Maximum Distance)
It measures the distance between the two furthest points in two clusters. It avoids chaining and tends to create compact, round clusters, but it is sensitive to outliers.

III. Average Linkage
It calculates the average distance between all pairs of points in two clusters.

IV. Ward’s Method
It does not just look at distance. It also looks at variance. At each step, Ward's method merges the pair of clusters whose merger causes the smallest increase in total within-cluster variance. It tends to create compact, similarly sized clusters. It is a strong default for general numeric data measured with Euclidean distance.
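To make the four criteria concrete, here is a small sketch that computes each one for the same pair of clusters. It assumes Euclidean distance, and the clusters A and B are made-up sample data. For Ward's method it uses the standard merge cost, the increase in within-cluster sum of squares: |A||B| / (|A| + |B|) times the squared distance between the two centroids.

```python
from itertools import product
from math import dist

def single(a, b):
    """Distance between the two closest points."""
    return min(dist(p, q) for p, q in product(a, b))

def complete(a, b):
    """Distance between the two furthest points."""
    return max(dist(p, q) for p, q in product(a, b))

def average(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(c):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(xs) / len(c) for xs in zip(*c))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * dist(centroid(a), centroid(b)) ** 2

A = [(0.0, 0.0), (1.0, 0.0)]  # sample cluster (assumption)
B = [(4.0, 0.0), (6.0, 0.0)]  # sample cluster (assumption)

print(single(A, B))     # 3.0
print(complete(A, B))   # 6.0
print(average(A, B))    # 4.5
print(ward_cost(A, B))  # 20.25
```

The same two clusters are 3.0, 4.5 or 6.0 apart depending on the linkage, which is why the choice of linkage can change which pair gets merged next.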

3. The Dendrogram Analysis 📊✂️

The dendrogram is a visual representation of the hierarchical clustering process. It shows how clusters are formed step by step.

● The vertical axis (height) represents the distance or dissimilarity at which clusters merge.
● Clusters that merge at lower heights are more similar than clusters that merge at higher heights.

To decide the number of clusters: 🎯

● Identify the largest vertical gap in the dendrogram (a region with a big jump in height where no merges occur).
● Draw a horizontal line across the dendrogram within this gap.
● Count the number of vertical branches intersected by the line.

Now, the number of branches crossed by the horizontal cut is the value of K.
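The largest-gap rule can also be applied directly to the list of merge heights, without drawing anything. The sketch below is one possible way to do it, assuming the heights never decrease from one merge to the next (true for single, complete, average and Ward linkage). The function name `choose_k` and the sample heights are hypothetical.

```python
def choose_k(heights, n_points):
    """Cut the tree in the largest gap between consecutive merge heights.

    heights: the n_points - 1 merge heights of the dendrogram.
    Returns (k, cut_height).
    """
    hs = sorted(heights)
    gaps = [hs[i + 1] - hs[i] for i in range(len(hs) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    cut = (hs[i] + hs[i + 1]) / 2                    # horizontal line inside the gap
    k = n_points - (i + 1)                           # i + 1 merges lie below the cut
    return k, cut

# Hypothetical merge heights for a dataset of 6 points (5 merges).
heights = [1.0, 1.5, 2.0, 9.0, 10.0]
k, cut = choose_k(heights, n_points=6)
print(k, cut)  # 3 5.5
```

Each merge below the cut reduces the cluster count by one, so starting from 6 single-point clusters and applying 3 merges leaves 3 branches crossing the line.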

4. Cophenetic Correlation and Performance 🧪⏳

We use the Cophenetic Correlation Coefficient (c) to check how faithfully the tree represents the data. It measures the correlation between the original pairwise distances of the data points and the heights at which those points first join in the Dendrogram. As a rule of thumb, if c > 0.75, the tree is a good representation of the data.

Hierarchical clustering is heavy for computers. Time complexity is O(n^3) for the naive algorithm, or O(n^2 log n) with a priority queue. It requires O(n^2) space to store the distance matrix. This makes it impractical for millions of rows: 1 million points means about 5 × 10^11 pairwise distances, roughly 4 TB at 8 bytes each. ✅💾
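As an illustration of how the Cophenetic Correlation Coefficient is obtained, the sketch below builds a tree with the naive closest-pair (single linkage) rule, records the height at which each pair of points first lands in the same cluster, and takes the Pearson correlation with the original distances. Euclidean distance, the linkage choice, the function names and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist, sqrt

def cophenetic_distances(points):
    """Return {(i, j): height at which points i and j first join the same cluster}."""
    clusters = [[i] for i in range(len(points))]
    coph = {}
    while len(clusters) > 1:
        # Closest pair of clusters under single linkage.
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(dist(points[i], points[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        # Every cross pair joins for the first time at this height.
        for i in clusters[x]:
            for j in clusters[y]:
                coph[tuple(sorted((i, j)))] = d
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return coph

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
coph = cophenetic_distances(points)
pairs = sorted(coph)
original = [dist(points[i], points[j]) for i, j in pairs]
tree = [coph[p] for p in pairs]
c = pearson(original, tree)
print(round(c, 3))
```

In practice, SciPy's `scipy.cluster.hierarchy.cophenet` computes the same coefficient from a linkage matrix, so a hand-written version like this is mainly useful for seeing what the number means.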

Summary 📝

Hierarchical Clustering helps to see the structure of data like a family tree. Use Ward’s Method as a strong default for compact groups and the Dendrogram to pick the K value. Always check the Cophenetic Correlation to see how well the tree preserves the original distances. 🌟 🙊😁

✍️ @TheInfinityAI
