Article 21: Non-linear Dimensionality Reduction – Mathematical Analysis of t-SNE and UMAP 🌀📉
අපි කලින් ඉගන ගත්ත PCA වලින් දත්ත වල Linear සම්බන්ධතා හඳුනාගත්තත් දත්ත ඉතාමත් සංකීර්ණ වක්ර (Manifolds) විදියට විසිරිලා තියෙනකොට PCA අසාර්ථක වෙනවා. එතකොට අපි පාවිච්චි කරන්නේ දත්තවල Local Structure එක ආරක්ෂා කරන ඊට වඩා උසස් algorithms.
01. t-SNE (t-Distributed Stochastic Neighbor Embedding) 🧩
t-SNE කියන්නේ data points අතර තියෙන Probabilities මත පදනම් වෙලා ක්රියා කරන algorithm එකක්.
I. High-Dimensional Similarity (Gaussian Kernel)
ඉහළ මානවලදි, ලක්ෂ්ය xⱼ ලක්ෂ්ය xᵢ වල අසල්වැසියෙකු වීමේ සම්භාවිතාව (p₍ⱼ∣ᵢ₎) ගණනය කරන්නේ Gaussian Distribution එකකින්.
p₍ⱼ∣ᵢ₎ = ((exp(− ∥xᵢ − xⱼ∥²/2σᵢ²)) / (∑₍ₖ≠ᵢ₎exp(− ∥xᵢ − xₖ∥²/2σᵢ²)))
II. Low-Dimensional Similarity (Student's t-Distribution)
පහළ මානයේදී (2D), අපි Gaussian වෙනුවට Student's t-distribution (with 1 degree of freedom) පාවිච්චි කරනවා. මේකට හේතුව තමා Crowding Problem (ඉහළ මානයේ සිට පහළට එද්දී දත්ත ලක්ෂ්ය එකම තැනක ගොඩ ගැසීම) වළක්වා ගැනීමට t-distribution එකේ ඇති Heavy Tail එක උදව් වීම.
qᵢⱼ = (((1 + ∥yᵢ − yⱼ∥²)⁻¹) / (∑₍ₖ≠ₗ₎(1 + ∥yₖ − yₗ∥²)⁻¹))
III. The Cost Function: Kullback-Leibler (KL) Divergence
t-SNE වල ඉලක්කය වෙන්නෙ P ව්යාප්තිය සහ Q ව්යාප්තිය අතර ඇති වෙනස අවම කරන එක. මෙහිදී Gradient Descent පාවිච්චි කරන්නේ මේ KL Divergence එක අඩු කරන්න.
C = KL(P∣∣Q) = ∑ᵢ∑ⱼpᵢⱼlog ((pᵢⱼ) / (qᵢⱼ))
02. UMAP (Uniform Manifold Approximation and Projection) 🕸
UMAP කියන්නේ t-SNE වලට වඩා වේගවත් වගේම දත්තවල Global Structure එක වඩා හොඳින් ආරක්ෂා කරන අලුත් තාක්ෂණයක්. මේකේ පදනම වැටෙන්නේ Riemannian Geometry සහ Algebraic Topology මත.
I. Topological Structure Construction
UMAP වලින් දත්ත විසිරිලා තියෙන ආකාරය ඇසුරෙන් Fuzzy Simplicial Complex එකක් නිර්මාණය කරනවා. මේකෙදි එක් එක් ලක්ෂ්යය වටා ඇති ප්රදේශය (Neighborhood) Locally Uniform බව උපකල්පනය කරනවා.
II. Optimization (Cross Entropy)
t-SNE වලදී KL Divergence පාවිච්චි කරද්දි UMAP විසින් Binary Cross Entropy පාවිච්චි කරමින් ඉහළ මානයේ සහ පහළ මානයේ තියෙන ජ්යාමිතික ව්යුහයන් අතර සම්බන්ධතාවය ප්රශස්ත කරනව.
∑₍ₑ∈E₎wₕ(e)log (((wₕ(e)) / (wₗ(e))))+ (1 − wₕ(e))log (((1 − wₕ(e)) / (1 − wₗ(e))))
මේකේ wₕ සහ wₗ කියන්නේ ඉහළ සහ පහළ මානයන්හි ඇති edge weights.
03. Advanced Implementation Details 🚀
t-SNE සාමාන්යයෙන් අහඹු ලෙස (Random) ආරම්භ කරන හින්දා UMAP වලින් Graph Laplacian හරි PCA පාවිච්චි කර වඩා හොඳ ආරම්භයක් ලබා ගන්නවා. t-SNE හි වේගය O(N^2) හෝ O(N log N) (Barnes-Hut) වෙද්දී UMAP හි වේගය O(N^{1.14}) වගේ ඉතා අඩු අගයක්. ඒ හින්දා විශාල දත්ත ගොනු වලට UMAP අතිශය සාර්ථකයි. 🎓✅
අපි අද කතා කලේ Dimensionality Reduction වල තවත් කොටසක්. ඊලග ආටිකල් එකෙන් (Article 22) Gaussian Mixture Models (GMM) සහ EM Algorithm ගැන කතා කරමු. 🧠⏭️ 🙊😁
✍️ @TheInfinityAI
Telegram
Infinity CS
❤3
progit.pdf
18 MB
Pro Git (Second Edition) - Everything you need to know about Git 📕🔧
A complete and practical guide to Git version control—from basics to advanced workflows. Ideal for developers at any level.
📌 Credit:
Shared for educational purposes only. All rights belong to the authors and publisher.
Happy learning & keep growing! 💡📖
✍️ @TheInfinityAI
A complete and practical guide to Git version control—from basics to advanced workflows. Ideal for developers at any level.
👨💻 Authors: Scott Chacon & Ben Straub
🏢 Publisher: Apress
📚 Online Git Book: Link
📌 Credit:
Shared for educational purposes only. All rights belong to the authors and publisher.
Happy learning & keep growing! 💡📖
✍️ @TheInfinityAI
❤3🐳1
Article 22: Gaussian Mixture Models (GMM) and EM Algorithm 🎲📈
In the last article, we studied K-Means. K-Means is Hard Clustering. Its mean one data point belongs to only one group. But in the real world, data is not always clear. A data point can have two groups at the same time. Gaussian Mixture Models (GMM) help us to solve this. GMM is Soft Clustering. It gives a probability for each group.
1. What is a Gaussian Distribution? 🔔
GMM assumes the data comes from many Gaussian Distributions. We also call it as normal distribution or bell curve. Each group (cluster) has its own shape. To define a Gaussian shape, we need two things. Mean (μ) and Covariance (Σ). The mean is the center of the curve and the covariance is the width and direction of the curve.
𝓝(x∣μ,Σ) = (1 / ((2π)⁽ᴰ/²⁾∣Σ∣⁽¹/²⁾))exp (−½(x − μ)ᵀΣ⁻¹(x − μ))
2. How GMM Works: The EM Algorithm 🔄
We dont know the center (μ) or the width (Σ) of the groups at the start. So we are using the Expectation-Maximization (EM) Algorithm. It works in a loop with two main steps.
Step 1: The E-Step (Expectation) 🧐
In this step, machine calculates the responsibility. It asks, "what is the chance that this data point belongs to group A, group B, or group C?" We are using this formula to find the chance (γ),
γ(zₙₖ) = ((πₖ𝓝(xₙ∣μₖ,Σₖ)) / (∑ⱼᴷ₌₁πⱼ𝓝(xₙ∣μⱼ,Σⱼ)))
Step 2: The M-Step (Maximization) 🛠
Now the machine uses the chances from the E-Step to update the groups. It changes the center and the width to fit the data better. It tries to make the Log-Likelihood (the total fit) as high as possible.
3. Why is GMM better than K-Means? 🤔🏆
4. How to find the number of groups? (AIC and BIC) 📊
In K-Means, we are using the Elbow Method but in GMM, we are using AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
Therefore, we are choosing the number of groups that gives the lowest BIC score.
Summary 📝
GMM is a powerful tool for finding groups using probability. It uses the EM Algorithm to find the best centers and shapes for the data. Unlike K-Means it is very flexible and works well when groups overlap or have different shapes. In Article 23, we will discuss Hierarchical Clustering (building a tree of data). 🌳⏭️ 🙊😁
✍️ @TheInfinityAI
In the last article, we studied K-Means. K-Means is Hard Clustering. Its mean one data point belongs to only one group. But in the real world, data is not always clear. A data point can have two groups at the same time. Gaussian Mixture Models (GMM) help us to solve this. GMM is Soft Clustering. It gives a probability for each group.
1. What is a Gaussian Distribution? 🔔
GMM assumes the data comes from many Gaussian Distributions. We also call it as normal distribution or bell curve. Each group (cluster) has its own shape. To define a Gaussian shape, we need two things. Mean (μ) and Covariance (Σ). The mean is the center of the curve and the covariance is the width and direction of the curve.
𝓝(x∣μ,Σ) = (1 / ((2π)⁽ᴰ/²⁾∣Σ∣⁽¹/²⁾))exp (−½(x − μ)ᵀΣ⁻¹(x − μ))
2. How GMM Works: The EM Algorithm 🔄
We dont know the center (μ) or the width (Σ) of the groups at the start. So we are using the Expectation-Maximization (EM) Algorithm. It works in a loop with two main steps.
Step 1: The E-Step (Expectation) 🧐
In this step, machine calculates the responsibility. It asks, "what is the chance that this data point belongs to group A, group B, or group C?" We are using this formula to find the chance (γ),
γ(zₙₖ) = ((πₖ𝓝(xₙ∣μₖ,Σₖ)) / (∑ⱼᴷ₌₁πⱼ𝓝(xₙ∣μⱼ,Σⱼ)))
Step 2: The M-Step (Maximization) 🛠
Now the machine uses the chances from the E-Step to update the groups. It changes the center and the width to fit the data better. It tries to make the Log-Likelihood (the total fit) as high as possible.
3. Why is GMM better than K-Means? 🤔🏆
● Soft Assignment - we can get a percentage for each group. This is more useful for complex data.
● Flexible Shapes - K-Means only finds circles but GMM can find ellipses (oval shapes) that point in any direction.
● Covariance Types - we can choose different settings for the shape like spherical, diagonal and full.
4. How to find the number of groups? (AIC and BIC) 📊
In K-Means, we are using the Elbow Method but in GMM, we are using AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
● These are scores that tell us if the model is good.
● If we add too many groups, the model becomes too complex (Overfitting).
● AIC and BIC give penalties for complexity.
Therefore, we are choosing the number of groups that gives the lowest BIC score.
Summary 📝
GMM is a powerful tool for finding groups using probability. It uses the EM Algorithm to find the best centers and shapes for the data. Unlike K-Means it is very flexible and works well when groups overlap or have different shapes. In Article 23, we will discuss Hierarchical Clustering (building a tree of data). 🌳⏭️ 🙊😁
✍️ @TheInfinityAI
Telegram
Infinity CS
Gaussian Mixture Models (GMM) and EM Algorithm 🎲
❤2
Article 23: Hierarchical Clustering – Building the Tree of Data 🌳🌲
Hierarchical clustering is finding groups by building a hierarchy. Unlike K-Means, we do not need to choose the number of groups (K) at the beginning. In this, creates a tree of data called a Dendrogram.
1. The Core Logic: Agglomerative Clustering 🏗
Most people use the Agglomerative (Bottom-Up) method for this,
2. Linkage Criteria: The Math of Merging 📏🧮
Linkage is a method used in hierarchical clustering to define how the distance between two clusters is computed. It is based on the distances between the data points in those clusters. Instead of measuring the distance between individual points, linkage tells the algorithm to how to measure the distance between groups of points (clusters).
I. Single Linkage (Minimum Distance)
It measures the distance between the two closest points in two clusters. It can create long and thin clusters. We call it as the Chaining Effect.
II. Complete Linkage (Maximum Distance)
It measures the distance between the two furthest points in two clusters. It avoids chaining and creates compact and round clusters.
III. Average Linkage
It calculates the average distance between all pairs of points in two clusters.
IV. Ward’s Method
It does not just look at distance. It looks at the variance also. Ward's joins two clusters only if the total within-cluster variance stays as small as possible. It will create very clear and equally sized clusters. It is the mathematically strongest one for general data.
3. The Dendrogram Analysis 📊✂️
The dendrogram is a visual representation of the hierarchical clustering process. It is showing how clusters are formed step by step.
● The vertical axis (height) represents the distance or dissimilarity at which clusters merge.
● Clusters that merge at lower heights are more similar than clusters that merge at higher heights.
To decide the number of clusters, 🎯
Now, the number of branches crossed by the horizontal cut is the value of K.
4. Cophenetic Correlation and Performance 🧪⏳
We are using the Cophenetic Correlation Coefficient (c) to prove the tree's accuracy. It measures the correlation between the original distances of the data points and the distances where they join in the Dendrogram. If c > 0.75, tree is a good representation of the data. Hierarchical clustering is heavy for computers. Time complexity is O(n^2 log n) or O(n^3). It requires O(n^2) space to store the distance matrix. This means it is very slow for millions of rows. ✅💾
Summary 📝
Hierarchical Clustering helps to see the structure of data like a family tree. Use Ward’s Method for the best groups and the Dendrogram to pick the K value. Always check the Cophenetic Correlation to ensure the results are correct. 🌟 🙊😁
More: Link
✍️ @TheInfinityAI
Hierarchical clustering is finding groups by building a hierarchy. Unlike K-Means, we do not need to choose the number of groups (K) at the beginning. In this, creates a tree of data called a Dendrogram.
1. The Core Logic: Agglomerative Clustering 🏗
Most people use the Agglomerative (Bottom-Up) method for this,
● Every data point starts as its own small cluster.
● The machine finds the two clusters that are closest together.
● The machine joins (merges) them into one new cluster.
● The machine updates the distance between the new cluster and all other clusters.
● It repeats this until all data is in one big cluster.
2. Linkage Criteria: The Math of Merging 📏🧮
Linkage is a method used in hierarchical clustering to define how the distance between two clusters is computed. It is based on the distances between the data points in those clusters. Instead of measuring the distance between individual points, linkage tells the algorithm to how to measure the distance between groups of points (clusters).
I. Single Linkage (Minimum Distance)
It measures the distance between the two closest points in two clusters. It can create long and thin clusters. We call it as the Chaining Effect.
II. Complete Linkage (Maximum Distance)
It measures the distance between the two furthest points in two clusters. It avoids chaining and creates compact and round clusters.
III. Average Linkage
It calculates the average distance between all pairs of points in two clusters.
IV. Ward’s Method
It does not just look at distance. It looks at the variance also. Ward's joins two clusters only if the total within-cluster variance stays as small as possible. It will create very clear and equally sized clusters. It is the mathematically strongest one for general data.
3. The Dendrogram Analysis 📊✂️
The dendrogram is a visual representation of the hierarchical clustering process. It is showing how clusters are formed step by step.
● The vertical axis (height) represents the distance or dissimilarity at which clusters merge.
● Clusters that merge at lower heights are more similar than clusters that merge at higher heights.
To decide the number of clusters, 🎯
● Identify the largest vertical gap in the dendrogram (a region with a big jump in height where no merges occur).
● Draw a horizontal line across the dendrogram within this gap.
● Count the number of vertical branches intersected by the line.
Now, the number of branches crossed by the horizontal cut is the value of K.
4. Cophenetic Correlation and Performance 🧪⏳
We are using the Cophenetic Correlation Coefficient (c) to prove the tree's accuracy. It measures the correlation between the original distances of the data points and the distances where they join in the Dendrogram. If c > 0.75, tree is a good representation of the data. Hierarchical clustering is heavy for computers. Time complexity is O(n^2 log n) or O(n^3). It requires O(n^2) space to store the distance matrix. This means it is very slow for millions of rows. ✅💾
Summary 📝
Hierarchical Clustering helps to see the structure of data like a family tree. Use Ward’s Method for the best groups and the Dendrogram to pick the K value. Always check the Cophenetic Correlation to ensure the results are correct. 🌟 🙊😁
More: Link
✍️ @TheInfinityAI
Telegram
Infinity CS
❤3