Article 21: Non-linear Dimensionality Reduction – Mathematical Analysis of t-SNE and UMAP 🌀📉
අපි කලින් ඉගන ගත්ත PCA වලින් දත්ත වල Linear සම්බන්ධතා හඳුනාගත්තත් දත්ත ඉතාමත් සංකීර්ණ වක්ර (Manifolds) විදියට විසිරිලා තියෙනකොට PCA අසාර්ථක වෙනවා. එතකොට අපි පාවිච්චි කරන්නේ දත්තවල Local Structure එක ආරක්ෂා කරන ඊට වඩා උසස් algorithms.
01. t-SNE (t-Distributed Stochastic Neighbor Embedding) 🧩
t-SNE කියන්නේ data points අතර තියෙන Probabilities මත පදනම් වෙලා ක්රියා කරන algorithm එකක්.
I. High-Dimensional Similarity (Gaussian Kernel)
ඉහළ මානවලදි, ලක්ෂ්ය xⱼ ලක්ෂ්ය xᵢ වල අසල්වැසියෙකු වීමේ සම්භාවිතාව (p₍ⱼ∣ᵢ₎) ගණනය කරන්නේ Gaussian Distribution එකකින්.
p₍ⱼ∣ᵢ₎ = ((exp(− ∥xᵢ − xⱼ∥²/2σᵢ²)) / (∑₍ₖ≠ᵢ₎exp(− ∥xᵢ − xₖ∥²/2σᵢ²)))
II. Low-Dimensional Similarity (Student's t-Distribution)
පහළ මානයේදී (2D), අපි Gaussian වෙනුවට Student's t-distribution (with 1 degree of freedom) පාවිච්චි කරනවා. මේකට හේතුව තමා Crowding Problem (ඉහළ මානයේ සිට පහළට එද්දී දත්ත ලක්ෂ්ය එකම තැනක ගොඩ ගැසීම) වළක්වා ගැනීමට t-distribution එකේ ඇති Heavy Tail එක උදව් වීම.
qᵢⱼ = (((1 + ∥yᵢ − yⱼ∥²)⁻¹) / (∑₍ₖ≠ₗ₎(1 + ∥yₖ − yₗ∥²)⁻¹))
III. The Cost Function: Kullback-Leibler (KL) Divergence
t-SNE වල ඉලක්කය වෙන්නෙ P ව්යාප්තිය සහ Q ව්යාප්තිය අතර ඇති වෙනස අවම කරන එක. මෙහිදී Gradient Descent පාවිච්චි කරන්නේ මේ KL Divergence එක අඩු කරන්න.
C = KL(P∣∣Q) = ∑ᵢ∑ⱼpᵢⱼlog ((pᵢⱼ) / (qᵢⱼ))
02. UMAP (Uniform Manifold Approximation and Projection) 🕸
UMAP කියන්නේ t-SNE වලට වඩා වේගවත් වගේම දත්තවල Global Structure එක වඩා හොඳින් ආරක්ෂා කරන අලුත් තාක්ෂණයක්. මේකේ පදනම වැටෙන්නේ Riemannian Geometry සහ Algebraic Topology මත.
I. Topological Structure Construction
UMAP වලින් දත්ත විසිරිලා තියෙන ආකාරය ඇසුරෙන් Fuzzy Simplicial Complex එකක් නිර්මාණය කරනවා. මේකෙදි එක් එක් ලක්ෂ්යය වටා ඇති ප්රදේශය (Neighborhood) Locally Uniform බව උපකල්පනය කරනවා.
II. Optimization (Cross Entropy)
t-SNE වලදී KL Divergence පාවිච්චි කරද්දි UMAP විසින් Binary Cross Entropy පාවිච්චි කරමින් ඉහළ මානයේ සහ පහළ මානයේ තියෙන ජ්යාමිතික ව්යුහයන් අතර සම්බන්ධතාවය ප්රශස්ත කරනව.
∑₍ₑ∈E₎wₕ(e)log (((wₕ(e)) / (wₗ(e))))+ (1 − wₕ(e))log (((1 − wₕ(e)) / (1 − wₗ(e))))
මේකේ wₕ සහ wₗ කියන්නේ ඉහළ සහ පහළ මානයන්හි ඇති edge weights.
03. Advanced Implementation Details 🚀
t-SNE සාමාන්යයෙන් අහඹු ලෙස (Random) ආරම්භ කරන හින්දා UMAP වලින් Graph Laplacian හරි PCA පාවිච්චි කර වඩා හොඳ ආරම්භයක් ලබා ගන්නවා. t-SNE හි වේගය O(N^2) හෝ O(N log N) (Barnes-Hut) වෙද්දී UMAP හි වේගය O(N^{1.14}) වගේ ඉතා අඩු අගයක්. ඒ හින්දා විශාල දත්ත ගොනු වලට UMAP අතිශය සාර්ථකයි. 🎓✅
අපි අද කතා කලේ Dimensionality Reduction වල තවත් කොටසක්. ඊලග ආටිකල් එකෙන් (Article 22) Gaussian Mixture Models (GMM) සහ EM Algorithm ගැන කතා කරමු. 🧠⏭️ 🙊😁
✍️ @TheInfinityAI
Telegram
Infinity CS
❤3
progit.pdf
18 MB
Pro Git (Second Edition) - Everything you need to know about Git 📕🔧
A complete and practical guide to Git version control—from basics to advanced workflows. Ideal for developers at any level.
📌 Credit:
Shared for educational purposes only. All rights belong to the authors and publisher.
Happy learning & keep growing! 💡📖
✍️ @TheInfinityAI
A complete and practical guide to Git version control—from basics to advanced workflows. Ideal for developers at any level.
👨💻 Authors: Scott Chacon & Ben Straub
🏢 Publisher: Apress
📚 Online Git Book: Link
📌 Credit:
Shared for educational purposes only. All rights belong to the authors and publisher.
Happy learning & keep growing! 💡📖
✍️ @TheInfinityAI
❤3🐳1
Article 22: Gaussian Mixture Models (GMM) and EM Algorithm 🎲📈
In the last article, we studied K-Means. K-Means is Hard Clustering. Its mean one data point belongs to only one group. But in the real world, data is not always clear. A data point can have two groups at the same time. Gaussian Mixture Models (GMM) help us to solve this. GMM is Soft Clustering. It gives a probability for each group.
1. What is a Gaussian Distribution? 🔔
GMM assumes the data comes from many Gaussian Distributions. We also call it as normal distribution or bell curve. Each group (cluster) has its own shape. To define a Gaussian shape, we need two things. Mean (μ) and Covariance (Σ). The mean is the center of the curve and the covariance is the width and direction of the curve.
𝓝(x∣μ,Σ) = (1 / ((2π)⁽ᴰ/²⁾∣Σ∣⁽¹/²⁾))exp (−½(x − μ)ᵀΣ⁻¹(x − μ))
2. How GMM Works: The EM Algorithm 🔄
We dont know the center (μ) or the width (Σ) of the groups at the start. So we are using the Expectation-Maximization (EM) Algorithm. It works in a loop with two main steps.
Step 1: The E-Step (Expectation) 🧐
In this step, machine calculates the responsibility. It asks, "what is the chance that this data point belongs to group A, group B, or group C?" We are using this formula to find the chance (γ),
γ(zₙₖ) = ((πₖ𝓝(xₙ∣μₖ,Σₖ)) / (∑ⱼᴷ₌₁πⱼ𝓝(xₙ∣μⱼ,Σⱼ)))
Step 2: The M-Step (Maximization) 🛠
Now the machine uses the chances from the E-Step to update the groups. It changes the center and the width to fit the data better. It tries to make the Log-Likelihood (the total fit) as high as possible.
3. Why is GMM better than K-Means? 🤔🏆
4. How to find the number of groups? (AIC and BIC) 📊
In K-Means, we are using the Elbow Method but in GMM, we are using AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
Therefore, we are choosing the number of groups that gives the lowest BIC score.
Summary 📝
GMM is a powerful tool for finding groups using probability. It uses the EM Algorithm to find the best centers and shapes for the data. Unlike K-Means it is very flexible and works well when groups overlap or have different shapes. In Article 23, we will discuss Hierarchical Clustering (building a tree of data). 🌳⏭️ 🙊😁
✍️ @TheInfinityAI
In the last article, we studied K-Means. K-Means is Hard Clustering. Its mean one data point belongs to only one group. But in the real world, data is not always clear. A data point can have two groups at the same time. Gaussian Mixture Models (GMM) help us to solve this. GMM is Soft Clustering. It gives a probability for each group.
1. What is a Gaussian Distribution? 🔔
GMM assumes the data comes from many Gaussian Distributions. We also call it as normal distribution or bell curve. Each group (cluster) has its own shape. To define a Gaussian shape, we need two things. Mean (μ) and Covariance (Σ). The mean is the center of the curve and the covariance is the width and direction of the curve.
𝓝(x∣μ,Σ) = (1 / ((2π)⁽ᴰ/²⁾∣Σ∣⁽¹/²⁾))exp (−½(x − μ)ᵀΣ⁻¹(x − μ))
2. How GMM Works: The EM Algorithm 🔄
We dont know the center (μ) or the width (Σ) of the groups at the start. So we are using the Expectation-Maximization (EM) Algorithm. It works in a loop with two main steps.
Step 1: The E-Step (Expectation) 🧐
In this step, machine calculates the responsibility. It asks, "what is the chance that this data point belongs to group A, group B, or group C?" We are using this formula to find the chance (γ),
γ(zₙₖ) = ((πₖ𝓝(xₙ∣μₖ,Σₖ)) / (∑ⱼᴷ₌₁πⱼ𝓝(xₙ∣μⱼ,Σⱼ)))
Step 2: The M-Step (Maximization) 🛠
Now the machine uses the chances from the E-Step to update the groups. It changes the center and the width to fit the data better. It tries to make the Log-Likelihood (the total fit) as high as possible.
3. Why is GMM better than K-Means? 🤔🏆
● Soft Assignment - we can get a percentage for each group. This is more useful for complex data.
● Flexible Shapes - K-Means only finds circles but GMM can find ellipses (oval shapes) that point in any direction.
● Covariance Types - we can choose different settings for the shape like spherical, diagonal and full.
4. How to find the number of groups? (AIC and BIC) 📊
In K-Means, we are using the Elbow Method but in GMM, we are using AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
● These are scores that tell us if the model is good.
● If we add too many groups, the model becomes too complex (Overfitting).
● AIC and BIC give penalties for complexity.
Therefore, we are choosing the number of groups that gives the lowest BIC score.
Summary 📝
GMM is a powerful tool for finding groups using probability. It uses the EM Algorithm to find the best centers and shapes for the data. Unlike K-Means it is very flexible and works well when groups overlap or have different shapes. In Article 23, we will discuss Hierarchical Clustering (building a tree of data). 🌳⏭️ 🙊😁
✍️ @TheInfinityAI
Telegram
Infinity CS
Gaussian Mixture Models (GMM) and EM Algorithm 🎲
❤2
Article 23: Hierarchical Clustering – Building the Tree of Data 🌳🌲
Hierarchical clustering is finding groups by building a hierarchy. Unlike K-Means, we do not need to choose the number of groups (K) at the beginning. In this, creates a tree of data called a Dendrogram.
1. The Core Logic: Agglomerative Clustering 🏗
Most people use the Agglomerative (Bottom-Up) method for this,
2. Linkage Criteria: The Math of Merging 📏🧮
Linkage is a method used in hierarchical clustering to define how the distance between two clusters is computed. It is based on the distances between the data points in those clusters. Instead of measuring the distance between individual points, linkage tells the algorithm to how to measure the distance between groups of points (clusters).
I. Single Linkage (Minimum Distance)
It measures the distance between the two closest points in two clusters. It can create long and thin clusters. We call it as the Chaining Effect.
II. Complete Linkage (Maximum Distance)
It measures the distance between the two furthest points in two clusters. It avoids chaining and creates compact and round clusters.
III. Average Linkage
It calculates the average distance between all pairs of points in two clusters.
IV. Ward’s Method
It does not just look at distance. It looks at the variance also. Ward's joins two clusters only if the total within-cluster variance stays as small as possible. It will create very clear and equally sized clusters. It is the mathematically strongest one for general data.
3. The Dendrogram Analysis 📊✂️
The dendrogram is a visual representation of the hierarchical clustering process. It is showing how clusters are formed step by step.
● The vertical axis (height) represents the distance or dissimilarity at which clusters merge.
● Clusters that merge at lower heights are more similar than clusters that merge at higher heights.
To decide the number of clusters, 🎯
Now, the number of branches crossed by the horizontal cut is the value of K.
4. Cophenetic Correlation and Performance 🧪⏳
We are using the Cophenetic Correlation Coefficient (c) to prove the tree's accuracy. It measures the correlation between the original distances of the data points and the distances where they join in the Dendrogram. If c > 0.75, tree is a good representation of the data. Hierarchical clustering is heavy for computers. Time complexity is O(n^2 log n) or O(n^3). It requires O(n^2) space to store the distance matrix. This means it is very slow for millions of rows. ✅💾
Summary 📝
Hierarchical Clustering helps to see the structure of data like a family tree. Use Ward’s Method for the best groups and the Dendrogram to pick the K value. Always check the Cophenetic Correlation to ensure the results are correct. 🌟 🙊😁
More: Link
✍️ @TheInfinityAI
Hierarchical clustering is finding groups by building a hierarchy. Unlike K-Means, we do not need to choose the number of groups (K) at the beginning. In this, creates a tree of data called a Dendrogram.
1. The Core Logic: Agglomerative Clustering 🏗
Most people use the Agglomerative (Bottom-Up) method for this,
● Every data point starts as its own small cluster.
● The machine finds the two clusters that are closest together.
● The machine joins (merges) them into one new cluster.
● The machine updates the distance between the new cluster and all other clusters.
● It repeats this until all data is in one big cluster.
2. Linkage Criteria: The Math of Merging 📏🧮
Linkage is a method used in hierarchical clustering to define how the distance between two clusters is computed. It is based on the distances between the data points in those clusters. Instead of measuring the distance between individual points, linkage tells the algorithm to how to measure the distance between groups of points (clusters).
I. Single Linkage (Minimum Distance)
It measures the distance between the two closest points in two clusters. It can create long and thin clusters. We call it as the Chaining Effect.
II. Complete Linkage (Maximum Distance)
It measures the distance between the two furthest points in two clusters. It avoids chaining and creates compact and round clusters.
III. Average Linkage
It calculates the average distance between all pairs of points in two clusters.
IV. Ward’s Method
It does not just look at distance. It looks at the variance also. Ward's joins two clusters only if the total within-cluster variance stays as small as possible. It will create very clear and equally sized clusters. It is the mathematically strongest one for general data.
3. The Dendrogram Analysis 📊✂️
The dendrogram is a visual representation of the hierarchical clustering process. It is showing how clusters are formed step by step.
● The vertical axis (height) represents the distance or dissimilarity at which clusters merge.
● Clusters that merge at lower heights are more similar than clusters that merge at higher heights.
To decide the number of clusters, 🎯
● Identify the largest vertical gap in the dendrogram (a region with a big jump in height where no merges occur).
● Draw a horizontal line across the dendrogram within this gap.
● Count the number of vertical branches intersected by the line.
Now, the number of branches crossed by the horizontal cut is the value of K.
4. Cophenetic Correlation and Performance 🧪⏳
We are using the Cophenetic Correlation Coefficient (c) to prove the tree's accuracy. It measures the correlation between the original distances of the data points and the distances where they join in the Dendrogram. If c > 0.75, tree is a good representation of the data. Hierarchical clustering is heavy for computers. Time complexity is O(n^2 log n) or O(n^3). It requires O(n^2) space to store the distance matrix. This means it is very slow for millions of rows. ✅💾
Summary 📝
Hierarchical Clustering helps to see the structure of data like a family tree. Use Ward’s Method for the best groups and the Dendrogram to pick the K value. Always check the Cophenetic Correlation to ensure the results are correct. 🌟 🙊😁
More: Link
✍️ @TheInfinityAI
Telegram
Infinity CS
❤3
Article 24: Association Rule Learning – Finding Hidden Patterns 🛒🔍
Association Rule Learning is a rule based machine learning method for discovering interesting relations between variables in large databases. It is famous for Market Basket Analysis. For example, if a customer buys bread and butter, they are also likely to buy milk.
1. Core Concepts 📊🧮
To find a good rule we use three main mathematical measurements,
2. The Apriori Algorithm (Level 1) 🔢
Apriori is one of the widely used algorithms for association rule mining. It is designed to identify frequent itemsets in a transactional dataset.. It is using a bottom-up approach. In this logic, it assumes that if an itemset is frequent, all its subsets must also be frequent. If an itemset is infrequent, all its supersets will also be infrequent (we call this Pruning).
In process, it will finds all individual items with support higher than a minimum threshold. Then, it will joins these items to create pairs (itemsets of size 2) and check their support. Repeats this for triplets (size 3) and larger sets until no more frequent sets are found.
3. FP-Growth Algorithm (Advanced Frequent Pattern Mining) 🚀🌳
Apriori is slow because it scans the whole database many times. FP-Growth (Frequent Pattern Growth) is advanced algorithm used to discover frequent itemsets more efficiently than Apriori.
✅ Advantages of FP-Growth,
✅ How FP-Growth Works
4. Advanced Rule Evaluation Metrics (Beyond Lift) 🧠⚖️
When evaluating association rules professionally, Lift alone is not enough always. Researchers figure on additional metrics like Conviction and Leverage to better understand the strength and usefulness of relationships between items. Because some rules can look strong statistically, but still misleading in practice.
Summary 📝
Association Rule Learning helps to find connections (like If A, then B). Apriori is a classic method that uses Pruning to save time. FP-Growth is a advanced choice uses a FP-Tree. Most probably, it can be faster. We are using Support, Confidence and Lift to decide if a rule is strong or just a coincidence.✨ 🙊😁. In the next article (Article 25), we will discuss about Anomaly Detection (Isolation Forest, LOF, & One-Class SVM). ✅⏭️ 🙊😁
✍️ @TheInfinityAI
Association Rule Learning is a rule based machine learning method for discovering interesting relations between variables in large databases. It is famous for Market Basket Analysis. For example, if a customer buys bread and butter, they are also likely to buy milk.
1. Core Concepts 📊🧮
To find a good rule we use three main mathematical measurements,
● Support - This shows how popular an itemset is in the whole dataset.
→ Support(A) = ((Number of transactions containing A) / (Total number of transactions))
● Confidence - This shows how likely item B is purchased when item A is purchased.
→ Confidence(A → B) = ((Support(A,B)) / (Support(A)))
● Lift -This shows the strength of the rule. If Lift is greater than 1, B is likely to be purchased if A is purchased. If Lift is 1, there is no relationship.
→ Lift(A → B) = ((Support(A,B)) / (Support(A) × Support(B)))
2. The Apriori Algorithm (Level 1) 🔢
Apriori is one of the widely used algorithms for association rule mining. It is designed to identify frequent itemsets in a transactional dataset.. It is using a bottom-up approach. In this logic, it assumes that if an itemset is frequent, all its subsets must also be frequent. If an itemset is infrequent, all its supersets will also be infrequent (we call this Pruning).
In process, it will finds all individual items with support higher than a minimum threshold. Then, it will joins these items to create pairs (itemsets of size 2) and check their support. Repeats this for triplets (size 3) and larger sets until no more frequent sets are found.
3. FP-Growth Algorithm (Advanced Frequent Pattern Mining) 🚀🌳
Apriori is slow because it scans the whole database many times. FP-Growth (Frequent Pattern Growth) is advanced algorithm used to discover frequent itemsets more efficiently than Apriori.
✅ Advantages of FP-Growth,
● It only scans the database twice.
● It stores the data in a special tree structure called an FP-Tree.
● After that, the algorithm works mostly with the tree in memory.
✅ How FP-Growth Works
Step 1 — Build the FP-Tree
● Removes infrequent items.
● Sorts remaining items by frequency.
● Inserts transactions into tree so shared prefixes overlap.
Step 2 — Mine the Tree (Divide-and-Conquer)
● Starts from the least frequent items
● Builds a Conditional FP-Tree for each item
● Recursively extracts frequent patterns
This is why it is called Frequent Pattern Growth. Patterns grow from smaller conditional structures.
4. Advanced Rule Evaluation Metrics (Beyond Lift) 🧠⚖️
When evaluating association rules professionally, Lift alone is not enough always. Researchers figure on additional metrics like Conviction and Leverage to better understand the strength and usefulness of relationships between items. Because some rules can look strong statistically, but still misleading in practice.
✅ Conviction — Measuring Rule Reliability
Conviction measures how strongly a rule depends on the relationship between A and B by comparing it to a scenario where they are independent.
✅ Leverage — Measuring True Co-Occurrence Gain
Leverage measures how much more often A and B occur together than we would expect if they were independent.
Summary 📝
Association Rule Learning helps to find connections (like If A, then B). Apriori is a classic method that uses Pruning to save time. FP-Growth is a advanced choice uses a FP-Tree. Most probably, it can be faster. We are using Support, Confidence and Lift to decide if a rule is strong or just a coincidence.✨ 🙊😁. In the next article (Article 25), we will discuss about Anomaly Detection (Isolation Forest, LOF, & One-Class SVM). ✅⏭️ 🙊😁
✍️ @TheInfinityAI
Telegram
Infinity CS
❤2👍1