The TikTok recommender system is widely regarded as one of the best in the world at the scale it operates at. It recommends both videos and ads, and even the other big tech companies have struggled to compete. Recommending on a platform like TikTok is hard because the training data is non-stationary: a user's interests can change in a matter of minutes, and the number of users, videos, and ads keeps changing.
The predictive performance of a recommender system on a social media platform deteriorates in a matter of hours, so the model needs to be updated as often as possible. TikTok built a streaming engine to ensure the model is continuously trained online. The model server generates features and serves recommendations; in return, the user interacts with the recommended items. This feedback loop produces new training samples that are immediately sent to the training server. The training server holds a copy of the model and writes the updated parameters to the parameter server, which pushes them to the production model every minute.
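The feedback loop can be sketched with a toy in-memory version of the three components. Everything here is an illustrative stand-in, not TikTok's actual design: the "model" is just a dict of per-(user, item) scores, and the update rule is a simple running average toward the click label.

```python
from collections import deque

training_samples = deque()   # joins the model server to the training server
scores = {}                  # training server's model: (user, item) -> score
serving_scores = {}          # production copy used by the model server

ITEMS = ["video_a", "video_b", "video_c"]

def recommend(user):
    """Model server: pick the item the production model scores highest."""
    return max(ITEMS, key=lambda item: serving_scores.get((user, item), 0.0))

def observe(user, item, clicked):
    """Each user interaction immediately becomes a new training sample."""
    training_samples.append((user, item, 1.0 if clicked else 0.0))

def train(lr=0.5):
    """Training server: fold fresh samples into its copy of the model."""
    while training_samples:
        user, item, label = training_samples.popleft()
        key = (user, item)
        old = scores.get(key, 0.0)
        scores[key] = old + lr * (label - old)

def sync():
    """Parameter server: push updated parameters to production (every minute)."""
    serving_scores.update(scores)
```

After a user clicks a recommended video, one `train()` plus `sync()` cycle is enough for the production model to start preferring that video for them.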
The recommendation model is several terabytes in size, so synchronizing the full model across the network is very slow. That is why the model is only partially updated. The leading cause of non-stationarity (concept drift) is the sparse variables (users, videos, ads, etc.) represented by embedding tables. When a user interacts with a recommended item, only the vectors associated with that user and that item get updated, along with some of the network weights. Therefore, only the updated vectors are synchronized every minute, while the network weights are synchronized on a longer time frame.
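A minimal sketch of this partial synchronization, assuming a dict-based embedding table where only the rows touched since the last sync are pushed to the serving copy (all names, shapes, and the learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training-side parameters: a sparse embedding table plus dense network weights.
embeddings = {uid: rng.normal(size=4) for uid in ["u1", "u2", "u3"]}
dense_weights = rng.normal(size=(4, 4))
touched = set()  # IDs whose vectors changed since the last sync

def train_on_interaction(uid, grad, lr=0.01):
    """An interaction only updates that user's row, so only it needs syncing."""
    embeddings[uid] -= lr * grad
    touched.add(uid)

def minute_sync(serving_embeddings):
    """Push only the rows that actually changed; the dense weights are
    synchronized separately, on a longer schedule."""
    for uid in touched:
        serving_embeddings[uid] = embeddings[uid].copy()
    touched.clear()
```

With millions of users but only thousands active in any given minute, shipping just the touched rows keeps the per-minute sync traffic tiny relative to the full table.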
Typical recommender systems use fixed-size embedding tables, and the categories of the sparse variables are assigned to vectors through a hash function. The hash size is usually smaller than the number of categories, so multiple categories collide onto the same vector; for example, multiple users share one vector. This handles the cold-start problem for new users and bounds the maximum memory the table can use, but it also tends to hurt model performance because the behaviors of unrelated users get conflated. Instead, TikTok uses dynamically sized embedding tables with a collisionless hash function, so each new user gets a vector of their own. Because low-activity users barely influence model performance, low-occurrence IDs are filtered out and stale IDs are dynamically evicted. This keeps the embedding table small while preserving the quality of the model.
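A collisionless table with frequency filtering and stale-ID eviction can be sketched like this (a Python dict is collision-free by key; the class name, thresholds, and API are hypothetical, not TikTok's implementation):

```python
import time

class CollisionlessEmbeddingTable:
    """Each ID gets its own slot. IDs must be seen `min_count` times before
    they are granted a vector, and vectors untouched for `ttl` seconds
    are evicted to keep the table small."""

    def __init__(self, dim=4, min_count=2, ttl=3600.0):
        self.dim, self.min_count, self.ttl = dim, min_count, ttl
        self.counts = {}   # occurrence count per ID
        self.table = {}    # ID -> (vector, last_access_time)

    def lookup(self, key, now=None):
        now = time.time() if now is None else now
        self.counts[key] = self.counts.get(key, 0) + 1
        if key not in self.table and self.counts[key] >= self.min_count:
            self.table[key] = ([0.0] * self.dim, now)  # fresh zero vector
        if key in self.table:
            vec, _ = self.table[key]
            self.table[key] = (vec, now)               # refresh access time
            return vec
        return None  # low-occurrence ID: no dedicated vector yet

    def evict_stale(self, now=None):
        now = time.time() if now is None else now
        stale = [k for k, (_, t) in self.table.items() if now - t > self.ttl]
        for k in stale:
            del self.table[k]
        return stale
```

The first lookup of a rare ID returns nothing (the model can fall back to a shared default vector); once an ID proves itself by recurring, it gets its own row, and rows that go quiet past the TTL are reclaimed.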
How Is PCA Manually Computed, and Why Should We Know It?
In data science, machine learning, and statistics, Principal Component Analysis (PCA) is a dimensionality-reduction method: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.
Reducing the number of variables in a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them much faster when there are no extraneous variables to process.
PCA finds the directions of maximal variance in the data, and these directions are mutually orthogonal. That orthogonality is what makes PCA a global algorithm: every new feature direction it finds is constrained to be orthogonal to all the others.
Let's see how we can manually compute PCA given some random table of values (see the illustration).
Step 1: Standardize the dataset.
Step 2: Calculate the covariance matrix of the features in the dataset.
Step 3: Calculate the eigenvalues of the covariance matrix.
Step 4: Sort the eigenvalues in decreasing order.
Step 5: Calculate the eigenvector for each eigenvalue using Cramer's rule.
Step 6: Build the eigenvector matrix.
Step 7: Pick the top k eigenvalues and form a matrix from their eigenvectors.
Step 8: Transform the original matrix.
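The eight steps above can be sketched in NumPy on a small made-up table of values. Here `np.linalg.eigh` stands in for the by-hand characteristic-polynomial and Cramer's-rule work of steps 3-5; the data matrix is purely illustrative.

```python
import numpy as np

# An illustrative 5x2 table of values (5 samples, 2 features).
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 1: standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
C = np.cov(Z, rowvar=False)

# Steps 3-5: eigenvalues and eigenvectors of the covariance matrix
# (by hand: solve det(C - lambda*I) = 0, then Cramer's rule per eigenvalue).
eigvals, eigvecs = np.linalg.eigh(C)

# Steps 4 and 6: sort eigenpairs by decreasing eigenvalue;
# the eigenvector matrix holds one eigenvector per column.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 7: keep the top-k eigenvectors as the projection matrix.
k = 1
W = eigvecs[:, :k]

# Step 8: transform the original (standardized) matrix.
X_pca = Z @ W
```

A useful sanity check: the sample variance of each projected component equals its eigenvalue, which is exactly why picking the largest eigenvalues retains the most information.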
Knowing how to compute PCA manually can be essential for several reasons:
▸ Conceptual understanding enhances your grasp of the underlying mathematical principles.
▸ Sometimes we may need to customize the PCA process to suit specific requirements or constraints. Manual computation enables us to adapt PCA and adjust it to our needs as necessary.
▸ Understanding the inner workings of PCA through manual computation can enhance our problem-solving skills in data analysis and dimensionality reduction. We will be better equipped to tackle complex data-related challenges.
▸ A solid grasp of manual PCA is a foundation for understanding more advanced dimensionality reduction techniques and related machine learning and data analysis methods.
▸ Manual computation can be a valuable educational tool if we teach or learn about PCA. It allows instructors and students to see how PCA works from a foundational perspective.