To estimate the adstock effect, Robyn has two options: the Geometric transformation and the
Weibull transformation.
The Geometric transformation is a simple single-parameter option and results in a simple but mediocre approximation.
The Weibull transformation is more complex and approximates better, but it is harder to explain to a non-tech-savvy audience. It uses the Weibull survival function and, due to its complexity, requires more iterations to optimize, which is handled by the Nevergrad optimization engine. We will talk about Nevergrad later.
To estimate the saturation effect and produce the S-curve, the Hill function with two parameters is used (greetings from the world of biochemistry!)
This is what it looks like in the model
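To make the two transformations concrete, here is a minimal NumPy sketch of my own (not Robyn's code; theta, alpha and gamma loosely mirror Robyn's hyperparameter names):

```python
import numpy as np

def geometric_adstock(spend, theta):
    """Carry over a fraction theta of the previous period's adstock."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + theta * carry
        out[t] = carry
    return out

def hill_saturation(x, alpha, gamma):
    """Two-parameter Hill function: alpha shapes the S-curve, gamma is the half-saturation point."""
    return x ** alpha / (x ** alpha + gamma ** alpha)

spend = np.array([100.0, 0.0, 0.0, 50.0, 50.0, 0.0])
response = hill_saturation(geometric_adstock(spend, theta=0.6), alpha=2.0, gamma=80.0)
```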
#ArticleReview
The regression model is trained to predict the target variable. We know the input features and how to transform them, but what about the loss function? There are at least two loss functions:
* The normalised root mean square error (NRMSE), to be minimised, and
* The decomposition root sum of squared distance (DECOMP.RSSD), an ingenious thing described as the “business fit”.
There is a series of tweets on how awesome this is. Assume you have spent 80% of the budget on promotion in my Telegram channel, but the model suggests that 80% of conversions came from Instagram. Obviously, you'd better trust my channel, not the model.
But aren’t we building the model to understand what we are doing wrong? That is correct, but there are two assumptions:
* We assume that the budget was not distributed completely at random. The current allocation is a reasonable starting point, and it iteratively moves towards an optimal solution. This is similar to the Expectation-maximisation algorithm.
* If we approach the stakeholders and propose changing the $100M budget allocation, they can be surprised and not appreciate it. Instead, a proposal to alter the budget within 3–5% usually does not raise questions.
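To make the two objectives concrete, here is a minimal sketch of how I read their definitions (not Robyn's actual code):

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean squared error normalised by the range of the target."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

def decomp_rssd(spend_share, effect_share):
    """Distance between each channel's share of spend and its share of decomposed effect."""
    return float(np.sqrt(np.sum((effect_share - spend_share) ** 2)))

# a model crediting 80% of the effect to a channel that got 20% of the spend scores badly
decomp_rssd(np.array([0.8, 0.2]), np.array([0.2, 0.8]))  # ≈ 0.85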
Having at least two loss functions means obtaining an infinite number of models. What do we do with that? The Nevergrad framework powers the optimisation process with a large number (thousands to tens of thousands) of iterations and displays the results on a two-dimensional plot (one dimension per loss function).
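Picking the candidate models on that plot is a standard Pareto-front filter; a minimal sketch, assuming each model is reduced to its (NRMSE, DECOMP.RSSD) pair:

```python
def pareto_front(models):
    """models: list of (nrmse, decomp_rssd) pairs; keep those no other model beats on both."""
    front = []
    for a in models:
        dominated = any(
            b[0] <= a[0] and b[1] <= a[1] and (b[0] < a[0] or b[1] < a[1])
            for b in models
        )
        if not dominated:
            front.append(a)
    return front

pareto_front([(0.10, 0.30), (0.12, 0.20), (0.15, 0.25)])  # the last model is dominated
```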
#ArticleReview
Twitter
Mike Taylor
So what is Decomp.RSSD? It's the decomposition root sum of squared distance, which basically means how far is your budget allocation from what the model is telling you about which channels are effective.
On this plot, we can see that the resulting models get closer to the bottom left corner as the number of iterations increases. The line denotes the Pareto front (you cannot improve one objective without worsening the others). The final model is chosen by the user based on the following criteria:
* Experimental calibration. It makes sense to integrate the results of experiments that establish causal relationships.
* Business insights. It is worth comparing against well-established priors on ROI, adstock, or benchmarks.
* ROI (ROAS) distribution estimates for paid channels. The narrower and taller the distribution, the more confident we are in the results.
#ArticleReview
It took Meta a year and mass lay-offs to lock the laptop they left me with after I resigned from the company in Nov 2021.
Even though I had created a ticket for equipment return by then, they never came for the iPhone and MacBook.
I wrote to them asking if they could unlock it so I could transfer some data from it.
The answer read almost like a quote from a Kafka novel:
The Company policy requires all Company-provided assets be returned upon your exit from the company. These assets are now inaccessible and locked in accordance with policy.
I will create a task on your behalf for equipment return. Can you please provide all of the information?
My answer was brief:
Thank you for the reply
I created a task for equipment return 14 months ago.
No worries, I can wait.
Some of my thoughts on the difference between Data Scientists and Data Analysts
Recently I had a discussion at my company, Blockchain.com, about the difference between Data Scientists and Data Analysts. Even though I gave a talk named “Why you would never find a Data Scientist” and I believe that Data Scientist is a terrible title, due to demand from my team I had to keep this job title at Blockchain.com.
If we try to describe the variety of job titles in three dimensions (Domain-Dev-Math), it will look something like the preview of the Medium post (we don't have all possible roles there, though).
In reality, we have even more skills and dimensions; thus, we use the following to be more precise.
Tech Lead — guides the approach and execution of a particular team. They partner closely with a single manager, but sometimes they partner with two or three managers within a focused area. They’re comfortable scoping complex tasks, coordinating their team towards solving them, and unblocking them along the way. Tech Leads often carry the team’s context and maintain many of the essential cross-team and cross-functional relationships necessary for the team’s success. They’re a close partner to the team’s product manager and the first person called when the roadmap needs to be shuffled. They write code and do code reviews. (Basically taken from Staff Engineer Book)
Data Steward is responsible for ensuring the quality and fitness of the organization’s data assets, including the metadata for those data assets: in our case, it includes projects like Attribution, Data Quality, Data Dictionary, Hierarchy of metrics, Metrics reconciliation, and Segment and Amplitude ownership (working with end users and creating products for them). This also involves collaboration with the CS team, things like Automate Prime Users Tagging or working on specific integrations (Zendesk).
Data Engineer -> people who specialize in writing high-quality production-ready code. They are responsible for the enhancement and maintenance of the Data Lake, ETL pipelines, Data Infrastructure (Segment/Amplitude), migration of databases and tables, keeping production applications running, and providing infrastructure for Machine Learning. A lot of coding + SQL + domain knowledge about data is needed.
Data Analyst -> very diverse role; I’ll provide some archetypes.
* Person closely working with product managers and stakeholders, answering questions about what happened, why this happened, what will happen and what we can do to prevent/improve/achieve; this usually requires visualization/storytelling/presentation skills.
* A person wrangling the data for finance is also a data analyst.
* SQL + minor to moderate coding + knowledge of math/stats + heavy domain knowledge
I view the Data Scientist role as the final step in the evolution of Data Analysts — all-around people. Data Scientists can do everything a DA can, but better: better coding, more profound domain knowledge (in a specific field), and they might have an additional focus on modelling (better stats/math as a result). Modelling is just a tool that helps to answer what will happen and why something happened, and helps to find dependencies using a more complex set of self-derived rules.
Medium
Is this person Data Scientist or Data Analyst?
Recently I had a discussion at my company, Blockchain.com, about the difference between Data scientists and Data Analysts. Even though I…
A short video of me talking about Performance Review
YouTube
Performance Review in Big Tech | Валерий Бабушкин | karpov.courses
Join the upcoming course cohorts!
Hard ML: http://bit.ly/3AwybaL
System Design: http://bit.ly/3V8Q6w2
With the promo code BIGTECH you get a 10% discount on both programs.
Together with Valeriy Babushkin, we decided to hold our first webinar in English…
I was working on my book and addressing the technical editor's comments about consistency metrics and Lipschitzness.
I decided to ask ChatGPT to help me with that:
Provide some literature references about consistency metrics and lipschitzness.
"Consistency Metrics: A New Approach to Evaluating and Comparing Machine Learning Algorithms" by Zhi-Hua Zhou and Ji-Rong Wen (2013)
"Consistency of Nearest Neighbor Classifiers" by S. V. N. Vishwanathan, N. S. Nagaraj, and C. S. R. Murthy (2006)
"Consistency of Random Forests" by Luis Enrique Trejo and Luis Alejandro Correa (2013)
"On the Lipschitz Continuity of Neural Networks" by Q. Zhang, L. Zhang, and H. Zhang (2019)
"Estimating the Lipschitz Constant for Feedforward Neural Networks" by S. G. Tucker and J. D. Maitra (2017)
It turned out all the articles were fake. They do not exist, while the people are real, at least those I checked.
After I added a request to provide links to the papers, ChatGPT did so, and those links worked. But they led me to completely random, unrelated articles.
It looks like Google's death has been postponed.
I was rummaging through the Kaggle Crackers channel in the Open Data Science Slack when I saw a post from Vadim, who recently turned 17. The things he achieved impressed me, and I decided to share them here.
Unfortunately, this year has been the toughest of my life and the lives of all Ukrainians, but that doesn't mean we should just give up and do nothing.
Overall, I don't have as impressive achievements as others who have written in ODS Kaggle Crackers, but why not write about them anyway?
One of my first achievements this year was becoming a Kaggle Discussion Grandmaster. After a couple of months of chatting, I entered the top 10 in the global ranking. It's a pretty useless achievement, but it does help to develop your social networks and allows you to get additional perks, which I'll talk about later.
I think everyone is familiar with the Weights & Biases library, so when I suggested on the Kaggle forum (where I was already a Discussion Grandmaster) how to best use the capabilities of Weights & Biases, they wrote to me and offered to become their Ambassador, which was very surprising (and yes, their conditions were simply out of this world). In parallel with the contract we were discussing, they decided to send me a t-shirt with the library's logo and a sticker with the logo (which I put on my laptop). However, they didn't want to sign a contract with me because I'm not of legal age (I'm 17) and it's not allowed by law. Being a child can be really disadvantageous sometimes.
During the summer holidays, I gained experience by participating in Kaggle competitions and won 2 bronze medals and 2 silver medals (I was close to winning a gold medal in one of them). Unfortunately, I couldn't actively participate in the last competition because there was a blackout in Ukraine and we didn't have electricity or water for several days. Nevertheless, my team and I won a t-shirt from Kaggle.
I also became a Kaggle Notebooks Master. This isn't a very significant achievement, but it's still nice to see your name in the top 100.
I also started writing on my blog, sharing my experience and insights on machine learning and Kaggle competitions. I've had a lot of positive feedback and even some offers to write guest posts on other blogs.
In general, I'm very grateful for the support and trust of the Kaggle community, and I hope that in the future, I will be able to help more people learn about machine learning and Kaggle competitions.
To conclude, I would like to say that this year was very strange for me, there were few happy moments (almost no training, terrible bombings, etc.), although the achievements are quite decent. The goal for next year: become a Kaggle Competitions Master.
Good luck to everyone! A peaceful sky overhead!
If you want to reach out to him and offer a job or sponsorship, which by the look of it he would nail, here are his LinkedIn profile and email: vadimirtlach@gmail.com
I read a fairly long article on Medium about how Prophet almost always works worse than ARIMA, and often even worse than simple exponential smoothing, in time series forecasting.
There's little to comment on: one after another, the author provides different examples to rant about, and even proposals for how to fix them, except for the observation that, as the data gets larger, Prophet seems to work quite well.
From the interesting:
1. I guess nobody gets to be a famous Prophet by making mundane safe predictions, but at minimum, one should be aware of some Prophet mechanics. For instance, the last 20% of data points are not used to estimate the trend component. Did you know that? Did you expect that? I didn't. Let it sink in.
2. A paper considering Prophet by Jung, Kim, Kwak and Park comes with the title A Worrying Analysis of Probabilistic Time-series Models for Sales Forecasting (pdf). As the spoiler suggests, things aren't looking rosy. The authors list Facebook's Prophet as the worst performing of all algorithms tested. Oh boy.
Ah, you object, but under what metric? Maybe the scoring rule used was unfair and not well suited to sales of Facebook portals? That may be, but according to those authors Prophet was the worst uniformly across all metrics — last in every race. Those criteria included RMSE and MAPE as you would expect, but also mean normalized quantile loss where (one might have hoped) the Bayesian approach could yield better distributional prediction than alternatives. The author's explanation is, I think, worth reproducing in full.
3. Yes, you can imagine my disappointment when, out-of-the-box, Prophet was beaten soundly by a "take the last value" forecast but probably that was a tad unlucky (even if it did send me scurrying to google to see if anyone else had a similar experience).
My experience of using Prophet is similar (a fine-tuned (S)ARIMA can outperform it), but it is interesting to find out how it went for others; feel free to write in the comments.
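For reference, the behaviour described in point 1 corresponds, as far as I can tell, to Prophet's changepoint_range parameter, which defaults to 0.8 (trend changepoints are only placed in the first 80% of the history) and can be widened:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# toy data in Prophet's expected format: columns ds (date) and y (value)
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=200, freq="D"),
    "y": range(200),
})

m = Prophet(changepoint_range=0.95)  # default is 0.8: changepoints only in the first 80% of history
m.fit(df)
forecast = m.predict(m.make_future_dataframe(periods=30))
```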
Medium
Is Facebook’s “Prophet” the Time-Series Messiah, or Just a Very Naughty Boy?
Facebook’s Prophet package aims to provide a simple, automated approach to the prediction of a large number of different time series. The…
I recently read an article from folks at TikTok called “Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations”.
Recommendation systems need to be able to quickly obtain relatively relevant candidates, which are then reranked to produce the final output.
Typically, for candidate generation, an inner-product model (such as metric learning) is used, followed by an ANN (approximate nearest neighbour) search; a popular option is FAISS.
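For context, a generic ANN candidate-generation step with FAISS looks roughly like this (an illustrative sketch, not the paper's baseline code):

```python
import numpy as np
import faiss

d = 64                                           # embedding dimension
item_vecs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_vecs)                    # cosine similarity via inner product
index = faiss.IndexFlatIP(d)
index.add(item_vecs)

user_vec = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_vec)
scores, item_ids = index.search(user_vec, 200)   # top-200 candidates for the reranker
```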
In this article, however, the authors want to show how retrieval can be done directly through item-user interactions without making assumptions about the Euclidean nature of the space and the proximity of entities within it (which, in my opinion, isn't really a problem, considering that in metric learning we specifically train for this kind of representation).
The authors train a model with D layers. Each layer consists of an MLP with a softmax over K nodes that outputs the probability of belonging to one of the K clusters. The input to layer D1 is an embedding of the user (the embedding takes their previous actions into account; a recurrent neural network with a GRU is used to project the behaviour sequence onto a fixed-dimension embedding). The target is the cluster of the item the user has interacted with (e.g., clicked or purchased). The output of D1, let's call it K1, is then concatenated with the user embedding and used as input for D2. The output K2 is then concatenated with K1 and the user embedding and used as input for D3.
Any user potentially has K^D different paths. For example, if there are 30 clusters and three layers, the model can output the following path for user X: 1-10-15, meaning cluster 1 among the first 30, cluster 10 among the next 30, and cluster 15 among the next 30. Additionally, because we have a distribution, we can go deeper and take the top-3 from each layer, resulting in n^D (27 in our case) different paths instead of the original one. Since we train on user-item interactions, we can get paths for both the user and the item.
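Here is my rough PyTorch sketch of that layer structure (sizes, names, and the training loss are my own guesses; the paper's implementation may differ):

```python
import torch
import torch.nn as nn

class DeepRetrievalSketch(nn.Module):
    """D stacked layers, each a small MLP with a softmax over K clusters."""
    def __init__(self, emb_dim=64, K=30, D=3, hidden=128):
        super().__init__()
        self.cluster_emb = nn.Embedding(K, emb_dim)    # embedding of a chosen cluster
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Linear(emb_dim * (d + 1), hidden),  # user emb + d previously chosen clusters
                nn.ReLU(),
                nn.Linear(hidden, K),
            )
            for d in range(D)
        )

    def path_log_prob(self, user_emb, path):
        """Log-probability of a path (one cluster id per layer) for each user in the batch."""
        inp, log_p = user_emb, 0.0
        for d, layer in enumerate(self.layers):
            log_probs = torch.log_softmax(layer(inp), dim=-1)
            log_p = log_p + log_probs.gather(1, path[:, d:d + 1]).squeeze(1)
            inp = torch.cat([inp, self.cluster_emb(path[:, d])], dim=-1)
        return log_p

model = DeepRetrievalSketch()
user = torch.randn(8, 64)               # user embeddings from the GRU encoder
paths = torch.randint(0, 30, (8, 3))    # target paths of the interacted items
loss = -model.path_log_prob(user, paths).mean()
```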
A question arises: how can an item belong to different clusters? For example, an item related to a kebab could belong to a "food" cluster, while an item related to flowers could belong to a "gift" cluster. However, an item related to chocolate or cakes could belong to both clusters in order to be recommended to users interested in either food or gifts.
This is actually one of the advantages over tree-based deep models.
The reasonable question is, how do we determine the initial clusters? Okay, we have user embeddings and user-item interactions, but where do we get labels for K? We can distribute them randomly and turn on the EM machine.
In the first iteration, we distribute the items randomly and train the model; then we update the items' mapping to clusters to maximise the model output.
How is Deep Retrieval applied during inference? (a beam-search sketch follows the steps below)
* We input the user's embedding -> we get N paths (the greedy algorithm outputs one path)
* We gather all items that are in these paths
* We run them through the intermediate reranker
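A minimal sketch of those three steps; the helpers score_next and path_to_items are my own assumptions, not the paper's API:

```python
def retrieve(score_next, path_to_items, depth=3, beam_size=3):
    """Beam search over the layer-wise cluster distributions, then gather items.

    score_next(path) -> list of K log-probabilities for the next layer's clusters;
    path_to_items maps a full path such as (1, 10, 15) to the item ids assigned to it.
    """
    beams = [((), 0.0)]                                   # (partial path, log-probability)
    for _ in range(depth):
        candidates = [
            (path + (k,), logp + k_logp)
            for path, logp in beams
            for k, k_logp in enumerate(score_next(path))
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]

    items = set()
    for path, _ in beams:
        items |= set(path_to_items.get(path, ()))
    return items                                          # handed to the intermediate reranker
```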
There are a few additional points.
I. Even though DR (Deep Retrieval) outputs significantly fewer items than the full catalogue, there are still many of them, so it is also trained with a reranker in order to output only the top candidates (this is still not the final reranker!)
II. The mapping to clusters is discrete, so it cannot be updated with gradient methods (hence the use of EM)
III. They add a penalty for assigning yet another item to the same path. Otherwise, there is a risk that all items will fall into one path; the penalty has the form c^4/4, where c is the number of items in the path
IV. They updated the model from the incoming data stream - this affected some things, such as the M step in EM. They also used exponential decay with a coefficient of 0.999
Metrics
Of course, any paper will show that they are better, both offline and online, but for some reason recall is low everywhere. For example, Recall@200 is about 13%, which raises questions; maybe if we dig deeper, other algorithms are no worse. The SOTA Recall@200 for the mentioned dataset is around 28%.
The results of the A/B test are shown as a point estimate, which also raises many questions, because confidence intervals were given for the offline tests but not for the online one.
The approach looks interesting, but it would be nice to see a more detailed ablation study and results on a larger number of datasets.
It is surprising to see the comparison on the MovieLens and Amazon Books datasets and then only one A/B test at TikTok (which is quite a different setup), without confidence intervals; together with a lack of understanding of what the competing system was, this leaves questions.
#ArticleReview
I've just read a mind-blowing article: The Waluigi Effect (mega-post).
To simplify: each LLM (Large Language Model) has an antagonist living inside, ready to lie, manipulate, and misbehave.
Quotes
This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query.
In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum.
The Waluigi Effect: After you train an LLM to satisfy a desirable property P , then it's easier to elicit the chatbot into satisfying the exact opposite of property P.
A narrative/plot is a sequence of fictional events, where each event will typically involve different characters interacting with each other. Narratology is the study of the plots found in literature and films, and structuralist narratology is the study of the common structures/regularities that are found in these plots. For the purposes of this article, you can think of "structuralist narratology" as just a fancy academic term for whatever tv tropes is doing.
Definition (half-joking): A large language model is a structural narratologist.
Think about your own experience reading a book — once the author describes the protagonist, then you can guess the traits of the antagonist by inverting the traits of the protagonist. You can also guess when the protagonist and antagonist will first interact, and what will happen when they do. Now, an LLM is roughly as good as you at structural narratology — GPT-4 has read every single book ever written — so the LLM can make the same guesses as yours. There's a sense in which all GPT-4 does is structural narratology.
The chatbob starts as a superposition of luigi and waluigi. So any behaviour that is likely for waluigi is somewhat likely for the chatbob. So it is somewhat likely that the chatbob declares pro-croissant loyalties.
And if the chatbob ever declares pro-croissant loyalties, then the luigi simulacrum will permanently vanish from the superposition because that behaviour is implausible for a luigi.
Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation "when is avatar showing today" is a good example.
If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe.
Lesswrong
The Waluigi Effect (mega-post) — LessWrong
Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung …
I spent last week in Uzbekistan and had an opportunity to present a brief talk on metrics and losses, a small extract from the upcoming book: Principles of ML system design.
YouTube
Valeriy Babushkin | Metrics And Losses When Designing ML System
Finally, our book has been published in Early Access mode: Machine Learning System Design: With end-to-end examples.
Right now, you can dig into the first 5 chapters, and we're planning to drop a new chapter every two weeks and fix any annoying typos. There's a discount on the book until May 9th: MEAP launch code: mlbabushkin (45% off Machine Learning System Design in all formats).
Manning Publications
Machine Learning System Design
Get the big picture and the important details with this end-to-end guide for designing highly effective, reliable machine learning systems.
In Machine Learning System Design: With end-to-end examples you will learn:
The big picture of machine…
It’s time to reveal where I work now.
The company is called bp, formerly known as British Petroleum.
It’s a large company headquartered in London.
We do many things: oil, gas, fuel, retail, Biolabs (plastic-consuming bacteria development), aviation, lubricants (Castrol), wind and solar energy, electric car charging stations, biofuel, hydrogen fuel, trading, logistics, venture capital (at one point BP invested in the now-famous Palantir) and much more.
I started as Senior Principal D., taking up the burden of leading the DataWorx Customer & Product and Trading & Shipping team.
The team is relatively large, with around 600 people, among them data engineers, data analysts, data scientists and ML engineers.
For example, we are hiring for four open director/principal positions (both managers and ICs) who will report directly to me. On top of that, we have a substantial number of staff and senior openings. We’ve got a ton of focus areas and skills needed for them.
So, if I know you and we have worked together before, but I still haven’t approached you, feel free to email me. Even if I don’t know you, but you think you’d fit in, especially in a senior, staff or principal position, you should still DM me. The main locations are the UK/the US/India/Australia, and Kuala Lumpur. By the way, if you’ve been eager to move to Australia (Melbourne), UK or KL, this is a great opportunity (you can send your CV to my email valerii.babushkin@bp.com or submit a request on BP website).