Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning by Sebastian Raschka
The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings.
This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning.
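As a rough illustration of one technique in this family, k-fold cross-validation (see the #crossvalidation hashtag), here is a minimal standard-library sketch of how the folds are formed. The function name `k_fold_indices` is hypothetical and is not taken from the article. The folds are contiguous, so in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Each sample lands in exactly one test fold; a model would be fitted
# on train_idx and scored on test_idx, and the k scores averaged.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```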
Link
https://arxiv.org/abs/1811.12808
Navigational hashtags: #armknowledgesharing #armarticles
General hashtags: #machinelearning #ml #modelevaluation #evaluation #selection #cv #crossvalidation
@accelerated_learning
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Mindful Modeler by Christoph Molnar
The newsletter combines the best of two worlds: the performance mindset of machine learning and the mindfulness of statistical thinking.
Machine learning has become mainstream while falling short in the silliest ways: lack of interpretability, biased and missing data, wrong conclusions, … To statisticians, these shortcomings are often unsurprising. Statisticians are relentless in their quest to understand how the data came about. They make sure that their models reflect the data-generating process and interpret models accordingly.
In a sea of people who basically know how to model.fit() and model.predict(), you can stand out by bringing statistical thinking to the arena.
Sign up for this newsletter to combine performance-driven machine learning with statistical thinking. Become a mindful modeller.
You'll learn about:
- Thinking like a statistician while performing like a machine learner
- Spotting non-obvious data problems
- Interpretable machine learning
- Other modelling mindsets such as causal inference and prompt engineering
Link
https://mindfulmodeler.substack.com/
Navigational hashtags: #armknowledgesharing #armnewsletters
General hashtags: #modelling #modeling #ml #machinelearning #statistics #modelinterpretation #data #interpretability #casualinference
@accelerated_learning
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson
The process of developing predictive models includes many stages. Most resources focus on the modelling algorithms, but neglect other critical aspects of the modelling process. This book describes techniques for finding the best representations of predictors for modelling and for finding the best subset of predictors for improving model performance. A variety of example data sets are used to illustrate the techniques, along with R programs for reproducing the results.
Table of Contents:
1. Introduction
2. Illustrative Example: Predicting Risk of Ischemic Stroke
3. A Review of the Predictive Modeling Process
4. Exploratory Visualizations
5. Encoding Categorical Predictors
6. Engineering Numeric Predictors
7. Detecting Interaction Effects
8. Handling Missing Data
9. Working with Profile Data
10. Feature Selection Overview
11. Greedy Search Methods
12. Global Search Methods
Links:
- http://www.feat.engineering/
- https://www.routledge.com/Feature-Engineering-and-Selection-A-Practical-Approach-for-Predictive-Models/Kuhn-Johnson/p/book/9781138079229
Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #featureengineering #featureselection #missingdata #categoricalvariables
@accelerated_learning
Forwarded from Artem Ryblov’s Data Science Weekly
Google Machine Learning Education
Learn to build ML products with Google's Machine Learning Courses.
Foundational courses
The foundational courses cover machine learning fundamentals and core concepts. They recommend taking them in the order below.
1. Introduction to Machine Learning
A brief introduction to machine learning.
2. Machine Learning Crash Course
A hands-on course to explore the critical basics of machine learning.
3. Problem Framing
A course to help you map real-world problems to machine learning solutions.
4. Data Preparation and Feature Engineering
An introduction to preparing your data for ML workflows.
5. Testing and Debugging
Strategies for testing and debugging machine learning models and pipelines.
Advanced Courses
The advanced courses teach tools and techniques for solving a variety of machine learning problems. The courses are structured independently. Take them based on interest or problem domain.
- Decision Forests
Decision forests are an alternative to neural networks.
- Recommendation Systems
Recommendation systems generate personalized suggestions.
- Clustering
Clustering is a key unsupervised machine learning strategy to associate related items.
- Generative Adversarial Networks
GANs create new data instances that resemble your training data.
- Image Classification
Is that a picture of a cat or is it a dog?
- Fairness in Perspective API
Hands-on practice debugging fairness issues.
Guides
Their guides offer simple step-by-step walkthroughs for solving common machine learning problems using best practices.
- Rules of ML
Become a better machine learning engineer by following these machine learning best practices used at Google.
- People + AI Guidebook
This guide assists UXers, PMs, and developers in collaboratively working through AI design topics and questions.
- Text Classification
This comprehensive guide provides a walkthrough to solving text classification problems using machine learning.
- Good Data Analysis
This guide describes the tricks that an expert data analyst uses to evaluate huge data sets in machine learning problems.
- Deep Learning Tuning Playbook
This guide explains a scientific way to optimize the training of deep learning models.
Link: https://developers.google.com/machine-learning?hl=en
Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #google #course #courses #featureengineering #recsys #clustering #gan
@data_science_weekly
Forwarded from Artem Ryblov’s Data Science Weekly
Designing Machine Learning Systems by Chip Huyen
Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.
Author Chip Huyen, co-founder of Claypot AI, considers each design decision--such as how to process and create training data, which features to use, how often to retrain models, and what to monitor--in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.
This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems
Link: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/
Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearningsystemdesign #systemdesign #machinelearning #ml #designingmachinelearningsystems
@data_science_weekly
#churn #uplift
Not at first glance, but I did recognize the Flo Health logo. These guys are so "sort of" innovative that YouTube shows ads for their period-tracking app even to me, a man. A standing ovation. I appreciate the persistence; after half a year of such ads I might just install it.
By the way, the shift from "we predict churn to minimize %your business metric%" to "we maximize %your business metric% directly via uplift modelling" has parallels in trading: Treatment = TradingAction, business metric = SharpeRatio. The single learner model is essentially Ernie Chan's approach (client features + treatment features vs. market features + strategy actions).
The idea of the target transformation method, which the speaker could not put into plain language, is that after the transformation the target is positive for the cases where we acted and the client stayed, or where we did not act and the client left, which sounds logical in principle.
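The transformation described above can be sketched in a few lines. This is an illustrative sketch, not the speaker's code: it assumes a randomized 50/50 treatment split, under which uplift = 2 * P(Z = 1 | x) - 1, and it replaces a real classifier with per-segment frequencies. The segment names and counts are made up.

```python
from collections import defaultdict


def transform_target(treated, retained):
    """Z = 1 if (treated and stayed) or (not treated and left), else 0."""
    return int(treated == retained)


def uplift_by_segment(rows):
    """rows: (segment, treated, retained) triples with treated/retained in {0, 1}.

    Assumes treatment was assigned at random to half of each segment.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [sum of Z, number of rows]
    for segment, treated, retained in rows:
        counts[segment][0] += transform_target(treated, retained)
        counts[segment][1] += 1
    return {seg: 2 * z_sum / n - 1 for seg, (z_sum, n) in counts.items()}


# Made-up data: treatment helps "new" clients (80% vs 40% retention)
# and does nothing for "loyal" ones (90% vs 90%).
rows = (
    [("new", 1, 1)] * 8 + [("new", 1, 0)] * 2
    + [("new", 0, 1)] * 4 + [("new", 0, 0)] * 6
    + [("loyal", 1, 1)] * 9 + [("loyal", 1, 0)] * 1
    + [("loyal", 0, 1)] * 9 + [("loyal", 0, 0)] * 1
)
uplift = uplift_by_segment(rows)
print(uplift)
```

On this toy data the estimate matches the plain difference in retention rates between treated and control clients of each segment, which is what uplift is meant to capture.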
The oddly named T-learner method is simply a modification of the first option: the data is split by treatment value and a separate model is built for each part. As I have said before, it makes sense to always test both approaches, because you never know in advance which will work better for a particular problem and a particular modelling algorithm.
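To make the structural difference concrete, here is a minimal sketch of both approaches. `MeanByKey` is a hypothetical toy stand-in for a real model (it just memorizes the mean outcome per feature tuple), and the data is made up. With such a lookup-table model the two approaches give identical answers; they start to differ once the model generalizes across features.

```python
from collections import defaultdict
from statistics import mean


class MeanByKey:
    """Toy 'model': predicts the mean outcome seen for each feature tuple."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for features, target in zip(X, y):
            groups[tuple(features)].append(target)
        self.means_ = {key: mean(values) for key, values in groups.items()}
        return self

    def predict(self, X):
        return [self.means_[tuple(features)] for features in X]


def s_learner_uplift(X, treatment, y, X_new):
    """Single model: the treatment flag is appended as an ordinary feature."""
    model = MeanByKey().fit([list(x) + [t] for x, t in zip(X, treatment)], y)
    treated = model.predict([list(x) + [1] for x in X_new])
    control = model.predict([list(x) + [0] for x in X_new])
    return [a - b for a, b in zip(treated, control)]


def t_learner_uplift(X, treatment, y, X_new):
    """Two models: the data is split by treatment value, one model per part."""
    def fit_arm(arm):
        rows = [(x, target) for x, t, target in zip(X, treatment, y) if t == arm]
        return MeanByKey().fit([x for x, _ in rows], [target for _, target in rows])

    treated = fit_arm(1).predict(X_new)
    control = fit_arm(0).predict(X_new)
    return [a - b for a, b in zip(treated, control)]


# Made-up data: (features, treatment, retained).
data = (
    [(["new"], 1, 1)] * 8 + [(["new"], 1, 0)] * 2
    + [(["new"], 0, 1)] * 4 + [(["new"], 0, 0)] * 6
    + [(["loyal"], 1, 1)] * 9 + [(["loyal"], 1, 0)] * 1
    + [(["loyal"], 0, 1)] * 9 + [(["loyal"], 0, 0)] * 1
)
X = [row[0] for row in data]
treatment = [row[1] for row in data]
y = [row[2] for row in data]

s_uplift = s_learner_uplift(X, treatment, y, [["new"], ["loyal"]])
t_uplift = t_learner_uplift(X, treatment, y, [["new"], ["loyal"]])
print(s_uplift, t_uplift)
```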
The speaker never said which of the methods he covered worked better at their company, or why; it looks like they tested only one and made no comparison. He only cited a 7-24% increase in ARPU (depending on the region).
At 22:45 the speaker shares an extremely interesting idea: instead of the standard, hackneyed A/B test with red and blue buttons on a site, followed by picking a single "winning color" for everyone, wouldn't it be better to predict the optimal button color for each user, trying to maximize that user's short- and medium-term metrics?
A dumb example: girls like red buttons, boys like blue ones. Girls make up 65% of the site's audience, so a standard A/B test concludes that the button should be red for everyone. That seemingly maximizes utility, yet 35% of the audience goes around dissatisfied, makes fewer purchases, and so on.
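The arithmetic behind this example can be checked with a short sketch. Only the 65% / 35% audience split comes from the post; the conversion rates below are hypothetical numbers chosen so that each group prefers "its" color.

```python
# Audience split from the example above.
AUDIENCE_SHARE = {"girls": 0.65, "boys": 0.35}

# Hypothetical conversion rates per (group, button color).
CONVERSION = {
    ("girls", "red"): 0.10, ("girls", "blue"): 0.06,
    ("boys", "red"): 0.04, ("boys", "blue"): 0.09,
}
COLORS = ("red", "blue")


def global_policy_value(color):
    """Expected conversion when every user sees the same button color."""
    return sum(share * CONVERSION[(group, color)]
               for group, share in AUDIENCE_SHARE.items())


def personalized_policy_value():
    """Expected conversion when each group sees its own best color."""
    return sum(share * max(CONVERSION[(group, color)] for color in COLORS)
               for group, share in AUDIENCE_SHARE.items())


best_global = max(COLORS, key=global_policy_value)
print(best_global, global_policy_value(best_global), personalized_policy_value())
```

With these assumed rates the A/B test picks red for everyone, while showing each group its preferred color yields a higher expected conversion. In practice the group preference is not known up front and would have to be predicted per user, which is where an uplift model comes in.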
https://www.youtube.com/watch?v=A6a1MbH4fFk
YouTube
Uplift Modelling - throw away your churn model. Ivan Klimuk
💬 “Uplift modelling is a powerful technique to make the best out of your marketing campaigns or any personalized user treatment with the help of machine learning,” says Ivan Klimuk, ML Engineer at Flo Health Inc.
Being one of the speakers at the Danske Technight…
Forwarded from Artem Ryblov’s Data Science Weekly
The Kaggle Book by Konrad Banachewicz and Luca Massaron
Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career.
The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you'll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won't easily find elsewhere, and the knowledge they've accumulated along the way. As well as Kaggle-specific tips, you'll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You'll design better validation schemes and work more comfortably with different evaluation metrics.
Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.
Link: Book
Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #featureengineering #kaggle #metrics #validation #hyperparameters #tabular #cv #nlp
@data_science_weekly
Forwarded from Artem Ryblov’s Data Science Weekly
How to avoid machine learning pitfalls by Michael A. Lones
Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning.
This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them.
Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions.
It covers five stages of the machine learning process:
- What to do before model building
- How to reliably build models
- How to robustly evaluate models
- How to compare models fairly
- How to report results
Link: arXiv
Navigational hashtags: #armarticles
General hashtags: #ml #machinelearning #mlsystemdesign
@data_science_weekly