Aspiring Data Science
An economist's notes on programming, forecasting and decision-making, and the scientific method.
Contact: @fingoldo

I call myself a data scientist because I know just enough math, economics & programming to be dangerous.
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning by Sebastian Raschka

The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings.
This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning.

Link
https://arxiv.org/abs/1811.12808

Navigational hashtags: #armknowledgesharing #armarticles
General hashtags: #machinelearning #ml #modelevaluation #evaluation #selection #cv #crossvalidation

@accelerated_learning
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Mindful Modeler by Christoph Molnar

The newsletter combines the best of two worlds: the performance mindset of machine learning and the mindfulness of statistical thinking.

Machine learning has become mainstream while falling short in the silliest ways: lack of interpretability, biased and missing data, wrong conclusions, … To statisticians, these shortcomings are often unsurprising. Statisticians are relentless in their quest to understand how the data came about. They make sure that their models reflect the data-generating process and interpret models accordingly.
In a sea of people who basically know how to model.fit() and model.predict() you can stand out by bringing statistical thinking to the arena.
Sign up for this newsletter to combine performance-driven machine learning with statistical thinking. Become a mindful modeller.

You'll learn about:
- Thinking like a statistician while performing like a machine learner
- Spotting non-obvious data problems
- Interpretable machine learning
- Other modelling mindsets such as causal inference and prompt engineering

Link
https://mindfulmodeler.substack.com/

Navigational hashtags: #armknowledgesharing #armnewsletters
General hashtags: #modelling #modeling #ml #machinelearning #statistics #modelinterpretation #data #interpretability #casualinference

@accelerated_learning
Forwarded from Artem Ryblov’s Data Science Weekly (Artem Ryblov)
Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson

The process of developing predictive models includes many stages. Most resources focus on the modelling algorithms, but neglect other critical aspects of the modelling process. This book describes techniques for finding the best representations of predictors for modelling and for finding the best subset of predictors for improving model performance. A variety of example data sets are used to illustrate the techniques, along with R programs for reproducing the results.

Table of Contents:
1. Introduction
2. Illustrative Example: Predicting Risk of Ischemic Stroke
3. A Review of the Predictive Modeling Process
4. Exploratory Visualizations
5. Encoding Categorical Predictors
6. Engineering Numeric Predictors
7. Detecting Interaction Effects
8. Handling Missing Data
9. Working with Profile Data
10. Feature Selection Overview
11. Greedy Search Methods
12. Global Search Methods

Links:
- http://www.feat.engineering/
- https://www.routledge.com/Feature-Engineering-and-Selection-A-Practical-Approach-for-Predictive-Models/Kuhn-Johnson/p/book/9781138079229

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #featureengineering #featureselection #missingdata #categoricalvariables

@accelerated_learning
Google Machine Learning Education

Learn to build ML products with Google's Machine Learning Courses.

Foundational courses
The foundational courses cover machine learning fundamentals and core concepts. They recommend taking them in the order below.

1. Introduction to Machine Learning
A brief introduction to machine learning.
2. Machine Learning Crash Course
A hands-on course to explore the critical basics of machine learning.
3. Problem Framing
A course to help you map real-world problems to machine learning solutions.
4. Data Preparation and Feature Engineering
An introduction to preparing your data for ML workflows.
5. Testing and Debugging
Strategies for testing and debugging machine learning models and pipelines.

Advanced Courses
The advanced courses teach tools and techniques for solving a variety of machine learning problems. The courses are structured independently. Take them based on interest or problem domain.

- Decision Forests
Decision forests are an alternative to neural networks.
- Recommendation Systems
Recommendation systems generate personalized suggestions.
- Clustering
Clustering is a key unsupervised machine learning strategy to associate related items.
- Generative Adversarial Networks
GANs create new data instances that resemble your training data.
- Image Classification
Is that a picture of a cat or is it a dog?
- Fairness in Perspective API
Hands-on practice debugging fairness issues.

Guides
Their guides offer simple step-by-step walkthroughs for solving common machine learning problems using best practices.

- Rules of ML
Become a better machine learning engineer by following these machine learning best practices used at Google.
- People + AI Guidebook
This guide assists UXers, PMs, and developers in collaboratively working through AI design topics and questions.
- Text Classification
This comprehensive guide provides a walkthrough to solving text classification problems using machine learning.
- Good Data Analysis
This guide describes the tricks that an expert data analyst uses to evaluate huge data sets in machine learning problems.
- Deep Learning Tuning Playbook
This guide explains a scientific way to optimize the training of deep learning models.

Link: https://developers.google.com/machine-learning?hl=en

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #google #course #courses #featureengineering #recsys #clustering #gan

@data_science_weekly
Designing Machine Learning Systems by Chip Huyen

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision--such as how to process and create training data, which features to use, how often to retrain models, and what to monitor--in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems

Link: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearningsystemdesign #systemdesign #machinelearning #ml #designingmachinelearningsystems

@data_science_weekly
#churn #uplift

Not at first glance, but I did recognize the Flo Health logo. These guys are so "sort of" innovative that YouTube shows ads for their period-tracking app even to me, a man. A standing ovation. I appreciate the persistence; after another six months of such ads I might just install it.

By the way, the shift from "we predict churn to minimize %your business metric%" to "we maximize %your business metric% directly via uplift modelling" has clear analogies in trading: Treatment = TradingAction, business metric = SharpeRatio. The single-learner model is essentially Ernie Chan's approach (client features + treatment features vs. market features + strategy actions).

The idea of the target transformation method, which the speaker never managed to put in plain language, is that after the transformation the uplift target is positive for the cases where we treated the client and they stayed, or did not treat them and they left, which sounds logical enough.
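A minimal sketch of that transformation in Python may make it more concrete. It is not the code from the talk: the function names and the toy log are hypothetical, and the formula uplift = 2 * P(Z=1) - 1 is the usual class-transformation identity, which only holds when treatment is assigned at random to about half of the clients.

```python
def transform_target(treated: int, retained: int) -> int:
    """Z = 1 when (treated and stayed) or (not treated and left), else 0."""
    return retained * treated + (1 - retained) * (1 - treated)


def uplift_from_z(z_values):
    """Assumes a 50/50 random treatment split: uplift = 2 * P(Z=1) - 1."""
    p_z = sum(z_values) / len(z_values)
    return 2 * p_z - 1


# Hypothetical toy log of (treated, retained) pairs, 4 treated and 4 control.
rows = [(1, 1), (1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (0, 1)]

z = [transform_target(t, y) for t, y in rows]
uplift = uplift_from_z(z)

# Sanity check against the direct difference in retention rates.
treated_rate = sum(y for t, y in rows if t == 1) / sum(1 for t, _ in rows if t == 1)
control_rate = sum(y for t, y in rows if t == 0) / sum(1 for t, _ in rows if t == 0)
direct_uplift = treated_rate - control_rate
```

In practice Z would be the label for an ordinary classifier over client features, and 2 * P(Z=1 | features) - 1 would serve as the per-client uplift score.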

The oddly named T-learner is just a modification of the first option: the data is split by treatment value and a separate model is built for each part. As I've said before, it makes sense to always test both approaches, because you never know in advance which will work better for a particular task and a particular modelling algorithm.
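The two options can be put side by side in a short Python sketch. Everything here is illustrative: `MeanByKey` is a toy stand-in for a real model, and the function names and data are hypothetical. With such a saturated toy learner both variants return the same uplift; with regularized real-world models they can diverge, which is the reason to test both.

```python
from collections import defaultdict


class MeanByKey:
    """Toy learner: predicts the mean outcome seen for each feature tuple."""

    def fit(self, X, y):
        sums, counts = defaultdict(float), defaultdict(int)
        for row, target in zip(X, y):
            sums[tuple(row)] += target
            counts[tuple(row)] += 1
        self.means_ = {k: sums[k] / counts[k] for k in sums}
        self.default_ = sum(y) / len(y)  # fallback for unseen feature tuples
        return self

    def predict(self, X):
        return [self.means_.get(tuple(row), self.default_) for row in X]


def s_learner_uplift(X, w, y, X_new, make_model=MeanByKey):
    """Single model on features + treatment flag; score with flag = 1 and 0."""
    model = make_model().fit([list(r) + [t] for r, t in zip(X, w)], y)
    p1 = model.predict([list(r) + [1] for r in X_new])
    p0 = model.predict([list(r) + [0] for r in X_new])
    return [a - b for a, b in zip(p1, p0)]


def t_learner_uplift(X, w, y, X_new, make_model=MeanByKey):
    """Two models, one per treatment group; uplift is their difference."""
    treated = [(r, o) for r, t, o in zip(X, w, y) if t == 1]
    control = [(r, o) for r, t, o in zip(X, w, y) if t == 0]
    m1 = make_model().fit([r for r, _ in treated], [o for _, o in treated])
    m0 = make_model().fit([r for r, _ in control], [o for _, o in control])
    return [a - b for a, b in zip(m1.predict(X_new), m0.predict(X_new))]


# Hypothetical data: one feature (segment id), treatment flag, retained flag.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
w = [1, 1, 0, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 1, 0, 1, 0]
X_new = [[0], [1]]

s_uplift = s_learner_uplift(X, w, y, X_new)
t_uplift = t_learner_uplift(X, w, y, X_new)
```

Any object with `fit(X, y)` and `predict(X)` can be passed as `make_model`, so the same two functions can wrap a real regressor when comparing the approaches.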

The speaker never said which of the methods he covered worked best at their company or why; it looks like they tested just one and made no comparison. He only cited a 7-24% increase in ARPU (depending on the region).

At 22:45 the speaker shares a very interesting idea: instead of the standard, hackneyed A/B test with red and blue buttons on a site, followed by picking a single "winning color" for everyone, wouldn't it be better to predict the optimal button color for each user, trying to maximize their short- and medium-term metrics?

A dumb example: girls like red buttons, boys like blue ones. Girls make up 65% of the site's audience, so the standard A/B test concludes that the button should be red for everyone. We seemingly maximize utility, yet 35% of the audience is left unhappy, makes fewer purchases, and so on.
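The arithmetic behind this example can be sketched in a few lines of Python. Only the 65/35 audience split comes from the example above; the conversion rates (10% for the preferred color, 6% for the other one) are made-up numbers for illustration.

```python
# Audience shares from the example; conversion rates are hypothetical.
shares = {"girls": 0.65, "boys": 0.35}
conversion = {
    ("girls", "red"): 0.10,
    ("girls", "blue"): 0.06,
    ("boys", "red"): 0.06,
    ("boys", "blue"): 0.10,
}


def expected_conversion(policy):
    """Average conversion when policy(segment) picks the button color."""
    return sum(share * conversion[(seg, policy(seg))] for seg, share in shares.items())


def best_color(segment):
    """Per-segment choice: the color with the highest conversion rate."""
    return max(("red", "blue"), key=lambda color: conversion[(segment, color)])


red_for_all = expected_conversion(lambda seg: "red")    # what the A/B test picks
blue_for_all = expected_conversion(lambda seg: "blue")
personalized = expected_conversion(best_color)          # color chosen per segment
```

Under these assumed rates, red for everyone beats blue for everyone, so the A/B test is right about the single best color, but the per-segment policy still converts better than either.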

https://www.youtube.com/watch?v=A6a1MbH4fFk
The Kaggle Book by Konrad Banachewicz and Luca Massaron

Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career.

The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you'll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won't easily find elsewhere, and the knowledge they've accumulated along the way. As well as Kaggle-specific tips, you'll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You'll design better validation schemes and work more comfortably with different evaluation metrics.

Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #featureengineering #kaggle #metrics #validation #hyperparameters #tabular #cv #nlp

@data_science_weekly
How to avoid machine learning pitfalls by Michael A. Lones

Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning.

This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them.

Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions.

It covers five stages of the machine learning process:
- What to do before model building
- How to reliably build models
- How to robustly evaluate models
- How to compare models fairly
- How to report results

Link: arXiv

Navigational hashtags: #armarticles
General hashtags: #ml #machinelearning #mlsystemdesign

@data_science_weekly