Artem Ryblov’s Data Science Weekly
@artemfisherman’s Data Science Weekly: Elevate your expertise with a standout data science resource each week, carefully chosen for depth and impact.
Long-form content: https://artemryblov.substack.com
Dive into Deep Learning

- Interactive deep learning book with code, maths, and discussions.
- Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow.
- Adopted at 400 universities from 60 countries.

Content and Structure

The book can be divided into roughly three parts, focusing on preliminaries, deep learning techniques, and advanced topics focused on real systems and applications:

Part 1: Basics and Preliminaries. Section 1 offers an introduction to deep learning. Then, in Section 2, we quickly bring you up to speed on the prerequisites required for hands-on deep learning, such as how to store and manipulate data, and how to apply various numerical operations based on basic concepts from linear algebra, calculus, and probability. Section 3 and Section 5 cover the most basic concepts and techniques in deep learning, including regression and classification; linear models; multilayer perceptrons; and overfitting and regularization.

Part 2: Modern Deep Learning Techniques. Section 6 describes the key computational components of deep learning systems and lays the groundwork for our subsequent implementations of more complex models. Next, Section 7 and Section 8 introduce convolutional neural networks (CNNs), powerful tools that form the backbone of most modern computer vision systems. Similarly, Section 9 and Section 10 introduce recurrent neural networks (RNNs), models that exploit sequential (e.g., temporal) structure in data and are commonly used for natural language processing and time series prediction. In Section 11, we introduce a relatively new class of models based on so-called attention mechanisms that has displaced RNNs as the dominant architecture for most natural language processing tasks. These sections will bring you up to speed on the most powerful and general tools that are widely used by deep learning practitioners.

Part 3: Scalability, Efficiency, and Applications. In Section 12, we discuss several common optimization algorithms used to train deep learning models. Next, in Section 13, we examine several key factors that influence the computational performance of deep learning code. Then, in Section 14, we illustrate major applications of deep learning in computer vision. Finally, in Section 15 and Section 16, we demonstrate how to pretrain language representation models and apply them to natural language processing tasks. This part is available online.

Navigational hashtags: #armknowledgesharing #armbooks #armcourses
General hashtags: #deeplearning #dl #tensorflow #pytorch #jax #numpy #computervision #naturallanguageprocessing #attention #neuralnetworks #algorithms

@data_science_weekly
Efficient Python Tricks and Tools for Data Scientists

"Why efficient Python? Because using Python more efficiently will make your code more readable and run more efficiently.

Why for data scientists? Because Python has a wide range of applications. The Python tools used in the data science field are not necessarily useful for other fields, such as web development.

The goal of this book is to spread the awareness of efficient ways to do Python.
They include:
- efficient methods and libraries to work with iterator, dictionary, function, and class
- efficient methods to work with popular data science libraries such as pandas and NumPy
- efficient tools to incorporate in a data science project
- efficient tools to incorporate in any project
- efficient tools to work with Jupyter Notebook."
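
As a rough illustration of the first bullet, here is a minimal sketch that uses only the standard library. It is not taken from the book; the helper names `top_words` and `first_n` and the sample data are made up for this example.

```python
from collections import Counter
from itertools import islice


def top_words(lines, n=3):
    """Count words across an iterable of lines and return the n most common."""
    counts = Counter()
    for line in lines:  # consumes any iterable lazily, e.g. an open file
        counts.update(line.lower().split())
    return counts.most_common(n)


def first_n(iterable, n):
    """Take the first n items without materialising the whole iterable."""
    return list(islice(iterable, n))


sample = ["pandas numpy pandas", "numpy pandas jupyter"]  # made-up data
print(top_words(sample, n=2))                      # [('pandas', 3), ('numpy', 2)]
print(first_n((x * x for x in range(10**9)), 3))   # [0, 1, 4]
```

`Counter` replaces a hand-written dictionary of counts, and `islice` stops after three items, so the billion-element generator is never fully evaluated.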

About The Author
Khuyen Tran wrote over 150 data science articles with 100k+ views per month on Towards Data Science. She also wrote 500+ daily data science tips at Data Science Simplified. Her current mission is to make open-source more accessible to the data science community.

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #python #pandas #datascientists #datascientist #datamanagement #datamining #pythonprogramminglanguage #datascience #jupyternotebook

@data_science_weekly
Geographic Data Science with Python

This book provides the tools, the methods, and the theory to meet the challenges of contemporary data science applied to geographic problems and data. Social media, new forms of data, and new computational techniques are revolutionizing social science. In the new world of pervasive, large, frequent, and rapid data, we have new opportunities to understand and analyse the role of geography in everyday life. This book provides the first comprehensive curriculum in geographic data science.

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #datascience #geospatial #geospatialdata #geographic #python #data #science

@data_science_weekly
An Introduction to Statistical Learning with applications in PYTHON!

As the scale and scope of data collection continue to increase across virtually all fields, statistical learning has become a critical toolkit for anyone who wishes to understand data. An Introduction to Statistical Learning provides a broad and less technical treatment of key topics in statistical learning.

The Python edition (ISLP) was published in 2023.

The chapters cover the following topics:
- What is statistical learning?
- Regression
- Classification
- Resampling methods
- Linear model selection and regularization
- Moving beyond linearity
- Tree-based methods
- Support vector machines
- Deep learning
- Survival analysis
- Unsupervised learning
- Multiple testing

Link: https://www.statlearning.com

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ISLR #ISLP #regression #classification #resampling #linearmodels #regularization #trees #svm #deeplearning #unsupervisedlearning #abtesting

@data_science_weekly
Machine Learning System Design by Valerii Babushkin and Arseny Kravchenko

Get the big picture and the important details with this end-to-end guide for designing highly effective, reliable machine learning systems.

In "Machine Learning System Design: With end-to-end examples" you will learn:
- The big picture of machine learning system design
- Analyzing a problem space to identify the optimal ML solution
- Acing ML system design interviews
- Selecting appropriate metrics and evaluation criteria
- Prioritizing tasks at different stages of ML system design
- Solving dataset-related problems through data gathering, error analysis, and feature engineering
- Recognizing common pitfalls in ML system development
- Designing ML systems to be lean, maintainable, and extensible over time

Link: https://www.manning.com/books/machine-learning-system-design

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #systemdesign #machinelearningsystemdesign

@data_science_weekly
Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson

The process of developing predictive models includes many stages. Most resources focus on the modelling algorithms, but neglect other critical aspects of the modelling process. This book describes techniques for finding the best representations of predictors for modelling and for finding the best subset of predictors for improving model performance. A variety of example data sets are used to illustrate the techniques, along with R programs for reproducing the results.

Table of Contents:
1. Introduction
2. Illustrative Example: Predicting Risk of Ischemic Stroke
3. A Review of the Predictive Modeling Process
4. Exploratory Visualizations
5. Encoding Categorical Predictors
6. Engineering Numeric Predictors
7. Detecting Interaction Effects
8. Handling Missing Data
9. Working with Profile Data
10. Feature Selection Overview
11. Greedy Search Methods
12. Global Search Methods

Links:
- Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #featureengineering #featureselection #missingdata #categoricalvariables

@data_science_weekly
Understanding Deep Learning by Simon J.D. Prince

Deep learning is a fast-moving field with sweeping relevance in today’s increasingly digital world. Understanding Deep Learning provides an authoritative, accessible, and up-to-date treatment of the subject, covering all the key topics along with recent advances and cutting-edge concepts. Many deep learning texts are crowded with technical details that obscure fundamentals, but Simon Prince ruthlessly curates only the most important ideas to provide a high density of critical information in an intuitive and digestible form. From machine learning basics to advanced models, each concept is presented in lay terms and then detailed precisely in mathematical form and illustrated visually. The result is a lucid, self-contained textbook suitable for anyone with a basic background in applied mathematics.

- Up-to-date treatment of deep learning covers cutting-edge topics not found in existing texts, such as transformers and diffusion models
- Short, focused chapters progress in complexity, easing students into difficult concepts
- Pragmatic approach straddling theory and practice gives readers the level of detail required to implement naive versions of models
- Streamlined presentation separates critical ideas from background context and extraneous detail
- Minimal mathematical prerequisites, extensive illustrations, and practice problems make challenging material widely accessible
- Programming exercises offered in accompanying Python Notebooks

Link: https://udlbook.github.io/udlbook/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #ml #machinelearning #dl #deeplearning #transformers #diffusion

@data_science_weekly
Thinking Clearly with Data: A Guide to Quantitative Reasoning and Analysis by Ethan Bueno de Mesquita and Anthony Fowler

An introduction to data science or statistics shouldn’t involve proving complex theorems or memorizing obscure terms and formulas, but that is exactly what most introductory quantitative textbooks emphasize. In contrast, Thinking Clearly with Data focuses, first and foremost, on critical thinking and conceptual understanding in order to teach students how to be better consumers and analysts of the kinds of quantitative information and arguments that they will encounter throughout their lives.

Among much else, the book teaches how to assess whether an observed relationship in data reflects a genuine relationship in the world and, if so, whether it is causal; how to make the most informative comparisons for answering questions; what questions to ask others who are making arguments using quantitative evidence; which statistics are particularly informative or misleading; how quantitative evidence should and shouldn’t influence decision-making; and how to make better decisions by using moral values as well as data.

- An ideal textbook for introductory quantitative methods courses in data science, statistics, political science, economics, psychology, sociology, public policy, and other fields
- Introduces the basic toolkit of data analysis―including sampling, hypothesis testing, Bayesian inference, regression, experiments, instrumental variables, differences in differences, and regression discontinuity
- Uses real-world examples and data from a wide variety of subjects
- Includes practice questions and data exercises
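
To make one item from the toolkit above concrete, here is a minimal sketch of the arithmetic behind differences in differences. The function name and the group averages are invented for illustration and do not come from the book. The estimate is only meaningful under the usual assumption that both groups would have followed parallel trends without the treatment.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Treated group's change minus the control group's change."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Made-up group averages, for illustration only.
effect = difference_in_differences(
    treated_before=10.0, treated_after=16.0,
    control_before=9.0, control_after=12.0,
)
print(effect)  # 3.0
```

The treated group rose by 6 and the control group by 3, so 3 is the part of the change attributed to the treatment.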

Link: https://www.amazon.com/Thinking-Clearly-Data-Quantitative-Reasoning/dp/0691214352

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #datascience #correlation #regression #causation #randomizedexperiments #statistics

@data_science_weekly
Machine Learning Engineering Online Book by Stas Bekman

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of Stas Bekman's experiences training Large Language Models (LLMs) and VLMs; much of the know-how was acquired while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, he is working on developing/training open-source Retrieval Augmented models at Contextual.AI.

Table of Contents
Part 1. Insights
- The AI Battlefield Engineering - What You Need To Know
Part 2. Key Hardware Components
- Accelerator - the work horses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)
- Network - intra-node and inter-node connectivity, calculating bandwidth requirements
- IO - local and distributed disks and filesystems
- CPU - cpus, affinities (WIP)
- CPU Memory - how much CPU memory is enough - the shortest chapter ever.
Part 3. Performance
- Fault Tolerance
- Performance
- Multi-Node networking
- Model parallelism
Part 4. Operating
- SLURM
- Training hyper-parameters and model initializations
- Instabilities
Part 5. Development
- Debugging software and hardware failures
- And more debugging
- Reproducibility
- Tensor precision / Data types
- HF Transformers notes - making small models, tokenizers, datasets, and other tips
Part 6. Miscellaneous
- Resources - LLM/VLM chronicles

Link: https://github.com/stas00/ml-engineering

Navigational hashtags: #armknowledgesharing #armbooks #armrepo
General hashtags: #llm #gpt #gpt3 #gpt4 #ml #engineering #mlsystemdesign #systemdesign #reproducibility #performance

@data_science_weekly
What are embeddings? by Vicki Boykis

Over the past decade, embeddings — numerical representations of machine learning features used as input to deep learning models — have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.

Google’s Word2Vec paper made an important step in moving from simple statistical representations to semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.
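
For a feel of the "simple statistical representations" mentioned above, here is a minimal pure-Python sketch of one-hot encoding and TF-IDF. It is not code from the paper. The weighting used (term frequency times log of inverse document frequency) is one common textbook variant and is an assumption here; libraries often apply smoothing. The toy documents are made up.

```python
import math
from collections import Counter


def one_hot(word, vocabulary):
    """Vector with a single 1 at the word's position in the vocabulary."""
    return [1 if word == term else 0 for term in vocabulary]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / document length,
    idf = log(number of documents / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog sat", "the bird flew"]  # made-up corpus
vocabulary = sorted({term for doc in docs for term in doc.split()})
print(one_hot("cat", vocabulary))  # [0, 1, 0, 0, 0, 0]
print(tf_idf(docs)[0])
```

A word that occurs in every document ("the") gets weight zero, while a rarer word ("cat") scores higher. Neither representation captures that "cat" and "dog" are related, which is the gap learned embeddings aim to close.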

Link: https://vickiboykis.com/what_are_embeddings/index.html

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #dl #deeplearning #pytorch #embeddings #tfidf #svd #pca #word2vec #cbow #skipgram #bert #gpt #llm #transformers

@data_science_weekly
Supervised Machine Learning for Science. How to stop worrying and love your black box by Christoph Molnar & Timo Freiesleben

Machine learning has revolutionized science, from folding proteins and predicting tornadoes to studying human nature. While science has always had an intimate relationship with prediction, machine learning amplified this focus. But can this hyper-focus on prediction models be justified? Can a machine learning model be part of a scientific model? Or are we on the wrong track?

In this book, the authors explore and justify supervised machine learning in science. However, a naive application of supervised learning won’t get you far because machine learning in raw form is unsuitable for science. After all, it lacks interpretability, uncertainty quantification, causality, and many more desirable attributes. Yet, we already have all the puzzle pieces needed to improve machine learning, from incorporating domain knowledge and ensuring the representativeness of the training data to creating robust, interpretable, and causal models. The problem is that the solutions are scattered everywhere.

In this book, the authors bring together the philosophical justification and the solutions that make supervised machine learning a powerful tool for science.

The book consists of two parts:
- Part 1 discusses the relationship between science and machine learning.
- Part 2 addresses the shortcomings of supervised machine learning.

Link: https://ml-science-book.com/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #science #supervised

@data_science_weekly
Designing Machine Learning Systems by Chip Huyen

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision--such as how to process and create training data, which features to use, how often to retrain models, and what to monitor--in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems

Link: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearningsystemdesign #systemdesign #machinelearning #ml #designingmachinelearningsystems

@data_science_weekly
Interpretable Machine Learning. A Guide for Making Black Box Models Explainable by Christoph Molnar

Machine learning has great potential for improving products, processes and research. But computers usually do not explain their predictions, which is a barrier to the adoption of machine learning. This book is about making machine learning models and their decisions interpretable.

After exploring the concepts of interpretability, you will learn about simple, interpretable models such as decision trees, decision rules and linear regression. The focus of the book is on model-agnostic methods for interpreting black box models such as feature importance and accumulated local effects, and explaining individual predictions with Shapley values and LIME. In addition, the book presents methods specific to deep neural networks.
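
As a rough sketch of what a model-agnostic method looks like, the snippet below computes a permutation-style feature importance: shuffle one feature column and measure how much the error grows. It is a simplified illustration, not code from the book. The `black_box` function is a hypothetical stand-in for a trained model, and the data is synthetic.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, feature_index, seed=0):
    """Increase in error after shuffling one feature column across rows."""
    rng = random.Random(seed)
    baseline = mean_squared_error(targets, [predict(row) for row in rows])
    shuffled_column = [row[feature_index] for row in rows]
    rng.shuffle(shuffled_column)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted_rows = [
        row[:feature_index] + [value] + row[feature_index + 1:]
        for row, value in zip(rows, shuffled_column)
    ]
    permuted = mean_squared_error(targets, [predict(row) for row in permuted_rows])
    return permuted - baseline


def black_box(row):
    # Hypothetical "trained model": uses feature 0 and ignores feature 1.
    return 3.0 * row[0]


# Synthetic data, for illustration only.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [black_box(row) for row in rows]

importance_0 = permutation_importance(black_box, rows, targets, feature_index=0)
importance_1 = permutation_importance(black_box, rows, targets, feature_index=1)
print(importance_0, importance_1)
```

The method only calls `predict`, so it works for any model. Here the ignored feature gets an importance of zero, while shuffling the feature the model relies on increases the error.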

All interpretation methods are explained in depth and discussed critically. How do they work under the hood? What are their strengths and weaknesses? How can their outputs be interpreted? This book will enable you to select and correctly apply the interpretation method that is most suitable for your machine learning project. Reading the book is recommended for machine learning practitioners, data scientists, statisticians, and anyone else interested in making machine learning models interpretable.

Link:
- Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #interpretation #explanation #interpretability #blackbox

@data_science_weekly
Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong

The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines.

For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts.

Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

Table of Contents
Part I: Mathematical Foundations
1. Introduction and Motivation
2. Linear Algebra
3. Analytic Geometry
4. Matrix Decompositions
5. Vector Calculus
6. Probability and Distributions
7. Continuous Optimization
Part II: Central Machine Learning Problems
8. When Models Meet Data
9. Linear Regression
10. Dimensionality Reduction with Principal Component Analysis
11. Density Estimation with Gaussian Mixture Models
12. Classification with Support Vector Machines

Link: Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #math #mathematics #maths #calculus #algebra #probability #geometry #optimization #machinelearning #ml

@data_science_weekly
Lessons in Statistical Thinking by Daniel Kaplan

One of the oft-stated goals of education is the development of “critical thinking” skills. Although it is rare to see a careful definition of critical thinking, widely accepted elements include framing and recognizing coherent arguments, the application of logic patterns such as deduction, the skeptical evaluation of evidence, consideration of alternative explanations, and a disinclination to accept unsubstantiated claims.

“Statistical thinking” is a variety of critical thinking involving data and inductive reasoning directed to draw reasonable and useful conclusions that can guide decision-making and action.

Surprisingly, many university statistics courses are not primarily about statistical reasoning. They do cover some technical methods used in statistical reasoning, but they have replaced notions of “useful,” “decision-making,” and “action” with doctrines such as “null hypothesis significance testing” and “correlation is not causation.” For example, a core method for drawing responsible conclusions about causal relationships by adjusting for “covariates” is hardly ever even mentioned in conventional statistics courses.

These Lessons in Statistical Thinking present the statistical ideas and methods behind decision-making to guide action.

Link: Direct Link

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #stats #statistics #math #maths

@data_science_weekly
What the f*ck Python! 😱

Python, being a beautifully designed high-level and interpreter-based programming language, provides us with many features for the programmer's comfort. But sometimes, the outcomes of a Python snippet may not seem obvious at first sight.

Here's a fun project attempting to explain what exactly is happening under the hood for some counter-intuitive snippets and lesser-known features in Python.

While some of the examples you see below may not be WTFs in the truest sense, they'll reveal some of the interesting parts of Python that you might be unaware of. I find it a nice way to learn the internals of a programming language, and I believe that you'll find it interesting too!

If you're an experienced Python programmer, you can take it as a challenge to get most of them right in the first attempt. You may have already experienced some of them before, and I might be able to revive sweet old memories of yours! 😅
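
To give a flavour of this kind of surprise, here is a well-known Python gotcha, the mutable default argument. It is chosen for illustration and not quoted from the project; the function names are made up.

```python
def append_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list object.
    bucket.append(item)
    return bucket


def append_item_safe(item, bucket=None):
    # Using None as a sentinel gives each call its own fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)
print(second)           # [1, 2]
print(first is second)  # True

print(append_item_safe(1))  # [1]
print(append_item_safe(2))  # [2]
```

The second call to `append_item` returns `[1, 2]` rather than `[2]` because default values are evaluated at definition time, not on each call.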

Links:
- Interactive Website
- Interactive Notebook
- GitHub Version: ENG, RUS

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #python #programming #coding

@data_science_weekly
Data Analysis with Python and PySpark by Jonathan Rioux

In Data Analysis with Python and PySpark you will learn how to:

- Manage your data as it scales across multiple machines
- Scale up your data programs with full confidence
- Read and write data to and from a variety of sources and formats
- Deal with messy data with PySpark’s data manipulation functionality
- Discover new data sets and perform exploratory data analysis
- Build automated data pipelines that transform, summarize, and get insights from data
- Troubleshoot common PySpark errors
- Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.

Link: Direct

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #spark #pyspark #bigdata

@data_science_weekly
Practical Recommender Systems by Kim Falk

Practical Recommender Systems explains how recommender systems work and shows how to create and apply them for your site. After covering the basics, you’ll see how to collect user data and produce personalized recommendations. You’ll learn how to use the most popular recommendation algorithms and see examples of them in action on sites like Amazon and Netflix. Finally, the book covers scaling problems and other issues you’ll encounter as your site grows.

Link: Direct

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #recsys #recommendersystems

@data_science_weekly
DevOps for Data Science by Alex K Gold

In this book, you’ll learn about DevOps conventions, tools, and practices that can be useful to you as a data scientist. You’ll also learn how to work better with the IT/Admin team at your organization, and even how to do a little server administration of your own if you’re pressed into service.

Link: Direct

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #devops #mlops #datascience

@data_science_weekly
Immersive linear algebra by J. Ström, K. Åström, and T. Akenine-Möller

"A picture says more than a thousand words" is a common expression, and for text books, it is often the case that a figure or an illustration can replace a large number of words as well. However, they believe that an interactive illustration can say even more, and that is why they have decided to build their linear algebra book around such illustrations. They believe that these figures make it easier and faster to digest and to learn linear algebra (which would be the case for many other mathematical books as well, for that matter). In addition, they have added some more features (e.g., popup windows for common linear algebra terms) to their book, and they believe that those features will make it easier and faster to read and understand as well.

After using linear algebra for 20 years times three persons, they were ready to write a linear algebra book that they think will make it substantially easier to learn and to teach linear algebra. In addition, the technology of mobile devices and web browsers has improved beyond a certain threshold, so that this book could be put together in a very novel and innovative way (they think). The idea is to start each chapter with an intuitive concrete example that practically shows how the math works using interactive illustrations. After that, the more formal math is introduced, and the concepts are generalized and sometimes made more abstract. They believe it is easier to understand the entire topic of linear algebra with a simple and concrete example cemented into the reader's mind at the beginning of each chapter.

Link: Book

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #math #linearalgebra #algebra

@data_science_weekly