Artem Ryblov’s Data Science Weekly
@artemfisherman’s Data Science Weekly: Elevate your expertise with a standout data science resource each week, carefully chosen for depth and impact.
Long-form content: https://artemryblov.substack.com
Spinning Up in Deep RL by OpenAI

This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning.
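To make "trial and error" concrete, here is a minimal sketch of an epsilon-greedy agent on a three-armed bandit, one of the simplest RL settings. It is a toy illustration and is not taken from Spinning Up: the function name `run_bandit`, the win probabilities, and the hyperparameters are all assumptions chosen for the example, and there is no deep learning involved.

```python
import random

def run_bandit(win_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit (hypothetical toy setup)."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)    # how often each arm was pulled
    values = [0.0] * len(win_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(win_probs))
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(len(win_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the mean reward for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print("estimated values:", [round(v, 2) for v in values])
print("best arm by estimate:", best_arm)
```

The agent never sees the true win probabilities; it only learns from the rewards it collects, which is the trial-and-error loop described above.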

More about the course: https://www.youtube.com/watch?v=fdY7dt3ijgY&t=1s (OpenAI Spinning Up in Deep RL Workshop)

Link: https://spinningup.openai.com/en/latest/index.html

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #rl #reinforcementlearning #deeprl #openai #deeplearning #dl

@data_science_weekly
Thinking Clearly with Data: A Guide to Quantitative Reasoning and Analysis by Ethan Bueno de Mesquita, Anthony Fowler

An introduction to data science or statistics shouldn’t involve proving complex theorems or memorizing obscure terms and formulas, but that is exactly what most introductory quantitative textbooks emphasize. In contrast, Thinking Clearly with Data focuses, first and foremost, on critical thinking and conceptual understanding in order to teach students how to be better consumers and analysts of the kinds of quantitative information and arguments that they will encounter throughout their lives.

Among much else, the book teaches how to assess whether an observed relationship in data reflects a genuine relationship in the world and, if so, whether it is causal; how to make the most informative comparisons for answering questions; what questions to ask others who are making arguments using quantitative evidence; which statistics are particularly informative or misleading; how quantitative evidence should and shouldn’t influence decision-making; and how to make better decisions by using moral values as well as data.

- An ideal textbook for introductory quantitative methods courses in data science, statistics, political science, economics, psychology, sociology, public policy, and other fields
- Introduces the basic toolkit of data analysis―including sampling, hypothesis testing, Bayesian inference, regression, experiments, instrumental variables, differences in differences, and regression discontinuity
- Uses real-world examples and data from a wide variety of subjects
- Includes practice questions and data exercises

Link: https://www.amazon.com/Thinking-Clearly-Data-Quantitative-Reasoning/dp/0691214352

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #datascience #correlation #regression #causation #randomizedexperiments #statistics

@data_science_weekly
Channel name was changed to «Data Science Links»
The Illustrated Machine Learning

The idea is to make the complex world of Machine Learning more approachable through clear and concise illustrations.

The goal is to provide a visual aid for students, professionals, and anyone preparing for a technical interview to better understand the underlying concepts of Machine Learning.

Whether you're just starting out in the field or you're a seasoned professional looking to refresh your knowledge, these illustrations will be a valuable resource on your journey to understanding Machine Learning.

- Machine Learning
  - Categorization
  - Sampling and Resampling
  - Bias/Variance
  - Supervised Learning
  - Unsupervised Learning
  - Hyperparameters Tuning
- Machine Learning Engineering
  - Introduction
  - Before the Project Starts
  - Data Collection and Preparation
- Projective Geometry
  - Introduction
  - Image Formation
  - Structure from Motion
  - Stereo Reconstruction
- Deep Learning Playbook

Link: https://illustrated-machine-learning.github.io/

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #machinelearning #ml #mlsystemdesign #machinelearningsystemdesign #geometry #visualization #illustrated #supervised #unsupervised #dl #deeplearning #bias #variance #biasvariance

@data_science_weekly
How to do a code review by Google

The pages in this section contain recommendations on the best way to do code reviews, based on long experience. All together, they represent one complete document, broken up into many separate sections. You don’t have to read them all, but many people have found it very helpful to themselves and their team to read the entire set.

- The Standard of Code Review
- What to Look For In a Code Review
- Navigating a CL in Review
- Speed of Code Reviews
- How to Write Code Review Comments
- Handling Pushback in Code Reviews

Link: https://google.github.io/eng-practices/review/reviewer/

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #computerscience #cs #codereview #coding #cl #changelist

@data_science_weekly
HarvardX: CS50's Introduction to Artificial Intelligence with Python

This course explores the concepts and algorithms at the foundation of modern artificial intelligence, diving into the ideas that give rise to technologies like game-playing engines, handwriting recognition, and machine translation. Through hands-on projects, students gain exposure to the theory behind graph search algorithms, classification, optimization, machine learning, large language models, and other topics in artificial intelligence as they incorporate them into their own Python programs. By course’s end, students emerge with experience in libraries for machine learning as well as knowledge of artificial intelligence principles that enable them to design intelligent systems of their own.

What you'll learn
- graph search algorithms
- adversarial search
- knowledge representation
- logical inference
- probability theory
- Bayesian networks
- Markov models
- constraint satisfaction
- machine learning
- reinforcement learning
- neural networks
- natural language processing

By the way, it starts today - December 14, 2023.

Links:
- https://www.edx.org/learn/artificial-intelligence/harvard-university-cs50-s-introduction-to-artificial-intelligence-with-python
- https://cs50.harvard.edu/ai/2024/

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #deeplearning #dl #graphs #reinforcementlearning #rl #neuralnetworks #nn #naturallanguageprocessing #nlp

@data_science_weekly
LLM University by Cohere

Their comprehensive curriculum aims to give you a rock-solid foundation in NLP, equipping you with the skills needed to develop your own applications. Whether you want to learn semantic search, generation, classification, embeddings, or any other NLP technique, this is the place for you! We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs), ensuring you can harness the full potential of LLMs. Plus, you'll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own models.

The Curriculum

In this course, you will learn everything about Large Language Models (LLMs), including:
- How do LLMs work?
  Learn about their architecture and their moving pieces, including transformer models, embeddings, similarity, and attention mechanisms.
- What are LLMs useful for?
  Learn about many real-world applications of LLMs, including:
  - Semantic search
  - Text generation
  - Text classification
  - Analyzing text using embeddings
- How can I use LLMs to build and deploy my apps?
  Learn how to use LLMs to build applications. This course will teach you:
  - How to use Cohere's endpoints: Classify, Generate, and Embed.
  - How to build apps, including semantic search models, text generators, etc.
  - (Coming soon...) How to deploy these apps on many platforms.

Link: https://docs.cohere.com/docs/llmu

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #deeplearning #dl #transformers #transformer #llm #largelanguagemodels #largelanguagemodel #textgeneration #semanticsearch #classification #textclassification #embeddings

@data_science_weekly
Prompt Engineering Guide by OpenAI

This guide shares strategies and tactics for getting better results from large language models (sometimes referred to as GPT models) like GPT-4. The methods described here can sometimes be deployed in combination for greater effect. We encourage experimentation to find the methods that work best for you.

Some of the examples demonstrated here currently work only with our most capable model, gpt-4. In general, if you find that a model fails at a task and a more capable model is available, it's often worth trying again with the more capable model.

Link: https://platform.openai.com/docs/guides/prompt-engineering

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #llm #openai #prompts #promptengineering #gpt #gpt3 #gpt4

@data_science_weekly
Channel name was changed to «Data Science Weekly»
Machine Learning Engineering Online Book by Stas Bekman

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of Stas's experience training Large Language Models (LLMs) and VLMs; much of the know-how was acquired while training the open-source BLOOM-176B model in 2022 and the IDEFICS-80B multi-modal model in 2023. Currently, he is developing and training open-source Retrieval Augmented models at Contextual.AI.

Table of Contents
Part 1. Insights
- The AI Battlefield Engineering - What You Need To Know
Part 2. Key Hardware Components
- Accelerator - the work horses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)
- Network - intra-node and inter-node connectivity, calculating bandwidth requirements
- IO - local and distributed disks and filesystems
- CPU - cpus, affinities (WIP)
- CPU Memory - how much CPU memory is enough - the shortest chapter ever.
Part 3. Performance
- Fault Tolerance
- Performance
- Multi-Node networking
- Model parallelism
Part 4. Operating
- SLURM
- Training hyper-parameters and model initializations
- Instabilities
Part 5. Development
- Debugging software and hardware failures
- And more debugging
- Reproducibility
- Tensor precision / Data types
- HF Transformers notes - making small models, tokenizers, datasets, and other tips
Part 6. Miscellaneous
- Resources - LLM/VLM chronicles

Link: https://github.com/stas00/ml-engineering

Navigational hashtags: #armknowledgesharing #armbooks #armrepo
General hashtags: #llm #gpt #gpt3 #gpt4 #ml #engineering #mlsystemdesign #systemdesign #reproducibility #performance

@data_science_weekly
The Incredible PyTorch

This is a curated list of tutorials, projects, libraries, videos, papers, books and anything related to the incredible PyTorch.

Table Of Contents
- Tutorials
- Large Language Models (LLMs)
- Tabular Data
- Visualization
- Explainability
- Object Detection
- Long-Tailed / Out-of-Distribution Recognition
- Activation Functions
- Energy-Based Learning
- Missing Data
- Architecture Search
- Continual Learning
- Optimization
- Quantization
- Quantum Machine Learning
- Neural Network Compression
- Facial, Action and Pose Recognition
- Super resolution
- Synthesizing Views
- Voice
- Medical
- 3D Segmentation, Classification and Regression
- Video Recognition
- Recurrent Neural Networks (RNNs)
- Convolutional Neural Networks (CNNs)
- Segmentation
- Geometric Deep Learning: Graph & Irregular Structures
- Sorting
- Ordinary Differential Equations Networks
- Multi-task Learning
- GANs, VAEs, and AEs
- Unsupervised Learning
- Adversarial Attacks
- Style Transfer
- Image Captioning
- Transformers
- Similarity Networks and Functions
- Reasoning
- General NLP
- Question and Answering
- Speech Generation and Recognition
- Document and Text Classification
- Text Generation
- Text to Image
- Translation
- Sentiment Analysis
- Deep Reinforcement Learning
- Deep Bayesian Learning and Probabilistic Programming
- Spiking Neural Networks
- Anomaly Detection
- Regression Types
- Time Series
- Synthetic Datasets
- Neural Network General Improvements
- DNN Applications in Chemistry and Physics
- New Thinking on General Neural Network Architecture
- Linear Algebra
- API Abstraction
- Low Level Utilities
- PyTorch Utilities
- PyTorch Video Tutorials
- Community
- To be Classified
- Links to This Repository
- Contributions

Link: The Incredible PyTorch (repository)

Navigational hashtags: #armknowledgesharing #armrepo
General hashtags: #dl #deeplearning #pytorch

@data_science_weekly
What are embeddings? by Vicki Boykis

Over the past decade, embeddings — numerical representations of machine learning features used as input to deep learning models — have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important.

Google’s Word2Vec paper made an important step in moving from simple statistical representations to semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.
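As a rough illustration of the gap between simple encodings and dense embeddings, the sketch below compares one-hot vectors with small dense vectors using cosine similarity. The three-word vocabulary and the 2-d dense values are made up by hand for this example; they are not learned by any model and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vocab = ["cat", "dog", "car"]

# One-hot encoding: each word gets its own axis, so no two words are related.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hand-made dense vectors, purely illustrative (an assumption, not model output).
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0: one-hot vectors are orthogonal
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot vectors every pair of distinct words is equally unrelated, while dense vectors can place related words close together, which is what makes them useful as inputs to downstream models.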

Link: https://vickiboykis.com/what_are_embeddings/index.html

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #dl #deeplearning #pytorch #embeddings #tfidf #svd #pca #word2vec #cbow #skipgram #bert #gpt #llm #transformers

@data_science_weekly
The CL (changelist) author’s guide to getting through code review by Google

The pages in this section contain best practices for developers going through code review. These guidelines should help you get through reviews faster and with higher-quality results. You don’t have to read them all, but they are intended to apply to every Google developer, and many people have found it helpful to read the whole set.

- Writing Good CL Descriptions
- Small CLs
- How to Handle Reviewer Comments

Link: https://google.github.io/eng-practices/review/developer/

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #git #commit #pr #changelist #cl #review #pullrequest

@data_science_weekly
Google Machine Learning Education

Learn to build ML products with Google's Machine Learning Courses.

Foundational courses
The foundational courses cover machine learning fundamentals and core concepts. They recommend taking them in the order below.

1. Introduction to Machine Learning
A brief introduction to machine learning.
2. Machine Learning Crash Course
A hands-on course to explore the critical basics of machine learning.
3. Problem Framing
A course to help you map real-world problems to machine learning solutions.
4. Data Preparation and Feature Engineering
An introduction to preparing your data for ML workflows.
5. Testing and Debugging
Strategies for testing and debugging machine learning models and pipelines.

Advanced Courses
The advanced courses teach tools and techniques for solving a variety of machine learning problems. The courses are structured independently. Take them based on interest or problem domain.

- Decision Forests
Decision forests are an alternative to neural networks.
- Recommendation Systems
Recommendation systems generate personalized suggestions.
- Clustering
Clustering is a key unsupervised machine learning strategy to associate related items.
- Generative Adversarial Networks
GANs create new data instances that resemble your training data.
- Image Classification
Is that a picture of a cat or is it a dog?
- Fairness in Perspective API
Hands-on practice debugging fairness issues.

Guides
Their guides offer simple step-by-step walkthroughs for solving common machine learning problems using best practices.

- Rules of ML
Become a better machine learning engineer by following these machine learning best practices used at Google.
- People + AI Guidebook
This guide assists UXers, PMs, and developers in collaboratively working through AI design topics and questions.
- Text Classification
This comprehensive guide provides a walkthrough to solving text classification problems using machine learning.
- Good Data Analysis
This guide describes the tricks that an expert data analyst uses to evaluate huge data sets in machine learning problems.
- Deep Learning Tuning Playbook
This guide explains a scientific way to optimize the training of deep learning models.

Link: https://developers.google.com/machine-learning?hl=en

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #machinelearning #ml #google #course #courses #featureengineering #recsys #clustering #gan

@data_science_weekly
Supervised Machine Learning for Science. How to stop worrying and love your black box by Christoph Molnar & Timo Freiesleben

Machine learning has revolutionized science, from folding proteins and predicting tornadoes to studying human nature. While science has always had an intimate relationship with prediction, machine learning amplified this focus. But can this hyper-focus on prediction models be justified? Can a machine learning model be part of a scientific model? Or are we on the wrong track?

In this book, the authors explore and justify supervised machine learning in science. However, a naive application of supervised learning won’t get you far because machine learning in raw form is unsuitable for science. After all, it lacks interpretability, uncertainty quantification, causality, and many more desirable attributes. Yet, we already have all the puzzle pieces needed to improve machine learning, from incorporating domain knowledge and ensuring the representativeness of the training data to creating robust, interpretable, and causal models. The problem is that the solutions are scattered everywhere.

In this book, the authors bring together the philosophical justification and the solutions that make supervised machine learning a powerful tool for science.

The book consists of two parts:
- Part 1 discusses the relationship between science and machine learning.
- Part 2 addresses the shortcomings of supervised machine learning.

Link: https://ml-science-book.com/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearning #ml #science #supervised

@data_science_weekly
Large Language Model Course

The LLM course is divided into three parts:

🧩 LLM Fundamentals covers essential knowledge about mathematics, Python, and neural networks.
🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques.
👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.

Links:
- Direct link
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #llm #llms #largelanguagemodel #largelanguagemodels #transformer #transformers #deeplearning #dl #nlp #naturallanguageprocessing

@data_science_weekly
Exceptional Resources for Data Science Interview Preparation. Part 1: Live Coding

In this article, we will understand what a live coding interview is and how to prepare for it.

This blog post will primarily be useful to Data Scientists and ML engineers, while some sections, such as Algorithms and Data Structures, will suit any IT specialist who has to go through a live coding round.

Table of contents
- Preparing for an Algorithmic Interview
- Resources
- Algorithms and Data Structures
- Programming in Python
- Solving a Practical Data Science Problem
- Hybrid
- Learning How to Learn
- Let’s sum it up
- What’s next?

NB:
I'm the author of the article.
It was initially published in Russian (on habr.com). I then removed the Russian-language resources, added more English ones to make up for them, and published it on medium.com.
So, I recommend the Russian version to Russian speakers and the English version to English speakers. Both will benefit from starring the repository, which will be maintained and updated as new resources become available.

Links:
- Medium (eng)
- Habr (rus)

Navigational hashtags: #armknowledgesharing #armarticles
General hashtags: #interview #interviewpreparation #livecoding #leetcode #algorithms #algorithmsdatastructures #datastructures #python #sql #kaggle

@data_science_weekly
System Design
Learn how to design systems at scale and prepare for system design interviews

What is system design?
System design is the process of defining the architecture, interfaces, and data for a system that satisfies specific requirements. System design meets the needs of your business or organization through coherent and efficient systems. It requires a systematic approach to building and engineering systems. A good system design requires us to think about everything, from infrastructure all the way down to the data and how it's stored.

Table of contents

- Getting Started
What is system design?
- Chapter I
IP, OSI Model, TCP and UDP, Domain Name System (DNS), Load Balancing, Clustering, Caching, Content Delivery Network (CDN), Proxy, Availability, Scalability, Storage
- Chapter II
Databases and DBMS, SQL databases, NoSQL databases, SQL vs NoSQL databases, Database Replication, Indexes, Normalization and Denormalization, ACID and BASE consistency models, CAP theorem, PACELC Theorem, Transactions, Distributed Transactions, Sharding, Consistent Hashing, Database Federation
- Chapter III
N-tier architecture, Message Brokers, Message Queues, Publish-Subscribe, Enterprise Service Bus (ESB), Monoliths and Microservices, Event-Driven Architecture (EDA), Event Sourcing, Command and Query Responsibility Segregation (CQRS), API Gateway, REST, GraphQL, gRPC, Long polling, WebSockets, Server-Sent Events (SSE)
- Chapter IV
Geohashing and Quadtrees, Circuit breaker, Rate Limiting, Service Discovery, SLA, SLO, SLI, Disaster recovery, Virtual Machines (VMs) and Containers, OAuth 2.0 and OpenID Connect (OIDC), Single Sign-On (SSO), SSL, TLS, mTLS
- Chapter V
System Design Interviews, URL Shortener, WhatsApp, Twitter, Netflix, Uber
- Appendix
Next Steps, References

Links:
- Direct link to the site with the course
- Direct link to the repository for the course
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armcourses
General hashtags: #systemdesign

@data_science_weekly
Prompt Engineering Guide

Generative AI is the world's hottest buzzword, and they have created the most comprehensive (and free) guide on how to use it. This course is tailored to non-technical readers, who may not have even heard of AI, making it the perfect starting point if you are new to Generative AI and Prompt Engineering. Technical readers will find valuable insights within their later modules.

Generative AI refers to tools that can be used to create new content such as articles or images, just like humans can. It is expected to significantly change the way we work (read: your job may be affected). With so much buzz floating around about Generative AI (Gen AI) and Prompt Engineering (PE), it is hard to know what to believe.

They have scoured the internet to find the best techniques and tools for their 1.3 Million readers from companies like OpenAI, Brex, and Deloitte. They are constantly refining their guide, to ensure that they provide you with the latest information.

Links:
- Direct link to the site with the guide
- Content Guide link
- Topic Guide link

Navigational hashtags: #armknowledgesharing #armtutorials
General hashtags: #promptengineering #prompt #prompting #genai #generativeai

@data_science_weekly
Designing Machine Learning Systems by Chip Huyen

Machine learning systems are both complex and unique. Complex because they consist of many different components and involve many different stakeholders. Unique because they're data dependent, with data varying wildly from one use case to the next. In this book, you'll learn a holistic approach to designing ML systems that are reliable, scalable, maintainable, and adaptive to changing environments and business requirements.

Author Chip Huyen, co-founder of Claypot AI, considers each design decision--such as how to process and create training data, which features to use, how often to retrain models, and what to monitor--in the context of how it can help your system as a whole achieve its objectives. The iterative framework in this book uses actual case studies backed by ample references.

This book will help you tackle scenarios such as:
- Engineering data and choosing the right metrics to solve a business problem
- Automating the process for continually developing, evaluating, deploying, and updating models
- Developing a monitoring system to quickly detect and address issues your models might encounter in production
- Architecting an ML platform that serves across use cases
- Developing responsible ML systems

Link: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

Navigational hashtags: #armknowledgesharing #armbooks
General hashtags: #machinelearningsystemdesign #systemdesign #machinelearning #ml #designingmachinelearningsystems

@data_science_weekly