Artem Ryblov’s Data Science Weekly

Machine Learning Engineering Online Book by Stas Bekman

An open collection of methodologies to help with successful training of large language models and multi-modal models.

This is a technical material suitable for LLM/VLM training engineers and operators. That is the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLM) (and VLMs); a lot of the know-how Stas acquired while training the open-source BLOOM-176B model in 2022 and IDEFICS-80B multi-modal model in 2023. Currently, he is working on developing/training open-source Retrieval Augmented models at Contextual.AI.

Table of Contents
Part 1. Insights
- The AI Battlefield Engineering - What You Need To Know
Part 2. Key Hardware Components
- Accelerator - the work horses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)
- Network - intra-node and inter-node connectivity, calculating bandwidth requirements
- IO - local and distributed disks and filesystems
- CPU - cpus, affinities (WIP)
- CPU Memory - how much CPU memory is enough - the shortest chapter ever.
Part 3. Performance
- Fault Tolerance
- Performance
- Multi-Node networking
- Model parallelism
Part 4. Operating
- SLURM
- Training hyper-parameters and model initializations
- Instabilities
Part 5. Development
- Debugging software and hardware failures
- And more debugging
- Reproducibility
- Tensor precision / Data types
- HF Transformers notes - making small models, tokenizers, datasets, and other tips
Part 6. Miscellaneous
- Resources - LLM/VLM chronicles

Link: https://github.com/stas00/ml-engineering

Navigational hashtags: #armknowledgesharing #armbooks #armrepo
General hashtags: #llm #gpt #gpt3 #gpt4 #ml #engineering #mlsystemdesign #systemdesign #reproducibility #performance

@data_science_weekly

529 viewsArtem Ryblov, edited 08:57

About

Blog

Apps

Platform