Vol Building AGI

https://nips.cc/virtual/2023/events/journal_track_2023

My VAE paper is in the NeurIPS Journal Track. See you in New Orleans!

👍2🔥1

56 views16:32

Vol Building AGI

RNNs are officially back. This paper is such a good read, and the experiments are actually serious. So far we've been training 350M models with our fast LSTM on 6B-token runs.

Recipe for success: use a VERY large hidden state, do not materialize activations early, use CUTLASS, do final runs on 300B tokens.

https://arxiv.org/abs/2312.00752

64 viewsedited 15:39

Vol Building AGI

Ha, selective scan is not using CUTLASS, it's written using cub! https://github.com/state-spaces/mamba/blob/main/csrc/selective_scan/selective_scan_fwd_kernel.cuh

62 views16:30

Vol Building AGI

But what about BERT? Monarch matrices have you covered https://www.youtube.com/live/IS59IwGLvVs?si=3yvBYGsOSx3jU2tE

YouTube

Monarch Mixer: Making Foundation Models More Efficient - Dan Fu | Stanford MLSys #86

Episode 86 of the Stanford MLSys Seminar Series!

Monarch Mixer: Making Foundation Models More Efficient
Speaker: Dan Fu

Abstract:
Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts…

66 views21:58

Vol Building AGI

Got a poster

🔥5

72 views13:52

Vol Building AGI

Word2vec received a test of time award at NeurIPS https://papers.nips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html

83 views01:53

Vol Building AGI

Flash attention was inspired by the block nested loop join from databases, and online softmax by algebraic aggregation (Jim Gray et al., 1997).
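
To make the "algebraic aggregation" point concrete, here is a minimal Python sketch of online softmax. It is not the FlashAttention kernel. The function names (`online_softmax_stats`, `merge`, `softmax_blockwise`) and the block size are made up for illustration, and inputs are assumed to be finite floats. The idea is that each block is summarized by a small state (running max, rescaled sum), and two such states can be merged without revisiting the data.

```python
import math

def online_softmax_stats(xs):
    """Single pass over xs: running max m and running sum d of exp(x - m)."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d

def merge(a, b):
    """Combine two partial (max, sum) states; this is the aggregation step."""
    (ma, da), (mb, db) = a, b
    m = max(ma, mb)
    return m, da * math.exp(ma - m) + db * math.exp(mb - m)

def softmax_blockwise(xs, block=4):
    """Softmax computed from per-block states merged one at a time."""
    state = (float("-inf"), 0.0)  # identity element for merge
    for i in range(0, len(xs), block):
        state = merge(state, online_softmax_stats(xs[i:i + block]))
    m, d = state
    return [math.exp(x - m) / d for x in xs]
```

Because the state is rescaled to the current maximum at every step, large inputs such as 1000.0 do not overflow, and the blocks can be processed in any grouping.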

86 views14:58

Vol Building AGI

Sepp is about to drop xLSTM

91 views17:18

Vol Building AGI

Happy New Year! I have started curating a collection of New Year speeches. Try talking to one!

https://wilab.org.ua/watch/

❤1

90 views20:42

Vol Building AGI

Mamba is the first RNN trained at large scale that speaks Ukrainian. It's a large architectural shift from transformers: when generating sequences, RNNs use constant memory and have linear complexity (down from quadratic!).

RNNs are a recipe for efficient machine learning systems on end-user devices.

I've forked it so you can run it on CPU (tested on Mac) without CUDA dependencies:



git clone https://github.com/proger/mamba-cpu
cd mamba-cpu
pip install -e .
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-130m" --prompt "Мій пес написав цей код на Python і вийшло таке:" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
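
The constant-memory claim can be illustrated with a toy sketch in Python. This is not Mamba's selective scan: it assumes a hypothetical diagonal linear recurrence h_t = a * h_{t-1} + b * x_t, and the names `rnn_step` and `generate_state_sizes` are invented for this example. The list `kv_cache` only stands in for what an attention decoder would have to keep around.

```python
def rnn_step(h, x, a, b):
    """One step of a toy diagonal recurrence: h_t = a * h_{t-1} + b * x_t."""
    return [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]

def generate_state_sizes(xs, a, b):
    """Run the recurrence over xs and record memory use at every step.

    Returns the final state and a list of (rnn_state_size, kv_cache_size).
    """
    h = [0.0] * len(a)   # fixed-size recurrent state
    kv_cache = []        # stand-in for an attention cache: grows every step
    sizes = []
    for x in xs:
        h = rnn_step(h, x, a, b)
        kv_cache.append(x)
        sizes.append((len(h), len(kv_cache)))
    return h, sizes
```

The recurrent state stays the same size at every step, so each new token costs the same amount of work, while the attention stand-in grows by one entry per token and has to be re-read in full on each step.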

👍3

652 viewsedited 11:08

Vol Building AGI

Forwarded from In Tensor We Trust

A new version of mlx, 0.0.7, is out (link to the release).

For those who don't know what it is: it's like torch, but for Apple processors.

The photos show everything that was done.

Support for the safetensors model format gets a big like 👍🏻, because, as you know, safety comes first.

Practical use is simple:

- ML engineering on MacBooks;
- Model inference;
- We can fine-tune models offline on MacBooks at the client's site (100% on-device ML).

86 views09:48

Vol Building AGI

Awni Hannun (author of mlx) is also known as the first author of Deep Speech — the first MLP-LSTM large scale speech recognizer trained with CTC on many GPUs on 10k+ hours of data. The data was not that abundant so the team had to hire people to read books.

82 viewsedited 09:51

Vol Building AGI

My best memory of references to Wirth's work is from my first year at KPI, where I took a course from Prof. Marchenko. He mentioned the book Algorithms + Data Structures = Programs as recommended reading. I hated the course at the time: we were taught to describe algorithms in an arcane graphical syntax and then code them in Turbo Pascal. I rejected Pascal because I thought nobody wrote serious programs in it. I had never been exposed to high-quality Pascal code at the time; all the interesting code I had seen was written in C for various Unix operating systems. I dropped out of school entirely within a year.

67 views18:45

Vol Building AGI

I took Master’s level courses last year and felt pressure to stay less and less grounded in machine code. On paper, changing one symbol turns the entire computation upside down, while doing the same on a computer requires rewriting all of your code. When working through an idea, you first imagine it, lay it out in natural language, and then slowly formalize it.

Turns out this method is called step-wise refinement, a program development technique that Niklaus Wirth, an ETH professor, published in 1971. That’s the paper I’ll be reading today.

http://pascal.hansotten.com/uploads/wirth/Program%20development%20by%20step-wise%20refinement%20jan%201971%20002.pdf

78 viewsedited 18:45

Vol Building AGI

On NVIDIA GPUs, a group of 32 threads is called a *warp*.
Threads in a warp have an efficient lock-free synchronous communication method called a *shuffle*.

In this screenshot, the `__shfl_up_sync` intrinsic is used five times, once for each e from 0 to 4, to send a register value (`acc`) from every lane simultaneously to the neighbour 2**e lanes up, simulating propagation down a binary tree.

The next figure (from Using CUDA Warp-Level Primitives by NVIDIA) illustrates the same concept using a down-shuffle on a mini-warp of 8 threads. A thread inside a warp is also called a "lane".
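
Since the screenshot itself is not reproduced here, the pattern can be sketched as a plain Python simulation of one warp. This is not the kernel from the screenshot. It assumes the values are combined with addition, which yields an inclusive prefix sum, and the helper names `shfl_up` and `warp_inclusive_scan` are invented for this example. `shfl_up` mimics the documented behaviour of `__shfl_up_sync` with a full mask: lane i reads the value held by lane i - delta, and the lowest delta lanes keep their own value.

```python
WARP_SIZE = 32

def shfl_up(values, delta):
    """Simulate an up-shuffle: lane i receives the value of lane i - delta.

    Lanes with i < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]

def warp_inclusive_scan(values):
    """Inclusive prefix sum over one simulated warp using up-shuffles."""
    acc = list(values)
    e = 0
    while (1 << e) < len(acc):      # 5 iterations for 32 lanes
        step = 1 << e               # distance to the 2**e-th neighbour
        shifted = shfl_up(acc, step)
        # Only lanes that actually received a neighbour's value accumulate it.
        acc = [a + s if lane >= step else a
               for lane, (a, s) in enumerate(zip(acc, shifted))]
        e += 1
    return acc
```

For example, `warp_inclusive_scan([1] * WARP_SIZE)` gives each lane its own 1-based index, because after the step with distance 2**e every lane holds the sum of up to 2**(e+1) values ending at itself.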

78 views13:52

Vol Building AGI

93 views13:52

Vol Building AGI

102 views13:52

Vol Building AGI

Optimizing parallel deep learning systems is a bit like navigating Tokyo by public transit

93 viewsedited 11:40

Vol Building AGI

RWKV scaled to 1T tokens seems to beat Mistral, trained on 8T tokens, on some multilingual benchmarks.

Eagle's zero-shot translation into Ukrainian is about on par with Mistral in a 2-shot setting and with Llama 2 fine-tuned on 10k examples.

https://twitter.com/RWKV_AI/status/1751797147492888651

X (formerly Twitter)

RWKV (@RWKV_AI) on X

Introducing Eagle-7B

Based on the RWKV-v5 architecture, bringing into opensource space, the strongest
- multi-lingual model
(beating even mistral)
- attention-free transformer today
(10-100x+ lower inference)

With comparable English performance with…

🔥2

120 viewsedited 12:42

Vol Building AGI

No reason to use transformer decoders any more for LLMs :)

109 views12:45
