the last neural cell

ML papers | 01-13 June 2023

💎

Video + Text

Probabilistic Adaptation of Text-to-Video Models

What: adapt a large pretrained text-to-video model to a small set of domain-specific videos.

Complicated but interesting. You can adapt a pretrained diffusion model to your domain by training only a small additional block.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

What: fine-tune an LLM to understand video and audio.

Uses Q-Formers to extract audio and video features, then feeds them into a pretrained LLaMA model.

🧬

Diffusion

Iterative α-(de)Blending: a Minimalist Deterministic Diffusion Model

What: proposes a simple implementation of, and intuition for, diffusion models.

A good starting point for diving into the field and trying it on your own data.
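To give a feel for how small the method is, here is a minimal sketch of the sampling loop as I understand it from the paper: blend noise x0 and data x1 as x_alpha = (1 - alpha) * x0 + alpha * x1, train a network to predict x1 - x0, then step alpha from 0 to 1. The trained network is replaced here by a closed-form `oracle_direction`, which is an assumption valid only for a toy "dataset" consisting of a single point; all function names are mine.

```python
def blend(x0, x1, alpha):
    """Linear alpha-blend between a noise sample x0 and a data sample x1."""
    return (1.0 - alpha) * x0 + alpha * x1


def oracle_direction(x_alpha, alpha, target):
    """Stand-in for the trained network D(x_alpha, alpha) ~ x1 - x0.

    Toy assumption: the data distribution is the single point `target`,
    so x1 - x0 can be solved exactly from x_alpha.
    """
    return (target - x_alpha) / (1.0 - alpha)


def deblend(x0, direction_fn, num_steps=50):
    """Deterministic sampling: move alpha from 0 to 1 in equal steps."""
    x = x0
    for t in range(num_steps):
        alpha = t / num_steps
        x = x + direction_fn(x, alpha) / num_steps
    return x


target = 2.5
sample = deblend(-1.0, lambda x, a: oracle_direction(x, a, target))
```

With a real dataset you would swap `oracle_direction` for a network trained with a squared error against x1 - x0 on random (x0, x1, alpha) triples.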

💎

Audio Transformers

Simple and Controllable Music Generation

What: proposes a decoder for text-to-audio generation that works on latent audio features.

They use VQ (vector quantization). Check it out if you haven't heard of it.
It represents data with a limited number of vectors.
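If VQ is new to you, the core step is just a nearest-neighbour lookup in a codebook. The sketch below is a generic illustration, not the paper's tokenizer; the 2-D codebook and feature values are made up, and real codebooks are learned.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook with four entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

# Each continuous feature becomes one discrete token id.
tokens = [quantize(f, codebook)[0] for f in features]
```

The token ids are what a language-model-style decoder then predicts instead of raw audio.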

💎 If you like this format, please let us know in the comments.
#digest


3.18K views | Alexander Kovalev, edited 18:01


🧬

Tasty papers | 13-20 June 2023

Multimodal

🟣LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Adds visual information to an LLM using trainable adapters.

Extends LLaMA-Adapter V1 to vision.
+ Applies early fusion of visual tokens.
+ Adds calibration of the LLM's norm and bias parameters.
+ Fine-tunes on an image-text dataset.

Audio

🟣High-Fidelity Audio Compression with Improved RVQGAN

Compresses natural audio into discrete tokens with the VQ technique.

Trains a universal compression model on all kinds of audio: speech, music, noise.
+ adds vector quantization.
+ adds an adversarial (GAN) loss.
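The "RVQ" in the title stands for residual vector quantization: several codebooks are applied in sequence, each one encoding what the previous stage left over. A minimal sketch under that reading, with made-up two-stage codebooks (the real model learns them jointly with the encoder and decoder):

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector`."""
    def dist(i):
        return sum((v - c) ** 2 for v, c in zip(vector, codebook[i]))
    return min(range(len(codebook)), key=dist)


def residual_quantize(vector, codebooks):
    """Quantize stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction


# Hypothetical codebooks: a coarse stage followed by a finer one.
coarse = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)]
indices, approx = residual_quantize((1.2, 0.9), [coarse, fine])
```

Each extra stage costs one more token per frame and shrinks the reconstruction error, which is how such codecs trade bitrate for quality.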

🟣Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Audio generative "diffusion" model trained on 50k hours of data.

Uses Flow Matching: similar to diffusion, but better ✌
Trained in a masked setting with context information. The model can do speech synthesis, noise removal, and content editing.

Neuro

🟢Decoding and synthesizing tonal language speech from brain activity

Decodes tonal language from ECoG data with CNN-LSTM models.

Adapts a multi-stream model -> looks unnecessarily complicated.
The recorded datasets are small: about 10 minutes per patient in total, for 8 different syllables.

#digest


2.02K views | Alexander Kovalev, edited 16:54


🧬

Tasty AI papers | 01-31 July 2024

💎

Vision models

Genie: Generative Interactive Environments

What: learns latent actions from game videos alone.
- predicts future frames from previous frames and latent actions.
- the latent actions are trained to help the model transition between frames.
- simply lets the AI model figure out the commands by itself.

SAM 2: Segment Anything in Images and Videos

What: SAM now works well on videos.
- they annotated a large video dataset.
- they added a memory block to keep predicted masks temporally consistent.

💎

General

Mixture of A Million Experts

What: scales MoE to a very large number of experts.
- stores low-rank approximations of experts.
- works better than a dense FFN.
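A rough sketch of the idea, under my own simplifications: each expert is shrunk to a single hidden neuron (a rank-1 map), and only the top-k experts by key score are evaluated. As far as I understand, the paper uses product-key retrieval to find the top-k efficiently; here it is replaced by brute-force scoring, and all names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_expert_layer(x, keys, down, up, top_k=2):
    """Route x to the top_k experts whose keys score highest against x.

    Expert i is one hidden neuron: relu(down[i] . x) * up[i].
    """
    scores = [dot(key, x) for key in keys]
    chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores only.
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(up[0])
    for i, w in zip(chosen, weights):
        hidden = max(0.0, dot(down[i], x))
        out = [o + (w / total) * hidden * u for o, u in zip(out, up[i])]
    return chosen, out


# Four toy experts in 2-D; a real layer would hold a huge pool of them.
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
down = keys
up = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
chosen, out = sparse_expert_layer((2.0, 1.0), keys, down, up)
```

Because each expert is tiny, the parameter count can grow with the pool size while the compute per token stays tied to top_k.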

The Road Less Scheduled

What: proposes a schedule-free optimizer.
- one more thing that beats AdamW.
- easy to drop into your training pipeline.
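For intuition, here is a minimal sketch of the SGD variant of the schedule-free update as I read it from the paper: a plain SGD iterate z, a running average x that you evaluate, and an interpolated point y where the gradient is taken. The scalar toy objective and the hyperparameter values are my own choices; the AdamW variant and the official implementation have more details than this.

```python
def schedule_free_sgd(grad_fn, w0, lr=0.1, beta=0.9, steps=2000):
    """Schedule-free SGD on a scalar parameter with a constant learning rate.

    z: plain SGD iterate.
    x: running average of z (the point used for evaluation).
    y: interpolation between z and x where the gradient is computed.
    """
    z = x = w0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x
        z = z - lr * grad_fn(y)
        c = 1.0 / (t + 1)          # uniform averaging weight
        x = (1.0 - c) * x + c * z
    return x


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = schedule_free_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Note that there is no learning-rate decay anywhere: the averaging plays the role a schedule would normally play, so the total number of steps does not need to be known in advance.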

🔘

Diffusion

Rolling Diffusion Models

What: incorporates temporal information into the generative diffusion process for videos.
- denoises and predicts the next frames at the same time.
- hard math, but the idea is interesting.

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

What: a step toward merging local and global planning.

Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories.

#digest


736 views | Aleksandr Kovalev, edited 17:02


Tasty Neuro Papers | 01-31 July 2024

Brain decoding

🔘

Towards a "universal translator" for neural dynamics at single-cell, single-spike resolution

In short: a pretrained transformer for spikes.

- Single-spike resolution, no rate coding. (Actually there is some, but with small bins.)
- They introduce multi-task masking (MtM): the model learns by alternately masking and reconstructing activity across time, across neurons, and across brain regions.
- A learnable token tells the model which masking scheme is currently in use.
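To make the masking schemes concrete, here is a small sketch of how such masks could be generated for a (time bins x neurons) activity matrix. It only illustrates the idea described above; the scheme names, the prompt-token strings, the masking ratio, and the region labels are all hypothetical and not taken from the paper.

```python
import random


def make_mask(scheme, n_time, n_neurons, ratio=0.5, regions=None, rng=None):
    """Return (prompt_token, mask); mask[t][n] is True for hidden entries."""
    rng = rng or random.Random(0)
    mask = [[False] * n_neurons for _ in range(n_time)]
    if scheme == "time":        # hide whole time bins
        for t in rng.sample(range(n_time), max(1, int(ratio * n_time))):
            mask[t] = [True] * n_neurons
    elif scheme == "neuron":    # hide whole neurons
        hidden = rng.sample(range(n_neurons), max(1, int(ratio * n_neurons)))
        for row in mask:
            for n in hidden:
                row[n] = True
    elif scheme == "region":    # hide every neuron of one brain region
        if regions is None:
            raise ValueError("the region scheme needs per-neuron region labels")
        target = rng.choice(sorted(set(regions)))
        for row in mask:
            for n, region in enumerate(regions):
                if region == target:
                    row[n] = True
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # The prompt token tells the model which scheme produced this mask.
    return f"[MASK_{scheme.upper()}]", mask


regions = ["CA1", "CA1", "VISp", "VISp"]   # made-up labels for 4 neurons
rng = random.Random(0)
batches = [make_mask(s, n_time=10, n_neurons=4, regions=regions, rng=rng)
           for s in ("time", "neuron", "region")]
```

During training one would cycle through the schemes, prepend the prompt token to the input, and compute the reconstruction loss only on the masked entries.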

🔘

Semantic encoding during language comprehension at single-cell resolution

In short: they found neurons that respond to specific word meanings during speech comprehension.

- They respond to specific semantic categories of words (food, animals, etc.).
- The activity of these neurons depends on the sentence context, not just on how the words sound.
- A neural network can predict a word's meaning from the activity of these cells.

Single-neuronal recordings were obtained from the prefrontal cortex of the language-dominant hemisphere in a region centred along the left posterior middle frontal gyrus

Thinking out loud. We will soon review the first paper (universal translator) and compare it with the previous "foundation" model, POYO.
Personally, I like the trend of using every event (each spike). With rate coding we get a delay, for example, and cannot capture some fast-changing things (saccades).

Collect more data and the model will figure it out -> so far this works almost everywhere.

#digest


1.02K views | Aleksandr Kovalev, edited 16:30


Transformers for brain decoding | foundational models

I want to talk about the models currently used to decode brain signals (spikes, LFP). We will look at how they are trained on data from different sessions and animals, which pretraining approaches are used, and which architectures are popular. I picked three interesting papers; here is a short note on each.

🔘

POYO-1: A Unified, Scalable Framework for Neural Population Decoding
A Perceiver IO where the tokens are individual spikes; trained with supervision on different animals across different tasks.

🔘

Neural Data Transformer 2: Multi-context Pretraining for Neural Spiking Activity
They adapted the masked autoencoder (MAE) and also add information about the session and the subject. MAE is cool and simple. Here is a link to our review

🔘

Towards a "universal translator" for neural dynamics at single-cell, single-spike resolution
They extended the previous approach with a smarter pretraining scheme and started adding tokens that indicate the masking type. They showed that this improves results.

What trend do we see? Multi-task, multi-subject, multi-session, multi-multi. Transformers go brr... In short, people take a transformer and want it to solve everything for everyone.

There will be a post about each model. We will go through exactly what data was used, how it was preprocessed, which model was used, and what task was solved.

The data differs everywhere, so a side-by-side comparison is still hard. All of this is mainly an overview of how one can work with such data. So take inspiration for your own work :)

Just my thoughts

A transformer works with vectors. So to feed in our neural data, we first need to turn it into vectors. But what should count as a token for neural activity? Individual spikes, binned activity, a group of neurons, etc. It is an open question, and there are different options. But what if we compress the information first and use more useful tokens from our "compressor"? One example is VQ-VAE, which is now used for all audio tasks, and for images and video too. Why should neuro be any worse? :)
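Two of the tokenization options mentioned above, side by side, as a minimal sketch. The spike list, the bin size, and the helper name `bin_spikes` are made up for illustration; times are integer milliseconds to keep the binning exact.

```python
def bin_spikes(spikes, n_neurons, bin_ms, duration_ms):
    """Turn (time_ms, neuron_id) events into a [n_bins][n_neurons] count matrix."""
    n_bins = duration_ms // bin_ms
    counts = [[0] * n_neurons for _ in range(n_bins)]
    for time_ms, neuron in spikes:
        b = time_ms // bin_ms
        if 0 <= b < n_bins:
            counts[b][neuron] += 1
    return counts


# Hypothetical recording: (spike time in ms, neuron id).
spikes = [(3, 0), (12, 1), (14, 1), (27, 0), (38, 2)]

# Option 1: one token per spike; the sequence length follows the spike count.
spike_tokens = list(spikes)

# Option 2: one token per time bin, a vector of per-neuron spike counts.
bin_tokens = bin_spikes(spikes, n_neurons=3, bin_ms=10, duration_ms=40)
```

Per-spike tokens keep exact timing but make sequences long for large populations; binned vectors give a fixed sequence length but blur anything faster than the bin width. A learned compressor would sit on top of either representation.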

P.S. If you know other interesting papers that work with intracortical activity, please send them over. We will review those too :)

#digest


826 views | Aleksandr Kovalev, 14:33


tasty transformer papers - september 2024

Emu3: Next-Token Prediction is All You Need
what: one transformer decoder to generate videos.
- uses a VQ-VAE to tokenize images.
- also, [EOL] and [EOF] are inserted into the vision tokens to denote line breaks and frame breaks.
- training sample:
[BOS] {caption text} [SOV] {meta text} [SOT] {vision tokens} [EOV] [EOS].
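a small sketch of how such a training sample could be assembled from the template above. the placement of [EOL] after every row and [EOF] after every frame, the space-separated layout, and the example caption, meta text, and token ids are my assumptions, not the paper's exact format.

```python
def format_vision_tokens(frames):
    """Flatten frames -> rows -> token ids, marking row and frame ends."""
    parts = []
    for frame in frames:
        for row in frame:
            parts.extend(str(token) for token in row)
            parts.append("[EOL]")
        parts.append("[EOF]")
    return " ".join(parts)


def build_sample(caption, meta, frames):
    """Fill the [BOS] ... [EOS] template with text and vision tokens."""
    vision = format_vision_tokens(frames)
    return f"[BOS] {caption} [SOV] {meta} [SOT] {vision} [EOV] [EOS]"


# One hypothetical 2x2 frame of VQ token ids.
sample = build_sample("a red ball", "1 frame, 2x2", [[[5, 9], [3, 7]]])
```

once everything is one flat token sequence like this, plain next-token prediction covers text and video with the same loss.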

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
what: mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction
- parallel decoding: 8 tokens [t-1] -> 8 tokens [t]
- audio response as fast as text generation.

Were RNNs All We Needed?
what: don't worry too much about architecture; they seem to work similarly.
- makes rnns faster and shows that performance is similar.
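as i understand the paper, the speed-up comes from removing the previous hidden state from the gates, so the recurrence becomes linear in h and can be trained with a parallel scan. below is a scalar sketch of that minGRU-style update; the loop is sequential for readability, and the weights are hypothetical scalars rather than learned matrices.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru(inputs, w_z, b_z, w_h, b_h, h0=0.0):
    """Scalar minGRU-style recurrence: gate and candidate depend only on x_t."""
    h = h0
    states = []
    for x in inputs:
        z = sigmoid(w_z * x + b_z)     # update gate, no h_{t-1} inside
        candidate = w_h * x + b_h      # candidate state, no tanh
        h = (1.0 - z) * h + z * candidate
        states.append(h)
    return states


# With w_z = b_z = 0 the gate is always 0.5, so h is a simple moving blend.
states = min_gru([2.0, 4.0], w_z=0.0, b_z=0.0, w_h=1.0, b_h=0.0)
```

since z and the candidate can be computed for all time steps at once, only the final blend is recurrent, and that part is what a scan parallelizes.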

my thoughts
- data is key in model training. whether you're using transformers, rnns, next-token prediction, or diffusion models is becoming less important.
- predicting text and audio in parallel is promising for next-gen brain-computer interfaces.
- focus less on architecture and more on the quality of data and the objectives you want to optimize.

#digest


386 views | Aleksandr Kovalev, edited 14:48


tasty diffusion papers - september 2024

OmniGen: Unified Image Generation
what: one transformer for the text-to-image diffusion model.
- rectified flow.
- multimodal condition: text and image.
- one model processes all context and does diffusion steps.

Diffusion Policy Policy Optimization
what: set of best practices for fine-tuning diffusion-based policies in continuous control and robot learning tasks.

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
what: a modification that helps the model generalize better from few samples.

my thoughts
- i like how the AI community is trying to simplify everything. surprisingly, sometimes it works well. for example, rectified flow is a simplified version of diffusion.
- rl + diffusion => next step in brain stimulation?
diffusion models could mimic brain patterns for smoother stimulation. with RL, they’d adapt in real-time, making treatments more precise and personalized. shifting from rigid protocols to dynamic brain interventions.

#digest


478 views | Aleksandr Kovalev, edited 04:44
