Global, Local, Sparse: Attention Patterns in Long-Context Transformers
The O(n²) complexity of dense (global) attention is impractical for long sequences. Here's what ML engineers need to know about the three dominant patterns: 🧠⚙️
1️⃣ Global (Full Dense)
➖ Every token attends to every token.
➖ A = softmax(QKᵀ / √d) V (minimal code sketch after this list)
➖ Complexity: O(n²d)
➖ Use: Short contexts (<4k tokens) or precise recall tasks. 🎯
➖ Downside: compute grows quadratically with context, and the KV cache must hold every past token. 💥
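A minimal PyTorch sketch of the dense pattern (shapes are illustrative; a fused kernel like FlashAttention computes the same result without materializing the full n×n score matrix):

```python
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    """Full attention: every query scores against every key. q, k, v: (n, d)."""
    scores = q @ k.T / q.shape[-1] ** 0.5   # (n, n) matrix: the O(n^2) cost
    return F.softmax(scores, dim=-1) @ v    # (n, d)

n, d = 1024, 64                             # illustrative sizes
q, k, v = (torch.randn(n, d) for _ in range(3))
out = dense_attention(q, k, v)
```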
2️⃣ Local (Sliding Window) – e.g., Mistral 💪
➖ Tokens attend to a fixed neighborhood (e.g., ±512 tokens).
➖ Complexity: O(n · w), where w is the window size (see the mask sketch below).
➖ Use: Streaming text, audio, DNA. 🎧🧬
➖ Trade-off: linear scaling, but information crosses window boundaries only indirectly, propagating layer by layer (effective receptive field ≈ w × depth).
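A sketch of the sliding-window mask (window size w is a parameter; for clarity this demo still builds the full n×n score matrix, whereas real windowed kernels only compute the O(n · w) band):

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, w=512):
    """Each token attends only to positions within +/- w of itself."""
    n = q.shape[0]
    idx = torch.arange(n)
    band = (idx[None, :] - idx[:, None]).abs() <= w    # (n, n) boolean band
    scores = q @ k.T / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))  # hide out-of-window pairs
    return F.softmax(scores, dim=-1) @ v
```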
3️⃣ Sparse – e.g., BigBird, Longformer 🕸
➖ Pattern: local window + global tokens (e.g., [CLS]) + random/strided links (combined in the mask sketch after this list).
➖ Complexity: O(n · (w + g + r)) ≈ O(n)
➖ Use: Document summarization (5k–16k tokens).
➖ Insight: sparse attention graphs preserve universal approximation as long as the graph's diameter stays bounded; the random links exist to keep the path between any two tokens short.
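And a sketch of a BigBird-style boolean mask, with hypothetical hyperparameters w (window), g (global tokens), and r (random links per query); real implementations use block-sparse kernels rather than a dense mask:

```python
import torch

def bigbird_style_mask(n, w=3, g=2, r=2, seed=0):
    """(n, n) boolean mask: local band + global tokens + random links."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() <= w  # local neighborhood
    mask[:g, :] = True                               # global tokens see all...
    mask[:, :g] = True                               # ...and are seen by all
    rand_keys = torch.randint(0, n, (n, r), generator=gen)
    mask[idx[:, None], rand_keys] = True             # r random keys per query
    return mask  # plug into masked_fill exactly as in the local sketch
```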
Where we're going: static sparsity is losing ground to dynamic routing (Mixture-of-Depths, 2024). And linear-time RNN-style models (Mamba, RWKV) challenge whether we need any static pattern at all. 🤔
https://t.me/MachineLearning9 💡