Global, Local, Sparse: Attention Patterns in Long-Context Transformers
The O(n²) complexity of dense (global) attention is impractical for long sequences. Here's what ML engineers need to know about the three dominant patterns: 🧠⚙️
1️⃣ Global (Full Dense)
➖ Every token attends to every token.
➖ A = softmax(QKᵀ / √d) V (minimal code sketch after this list)
➖ Complexity: O(n²d)
➖ Use: Short contexts (<4k tokens) or precise recall tasks. 🎯
➖ Downside: compute grows quadratically with context, and the KV cache must hold every past token. 💥
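A minimal PyTorch sketch of the dense pattern (shapes are illustrative; a fused kernel like FlashAttention computes the same result without materializing the full n×n score matrix):

```python
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    """Full attention: every query scores against every key. q, k, v: (n, d)."""
    scores = q @ k.T / q.shape[-1] ** 0.5   # (n, n) matrix: the O(n^2) cost
    return F.softmax(scores, dim=-1) @ v    # (n, d)

n, d = 1024, 64                             # illustrative sizes
q, k, v = (torch.randn(n, d) for _ in range(3))
out = dense_attention(q, k, v)
```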
2️⃣ Local (Sliding Window) – e.g., Mistral 💪
➖ Tokens attend to a fixed neighborhood (e.g., ±512 tokens).
➖ Complexity: O(n · w), where w is the window size (see the mask sketch below).
➖ Use: Streaming text, audio, DNA. 🎧🧬
➖ Trade-off: linear scaling, but information crosses window boundaries only indirectly, propagating layer by layer (effective receptive field ≈ w × depth).
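A sketch of the sliding-window mask (window size w is a parameter; for clarity this demo still builds the full n×n score matrix, whereas real windowed kernels only compute the O(n · w) band):

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, w=512):
    """Each token attends only to positions within +/- w of itself."""
    n = q.shape[0]
    idx = torch.arange(n)
    band = (idx[None, :] - idx[:, None]).abs() <= w    # (n, n) boolean band
    scores = q @ k.T / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))  # hide out-of-window pairs
    return F.softmax(scores, dim=-1) @ v
```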
3️⃣ Sparse – e.g., BigBird, Longformer 🕸
➖ Pattern: local window + global tokens (e.g., [CLS]) + random/strided links (combined in the mask sketch after this list).
➖ Complexity: O(n · (w + g + r)) ≈ O(n)
➖ Use: Document summarization (5k–16k tokens).
➖ Insight: sparse attention graphs preserve universal approximation as long as the graph's diameter stays bounded; the random links exist to keep the path between any two tokens short.
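And a sketch of a BigBird-style boolean mask, with hypothetical hyperparameters w (window), g (global tokens), and r (random links per query); real implementations use block-sparse kernels rather than a dense mask:

```python
import torch

def bigbird_style_mask(n, w=3, g=2, r=2, seed=0):
    """(n, n) boolean mask: local band + global tokens + random links."""
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() <= w  # local neighborhood
    mask[:g, :] = True                               # global tokens see all...
    mask[:, :g] = True                               # ...and are seen by all
    rand_keys = torch.randint(0, n, (n, r), generator=gen)
    mask[idx[:, None], rand_keys] = True             # r random keys per query
    return mask  # plug into masked_fill exactly as in the local sketch
```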
Where we're going: static sparsity is losing ground to dynamic routing (Mixture-of-Depths, 2024). And linear-time RNN-style models (Mamba, RWKV) challenge whether we need any static pattern at all. 🤔
https://t.me/MachineLearning9 💡