๐ŸŒ Global, Local, Sparse: Attention Patterns in Long-Context Transformers

The O(n²) complexity of dense (global) attention is impractical for long sequences. Here's what ML engineers need to know about the three dominant patterns: 🧠⚙️

1๏ธโƒฃ Global (Full Dense) ๐ŸŒ
โžœ Every token attends to every token.
โžœ A = softmax(QKแต€ / โˆšd) V
โžœ Complexity: O(nยฒd)
โžœ Use: Short contexts (<4k) or precise recall tasks. ๐ŸŽฏ
โžœ Downside: KV cache memory explodes. ๐Ÿ’ฅ
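
A minimal PyTorch sketch of the dense case (my own illustration, not any particular library's kernel): a single head, no causal mask or dropout, just softmax(QKᵀ/√d)V.

```python
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    # q, k, v: [batch, seq_len, d_head]
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # [batch, n, n] -> the O(n^2) term
    weights = F.softmax(scores, dim=-1)           # every token weighs every token
    return weights @ v                            # [batch, n, d_head]

x = torch.randn(1, 128, 64)                       # self-attention: q = k = v source
print(dense_attention(x, x, x).shape)             # torch.Size([1, 128, 64])
```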

2๏ธโƒฃ Local (Sliding Window) โ€“ e.g., Mistral ๐ŸชŸ
โžœ Tokens attend to a fixed neighborhood (ยฑ512).
โžœ Complexity: O(n ยท w)
โžœ Use: Streaming text, audio, DNA. ๐ŸŽง๐Ÿงฌ
โžœ Trade-off: Linear scaling but zero long-range mixing between windows. ๐Ÿ”„
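
A sketch of the local pattern as a banded mask (mine, for illustration). This naive version still materializes the full n × n score matrix; real sliding-window kernels avoid that to get the true O(n · w) cost, and the window size here is just an example value.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 512):
    # q, k, v: [batch, n, d]; banded mask version, still O(n^2) memory
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window   # keep j only if |i - j| <= window
    scores = scores.masked_fill(~band, float("-inf"))      # block everything outside the band
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 2048, 64)
out = sliding_window_attention(x, x, x, window=512)
```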

3๏ธโƒฃ Sparse โ€“ e.g., BigBird, Longformer ๐Ÿ•ธ
โžœ Pattern: Local + Global (e.g., [CLS] tokens) + Random/strided.
โžœ Complexity: O(n ยท (w + g + r)) โ‰ˆ O(n)
โžœ Use: Document summarization (5kโ€“16k tokens). ๐Ÿ“
โžœ Insight: Sparse graphs preserve universal approximation if graph diameter is bounded. ๐Ÿ”—
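
A sketch of a BigBird-style boolean mask that unions the three components (local band, global tokens, random links). The window, global-token, and random-link counts below are illustrative defaults, not the paper's settings, and production implementations use block-sparse kernels rather than a dense n × n mask like this.

```python
import torch

def bigbird_style_mask(n: int, window: int = 64, n_global: int = 2,
                       n_random: int = 3, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)
    # local: banded neighborhood |i - j| <= window
    mask = (idx[None, :] - idx[:, None]).abs() <= window
    # global: the first n_global tokens attend to, and are attended by, everyone
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # random: a few extra links per query row
    rand_cols = torch.randint(0, n, (n, n_random), generator=g)
    mask[idx[:, None], rand_cols] = True
    return mask  # [n, n] boolean; True = attention allowed

m = bigbird_style_mask(1024)
print(m.float().mean())  # fraction of allowed pairs, roughly (w + g + r) / n
```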

Where we're going: Static sparsity is losing ground to dynamic routing (Mixture of Depths, 2024). 🚀 And linear, RNN-style sequence models (Mamba, RWKV) question whether we need any static attention pattern at all. 🤔

https://t.me/MachineLearning9