hanlab.mit.edu
How Attention Sinks Keep Language Models Stable
We discovered why language models fail catastrophically in long conversations: when old tokens are removed to save memory, models produce complete gibberish. We found that models dump massive attention onto the first few tokens as "attention sinks": places to park…
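
The title implies the remedy: keep those first few sink tokens in the KV cache and evict only from the middle, retaining a sliding window of recent context. Below is a minimal sketch of that cache policy, assuming a simple list-based cache; the class name, the 4-token sink count, and the 1024-token window are illustrative assumptions, not the linked post's actual code.

```python
from collections import deque

NUM_SINK_TOKENS = 4   # assumed: first tokens kept permanently as attention sinks
WINDOW_SIZE = 1024    # assumed: how many recent tokens the rolling window holds

class SinkKVCache:
    """Keeps the first few tokens plus a sliding window of recent ones."""

    def __init__(self, num_sinks=NUM_SINK_TOKENS, window=WINDOW_SIZE):
        self.num_sinks = num_sinks
        self.sinks = []                      # permanent: never evicted
        self.window = deque(maxlen=window)   # rolling: oldest entry drops on append

    def append(self, kv_entry):
        # The first few tokens become sinks and stay forever; every later
        # token lives in the bounded sliding window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.window.append(kv_entry)

    def entries(self):
        # Attention is computed over sinks + recent window, so memory stays
        # bounded while the tokens the model parks attention on survive.
        return self.sinks + list(self.window)
```

The `deque(maxlen=...)` does the eviction automatically: memory stays bounded like a plain sliding window, but the sink tokens the model relies on are never removed, which is what keeps generation stable on long inputs.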