Attention
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper • 2501.08313 • Published • 302
Lizard: An Efficient Linearization Framework for Large Language Models
Paper • 2507.09025 • Published • 19
On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective
Paper • 2507.23632 • Published • 6
Causal Attention with Lookahead Keys
Paper • 2509.07301 • Published • 21
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Paper • 2510.04800 • Published • 37
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 511
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Paper • 2510.04212 • Published • 26
Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
Paper • 2510.03561 • Published • 25
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Paper • 2510.19338 • Published • 117
Kimi Linear: An Expressive, Efficient Attention Architecture
Paper • 2510.26692 • Published • 132
DoPE: Denoising Rotary Position Embedding
Paper • 2511.09146 • Published • 98
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Paper • 2511.20102 • Published • 28
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Paper • 2512.08829 • Published • 21
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Paper • 2601.07832 • Published • 52
Reinforced Attention Learning
Paper • 2602.04884 • Published • 28
Prism: Spectral-Aware Block-Sparse Attention
Paper • 2602.08426 • Published • 37
SLA2: Sparse-Linear Attention with Learnable Routing and QAT
Paper • 2602.12675 • Published • 57