Collections
Discover the best community collections!
Collections including paper arxiv:2501.06425

- FAN: Fourier Analysis Networks
  Paper • 2410.02675 • Published • 29
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- Scalable-Softmax Is Superior for Attention
  Paper • 2501.19399 • Published • 24
- EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
  Paper • 2502.09509 • Published • 8

- Nuclear Norm Regularization for Deep Learning
  Paper • 2405.14544 • Published • 1
- Token embeddings violate the manifold hypothesis
  Paper • 2504.01002 • Published • 1
- Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
  Paper • 2403.10476 • Published • 1
- ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
  Paper • 2504.00254 • Published • 1

- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
  Paper • 2501.11873 • Published • 66
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  Paper • 2502.11089 • Published • 166
- MoBA: Mixture of Block Attention for Long-Context LLMs
  Paper • 2502.13189 • Published • 17

- FAST: Efficient Action Tokenization for Vision-Language-Action Models
  Paper • 2501.09747 • Published • 27
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
  Paper • 2501.06842 • Published • 16
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52

- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
  Paper • 2404.08801 • Published • 66
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
  Paper • 2404.07839 • Published • 47
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
  Paper • 2404.05892 • Published • 40
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 148

- Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
  Paper • 2311.11077 • Published • 29
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- LoRA: Low-Rank Adaptation of Large Language Models
  Paper • 2106.09685 • Published • 55
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
  Paper • 2403.03853 • Published • 66

- Evolving Deeper LLM Thinking
  Paper • 2501.09891 • Published • 115
- ProcessBench: Identifying Process Errors in Mathematical Reasoning
  Paper • 2412.06559 • Published • 84
- AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
  Paper • 2412.15084 • Published • 13
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
  Paper • 2501.07301 • Published • 99