-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Paper • 2604.04746 • Published • 72 -
ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation
Paper • 2602.00744 • Published • 12 -
Language-Guided Music Recommendation for Video via Prompt Analogies
Paper • 2306.09327 • Published • 10
Collections
Discover the best community collections!
Collections including paper arxiv:2604.04746
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper • 2601.03233 • Published • 180 -
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Paper • 2601.07832 • Published • 53 -
Motion Attribution for Video Generation
Paper • 2601.08828 • Published • 72 -
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
Paper • 2601.19895 • Published • 27
-
CoLLM: A Large Language Model for Composed Image Retrieval
Paper • 2503.19910 • Published • 15 -
LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Paper • 2503.21541 • Published • 1 -
HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
Paper • 2504.03536 • Published • 13 -
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
Paper • 2504.04842 • Published • 35
-
Contrastive Decoding Improves Reasoning in Large Language Models
Paper • 2309.09117 • Published • 40 -
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Paper • 2310.08491 • Published • 57 -
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Paper • 2411.04282 • Published • 37 -
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 25
-
AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation
Paper • 2602.17100 • Published • 4 -
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
Paper • 2603.01059 • Published • 1 -
Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models
Paper • 2603.00618 • Published -
Heterogeneous Agent Collaborative Reinforcement Learning
Paper • 2603.02604 • Published • 197
-
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10 -
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 79 -
RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 73 -
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 37
-
BrushEdit: All-In-One Image Inpainting and Editing
Paper • 2412.10316 • Published • 37 -
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Paper • 2412.11815 • Published • 27 -
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Paper • 2412.09611 • Published • 11 -
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Paper • 2412.07517 • Published • 11
-
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning
Paper • 2604.04746 • Published • 72 -
ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation
Paper • 2602.00744 • Published • 12 -
Language-Guided Music Recommendation for Video via Prompt Analogies
Paper • 2306.09327 • Published • 10
-
AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation
Paper • 2602.17100 • Published • 4 -
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
Paper • 2603.01059 • Published • 1 -
Multi-Domain Riemannian Graph Gluing for Building Graph Foundation Models
Paper • 2603.00618 • Published -
Heterogeneous Agent Collaborative Reinforcement Learning
Paper • 2603.02604 • Published • 197
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper • 2601.03233 • Published • 180 -
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Paper • 2601.07832 • Published • 53 -
Motion Attribution for Video Generation
Paper • 2601.08828 • Published • 72 -
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
Paper • 2601.19895 • Published • 27
-
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10 -
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 79 -
RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 73 -
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 37
-
CoLLM: A Large Language Model for Composed Image Retrieval
Paper • 2503.19910 • Published • 15 -
LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
Paper • 2503.21541 • Published • 1 -
HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration
Paper • 2504.03536 • Published • 13 -
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
Paper • 2504.04842 • Published • 35
-
BrushEdit: All-In-One Image Inpainting and Editing
Paper • 2412.10316 • Published • 37 -
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Paper • 2412.11815 • Published • 27 -
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Paper • 2412.09611 • Published • 11 -
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Paper • 2412.07517 • Published • 11
-
Contrastive Decoding Improves Reasoning in Large Language Models
Paper • 2309.09117 • Published • 40 -
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
Paper • 2310.08491 • Published • 57 -
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Paper • 2411.04282 • Published • 37 -
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 25