Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 37
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification