Add library_name metadata and link to paper #3
by nielsr (HF Staff) - opened
README.md CHANGED

````diff
@@ -1,17 +1,15 @@
 ---
-license: mit
 base_model:
 - moonshotai/Kimi-VL-A3B-Instruct
+license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
-
 <div align="center">
 <img width="30%" src="figures/logo.png">
 </div>
 
-
 ## Introduction
 
 We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
@@ -26,6 +24,8 @@ Kimi-VL also advances the pareto frontiers of multimodal models in processing lo
 
 Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.
 
+More information can be found in our technical report: [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491).
+
 ## Architecture
 
 The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.
@@ -59,7 +59,6 @@ Full comparison on MMMU, MathVision, and MathVista-mini:
 
 <div align="center">
 
-
 | Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-72B | Qwen2.5-VL-7B | Gemma-3-27B | Gemma-3-12B | o1-1217 | QVQ-72B | Kimi-k1.5 | Kimi-VL-Thinking-A3B |
 |---------------------------------|--------|-------------|----------------|---------------|-------------|-------------|---------|----------|-----------|----------------------|
 | *Thinking Model?* | | | | | | | ✅ | ✅ | ✅ | ✅ |
@@ -67,7 +66,6 @@ Full comparison on MMMU, MathVision, and MathVista-mini:
 | MathVista (mini) (Pass@1) | 63.8 | 56.7 | 74.8 | 68.2 | 62.3 | 56.4 | 71.0 | 71.4 | 74.9 | 71.3 |
 | MMMU (val) (Pass@1) | 69.1 | 60.0 | 74.8 | 58.6 | 64.8 | 59.6 | 77.3 | 70.3 | 70.0 | 61.7 |
 
-
 </div>
 
 ### Inference with 🤗 Hugging Face Transformers
@@ -113,7 +111,7 @@
 
 We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. You are welcome to deploy Kimi-VL using the branch corresponding to the vLLM MR until the MR is merged.
 
-## Citation
+## 8. Citation
 
 ```
 @misc{kimiteam2025kimivltechnicalreport,
@@ -125,5 +123,4 @@ We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2504.07491},
 }
-```
-
+```
````
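The `library_name: transformers` and `pipeline_tag: image-text-to-text` fields added in this diff are what let the Hub surface a Transformers usage snippet and the image-text-to-text widget for the checkpoint. The card's own example under "Inference with 🤗 Hugging Face Transformers" is not part of the changed hunks; as a rough sketch of Transformers-based inference for this family of checkpoints, assuming the repo id `moonshotai/Kimi-VL-A3B-Thinking` (not stated in the diff) and the remote modeling/processing code that Kimi-VL cards ship, it could look roughly like:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id for this card; adjust if using another Kimi-VL checkpoint.
model_path = "moonshotai/Kimi-VL-A3B-Thinking"

# trust_remote_code is needed because the MoE decoder and MoonViT encoder
# are implemented in the repository rather than in the Transformers library.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Hypothetical local image path, used purely for illustration.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "What is in this image? Think step by step."},
        ],
    }
]

# Build the chat prompt, then encode text and image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(response)
```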
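For vLLM, until MR [#16387](https://github.com/vllm-project/vllm/pull/16387) is merged, deployment requires installing vLLM from the branch behind that MR (its name is not given here). A rough sketch, not a snippet from the MR, of offline inference through vLLM's generic multimodal chat API could then look like the following, again assuming the `moonshotai/Kimi-VL-A3B-Thinking` repo id and an image passed by URL purely for illustration:

```python
from vllm import LLM, SamplingParams

# Assumed repo id; requires a vLLM build that includes Kimi-VL support,
# i.e. the branch behind vllm-project/vllm#16387 until it is merged.
llm = LLM(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 1},  # one image per request in this sketch
)

sampling = SamplingParams(temperature=0.6, max_tokens=1024)

# OpenAI-style chat messages; the image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```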