Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). The three K2 variants correspond to increasing levels of reasoning effort (Low < Medium < High).

| Metric / Model | **K2 Low**<br><sub>Dense · 70B</sub> | **K2 Medium**<br><sub>Dense · 70B</sub> | **K2 High**<br><sub>Dense · 70B</sub> | **Olmo3 Think SFT**<br><sub>Dense · 32B · No RL</sub> | **Olmo3 Think**<br><sub>Dense · 32B · RL</sub> | **GLM-4.5 Air**<br><sub>MoE · 106B A12B</sub> | **MiniMax-M2**<br><sub>MoE · 230B A10B</sub> | **Qwen3 235B**<br><sub>MoE · 235B A22B · Reasoning</sub> | **Qwen 2.5 72B**<br><sub>Dense · 72B</sub> |
|------------------|------|------|------|------|------|------|------|------|------|
| **LongBench V2** | 40.7 | 41.3 | 42.6 | 42.8 | 47.1 | 49.4 | 55.8 | 60.9 | 47.2 |
| **AIME25**       | 27.3 | 62.0 | 80.2 | 68.3 | 73.3 | 81.3 | 75.8 | 84.2 | 15.2 |
| **HMMT25**       | 19.0 | 45.6 | 71.4 | 43.3 | 50.83 | 73.3 | 63.5 | 93.5 | 9.79 |
| **GSM8K**        | 92.4 | 92.0 | 94.8 | 96.1 | 95.7 | 96.1 | 95.4 | 98.0 | 85.8 |
| **Minerva**      | 85.0 | 90.6 | 94.5 | 96.9 | 97.3 | 94.9 | 85.3 | 80.7 | 82.1 |
| **GPQA-D**       | 48.5 | 60.6 | 69.3 | 58.0 | 59.8 | 75.3 | 76.2 | 96.2 | 50.5 |
| **MBPP**         | 71.0 | 75.8 | 84.8 | 87.6 | 91.6 | 82.8 | 83.8 | 94.5 | 80.0 |
| **HumanEval**    | 82.3 | 91.5 | 91.5 | 96.3 | 96.3 | 97.6 | 89.6 | 94.5 | 85.4 |
| **LCBv6**        | 39.9 | 51.3 | 67.0 | 67.9 | 67.6 | 67.8 | 79.2 | 72.8 | 36.7 |
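
The table above compares a single base model run at three reasoning-effort settings. As an illustration of how such a setting might be exercised at inference time, here is a minimal, untested sketch using Hugging Face `transformers`. The repo ID `LLM360/K2-V2` and the convention of passing the effort level through the system prompt are assumptions made for this example, not behavior documented here; check the model card for the actual interface.

```python
# Hypothetical sketch: querying K2-V2 at a chosen reasoning-effort level.
# Both the repo ID and the system-prompt convention are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "LLM360/K2-V2"  # hypothetical Hugging Face repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(question: str, effort: str = "medium") -> str:
    """Generate an answer at a given reasoning-effort level (low/medium/high)."""
    messages = [
        # Assumption: effort is selected via the system prompt.
        {"role": "system", "content": f"Reasoning effort: {effort}"},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )

print(ask("How many primes are there below 100?", effort="high"))
```
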
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.