Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). The three K2 variants correspond to increasing levels of reasoning effort (Low < Medium < High).

| Metric / Model | **K2 Low**<br><sub>Dense · 70B</sub> | **K2 Medium**<br><sub>Dense · 70B</sub> | **K2 High**<br><sub>Dense · 70B</sub> | **Olmo3 Think SFT**<br><sub>Dense · 32B · No RL</sub> | **Olmo3 Think**<br><sub>Dense · 32B · RL</sub> | **GLM-4.5 Air**<br><sub>MoE · 106B A12B</sub> | **MiniMax-M2**<br><sub>MoE · 230B A10B</sub> | **Qwen3 235B**<br><sub>MoE · 235B A22B · Reasoning</sub> | **Qwen 2.5 72B**<br><sub>Dense · 72B</sub> |
|------------------|------|------|------|------|------|------|------|------|------|
| **LongBench V2** | 40.7 | 41.3 | 42.6 | 42.8 | 47.1 | 49.4 | 55.8 | 60.9 | 47.2 |
| **AIME25**       | 27.3 | 62.0 | 80.2 | 68.3 | 73.3 | 81.3 | 75.8 | 84.2 | 15.2 |
| **HMMT25**       | 19.0 | 45.6 | 71.4 | 43.3 | 50.83 | 73.3 | 63.5 | 93.5 | 9.79 |
| **GSM8K**        | 92.4 | 92.0 | 94.8 | 96.1 | 95.7 | 96.1 | 95.4 | 98.0 | 85.8 |
| **Minerva**      | 85.0 | 90.6 | 94.5 | 96.9 | 97.3 | 94.9 | 85.3 | 80.7 | 82.1 |
| **GPQA-D**       | 48.5 | 60.6 | 69.3 | 58.0 | 59.8 | 75.3 | 76.2 | 96.2 | 50.5 |
| **MBPP**         | 71.0 | 75.8 | 84.8 | 87.6 | 91.6 | 82.8 | 83.8 | 94.5 | 80.0 |
| **HumanEval**    | 82.3 | 91.5 | 91.5 | 96.3 | 96.3 | 97.6 | 89.6 | 94.5 | 85.4 |
| **LCBv6**        | 39.9 | 51.3 | 67.0 | 67.9 | 67.6 | 67.8 | 79.2 | 72.8 | 36.7 |
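
The table above compares a single base model run at three reasoning-effort settings. As an illustration of how such a setting might be exercised at inference time, here is a minimal, untested sketch using Hugging Face `transformers`. The repo ID `LLM360/K2-V2` and the convention of passing the effort level through the system prompt are assumptions made for this example, not behavior documented here; check the model card for the actual interface.

```python
# Hypothetical sketch: querying K2-V2 at a chosen reasoning-effort level.
# Both the repo ID and the system-prompt convention are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "LLM360/K2-V2"  # hypothetical Hugging Face repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(question: str, effort: str = "medium") -> str:
    """Generate an answer at a given reasoning-effort level (low/medium/high)."""
    messages = [
        # Assumption: effort is selected via the system prompt.
        {"role": "system", "content": f"Reasoning effort: {effort}"},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=1024)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
    )

print(ask("How many primes are there below 100?", effort="high"))
```
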
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.