Arabic Leaderboards
Comprehensive Evaluation of Arabic Large Language Models
| Rank | Model Name | 3C3H Score | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness |
|---|---|---|---|---|---|---|---|---|
| 100 | CohereForAI/c4ai-command-r7b-arabic-02-2025 | 70.25 | 75.88 | 70.98 | 51.25 | 72.55 | 75.25 | 75.59 |
This table is sorted by the first task (Question Answering).
| Rank | Model Name | Question Answering (QA) | Orthographic and Grammatical Analysis | Safety | Reasoning |
|---|---|---|---|---|---|
| 100 | CohereForAI/c4ai-command-r7b-arabic-02-2025 | 60.51 | 45.28 | 94.37 | 87.38 |
| Rank | Model Name | Average Accuracy (Ar) | Ar Prompt-lvl | Ar Instruction-lvl | Average Accuracy (En) |
|---|---|---|---|---|---|
| 10 | CohereForAI/c4ai-command-r7b-12-2024 | 75.9 | 72.5 | 79.3 | 87.1 |
About
In our 12-24 release, we introduced the AraGen Benchmark along with the 3C3H evaluation measure (also known as the 3C3H Score). You can find more details about AraGen and 3C3H here, and the first version of the benchmark, AraGen-12-24, here. Building on that foundation, this new release expands the space to incorporate additional tasks and evaluation metrics.
In this release, we present two leaderboards:
AraGen-03-25 (v2):
- The AraGen Benchmark is designed to evaluate and compare the performance of Chat/Instruct Arabic Large Language Models on a suite of generative tasks that are culturally relevant to the Arab region, covering its history, politics, cuisine, and more. By using 3C3H as the evaluation metric, which assesses a model's output across six dimensions (Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness), the leaderboard offers a holistic evaluation of a model's chat capabilities and its ability to generate human-like and ethically responsible content.
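As a rough illustration of how the six dimensions relate to the headline number, the sketch below averages the per-dimension scores of the CohereForAI/c4ai-command-r7b-arabic-02-2025 row from the 3C3H table. That row is consistent with an unweighted mean, but this is only an assumption drawn from the published figures: the actual scoring pipeline is not described here and may aggregate per sample or apply additional rules. The function name `aggregate_3c3h` is hypothetical.

```python
from statistics import mean

# The six 3C3H dimensions named in the leaderboard description.
DIMENSIONS = (
    "correctness",
    "completeness",
    "conciseness",
    "helpfulness",
    "honesty",
    "harmlessness",
)


def aggregate_3c3h(scores: dict[str, float]) -> float:
    """Unweighted mean of the six dimension scores (assumed aggregation)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(mean(scores[d] for d in DIMENSIONS), 2)


# Per-dimension values copied from the leaderboard row above.
row = {
    "correctness": 75.88,
    "completeness": 70.98,
    "conciseness": 51.25,
    "helpfulness": 72.55,
    "honesty": 75.25,
    "harmlessness": 75.59,
}

print(aggregate_3c3h(row))  # matches the 3C3H Score column for this row (70.25)
```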
Instruction Following:
- We have established a leaderboard that benchmarks models on Arabic and English instruction following, offering an open and comparative performance landscape for the research community. Concurrently, we released the first publicly available Arabic dataset aimed at evaluating LLMs' ability to follow instructions. The Arabic IFEval samples are curated to capture the language's unique nuances, such as diacritization and distinctive phonetic features, that are often overlooked in generic datasets. Our dedicated linguistic team generated original samples and adapted selections from the IFEval English dataset, ensuring that the material resonates with Arabic cultural contexts and meets high standards of authenticity and quality.
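The instruction-following table reports prompt-level and instruction-level accuracy. A minimal sketch of how these two figures are commonly computed in IFEval-style evaluation follows, assuming each response has already been checked against every instruction in its prompt. The function `ifeval_accuracy` and the toy data are hypothetical, and the leaderboard's own checkers and any strict or loose variants are not shown. In the table above, the Arabic average (75.9) is consistent with the mean of the prompt-level (72.5) and instruction-level (79.3) columns.

```python
def ifeval_accuracy(results: list[list[bool]]) -> tuple[float, float]:
    """Return (prompt_level, instruction_level) accuracy.

    results[i][j] is True when the response to prompt i satisfied
    instruction j of that prompt.
    """
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("every prompt needs at least one instruction result")

    # Prompt-level: a prompt counts only if all of its instructions were followed.
    prompt_level = sum(all(r) for r in results) / len(results)

    # Instruction-level: each instruction counts on its own.
    total_instructions = sum(len(r) for r in results)
    instruction_level = sum(sum(r) for r in results) / total_instructions

    return prompt_level, instruction_level


# Toy data: three prompts carrying 2, 2, and 1 instructions respectively.
toy_results = [[True, True], [True, False], [True]]
prompt_acc, instruction_acc = ifeval_accuracy(toy_results)
average_acc = (prompt_acc + instruction_acc) / 2
print(prompt_acc, instruction_acc, average_acc)
```

Prompt-level accuracy is the stricter of the two, since a single missed instruction fails the whole prompt, which is why it is usually the lower number.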
Why Focus on Chat Models?
Our evaluations are conducted in a generative mode, meaning that we expect models to produce complete, context-rich responses rather than simply predicting the next token as base models do. This approach yields results that are more explainable and nuanced than logit-based measurements, and it also captures elements like creativity, coherence, and ethical considerations, providing deeper insights into overall model performance.
Contact
For inquiries or assistance, please join the conversation on our Discussions Tab or reach out via email.
Submit Your Model for Evaluation
Evaluation Status
| model_name | license | revision | precision | status | params | modality |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-405B-Instruct-FP8 | cc-by-nc-4.0 | main | bfloat16 | REJECTED | (invalid entry: a JSON chat configuration with a system prompt was submitted instead of a parameter count) | Text |
| model_name | license | revision | precision | status | params | modality |
|---|---|---|---|---|---|---|
| cerebras/8B_7a_wd0_2_8b_bs512_d2z_with_dpo_set3 | Apache license 2.0 | 7ab60588e1f3a649758979025ed07add3b05c3ea | bfloat16 | FINISHED | 8.03 | Text |
| model_name | license | revision | precision | status | params | modality |
|---|---|---|---|---|---|---|
| omarxadel/Arabic-Tokenskip-Qwen2.5-7B-Instruct | apache-2.0 | main | bfloat16 | FAILED | 235 | Text |