---
base_model:
- Unbabel/wmt22-comet-da
- xlm-roberta-large
language:
- en
- rw
library_name: comet
license: apache-2.0
pipeline_tag: text-classification
tags:
- kinyarwanda
- english
- translation
- quality-estimation
- comet
- mt-evaluation
- african-languages
- low-resource-languages
- multilingual
metrics:
- pearson
- spearman
- mae
- rmse
model-index:
- name: KinyCOMET
  results:
  - task:
      type: translation-quality-estimation
      name: Translation Quality Estimation
    dataset:
      type: custom
      name: Kinyarwanda-English QE Dataset
    metrics:
    - type: pearson
      value: 0.751
      name: Pearson Correlation
    - type: spearman
      value: 0.593
      name: Spearman Correlation
    - type: system_score
      value: 0.896
      name: System Score
---

# KinyCOMET — Translation Quality Estimation for Kinyarwanda ↔ English

![KinyCOMET Banner](https://huggingface.co/chrismazii/kinycomet_unbabel/resolve/main/banner.png)

## Model Description

KinyCOMET is a neural translation quality estimation model for Kinyarwanda-English translation pairs. It addresses the poor correlation between BLEU scores and human judgment in Kinyarwanda translation evaluation, achieving a Pearson correlation of 0.75 with human assessments.

The model was trained on 4,323 human-annotated translation pairs collected from 15 linguistics students using Direct Assessment scoring aligned with WMT evaluation standards.

## Model Variants & Performance

| Variant | Base Model | Pearson | Spearman | Kendall's τ | MAE |
|---------|------------|---------|----------|-------------|-----|
| **KinyCOMET-Unbabel** | Unbabel/wmt22-comet-da | **0.75** | **0.59** | **0.42** | **0.07** |
| **KinyCOMET-XLM** | XLM-RoBERTa-large | 0.73 | 0.50 | 0.35 | 0.07 |
| Unbabel (baseline) | wmt22-comet-da | 0.54 | 0.55 | 0.39 | 0.17 |
| AfriCOMET STL 1.1 | AfriCOMET base | 0.52 | 0.35 | 0.24 | 0.18 |
| BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
| chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |

Both KinyCOMET variants outperform existing baselines.
KinyCOMET-Unbabel shows the strongest overall correlation, while performance varies by translation direction.

## Performance Highlights

### Comprehensive Evaluation Results

**Overall Performance (Both Directions)**

- **Pearson Correlation**: 0.75 (KinyCOMET-Unbabel) vs 0.30 (BLEU) - **2.5x improvement**
- **Spearman Correlation**: 0.59 vs 0.34 (BLEU) - **73% improvement**
- **Mean Absolute Error**: 0.07 vs 0.62 (BLEU) - **89% reduction**

### Directional Analysis

| Direction | Model | Pearson | Spearman | Kendall's τ |
|-----------|-------|---------|----------|-------------|
| **English → Kinyarwanda** | KinyCOMET-XLM | **0.76** | 0.52 | 0.37 |
| **English → Kinyarwanda** | KinyCOMET-Unbabel | 0.75 | **0.56** | **0.40** |
| **Kinyarwanda → English** | KinyCOMET-Unbabel | **0.63** | **0.47** | **0.33** |
| **Kinyarwanda → English** | KinyCOMET-XLM | 0.37 | 0.29 | 0.21 |

**Key Insights:**

- Correlation with human judgments is consistently higher on English→Kinyarwanda than on Kinyarwanda→English, across all metrics
- Both KinyCOMET variants significantly outperform the AfriCOMET baselines, even though AfriCOMET's training data includes Kinyarwanda
- Surprisingly, the Unbabel baseline (not trained on Kinyarwanda) also outperforms the AfriCOMET variants

## Installation

Make sure you have Python ≥ 3.8 and install COMET via pip:

```bash
pip install unbabel-comet
```

You can verify the CLI tool is installed:

```bash
which comet-score
# should print something like: /usr/local/bin/comet-score
```

For more details on COMET, see the [official documentation](https://unbabel.github.io/COMET/html/index.html).

## Usage

### Load and Use the Model in Python

Here's a simple example to score translations directly in Python:

```python
from comet import download_model, load_from_checkpoint

# Download the public KinyCOMET checkpoint, then load it from the local path
model_path = download_model("chrismazii/kinycomet_unbabel")
model = load_from_checkpoint(model_path)

# Example translations
samples = [
    {
        "src": "Umugabo ararya.",
        "mt": "The man is eating.",
        "ref": "The man is eating."
    },
    {
        "src": "Umwana arasinzira.",
        "mt": "A dog sleeps.",
        "ref": "The child is sleeping."
    }
]

# Predict segment-level and system-level scores (CPU; set gpus=1 for GPU)
pred = model.predict(samples, gpus=0)
print(pred)
```

**Output Example:**

```python
Prediction({
    'scores': [0.9899, 0.8813],
    'system_score': 0.9356
})
```

### Using the Command Line Interface (CLI)

You can also evaluate translations directly from the terminal.

**Step 1: Create the text files**

```bash
cat > source.txt <<'SRC'
Umugabo ararya.
Umwana arasinzira.
Uyu mwanya neza cyane.
SRC

cat > reference.txt <<'REF'
The man is eating.
The child is sleeping.
This place is very nice.
REF

cat > hypothesis.txt <<'HYP'
The man is eating.
A dog sleeps.
This place is very nice.
HYP
```

**Step 2: Run KinyCOMET**

```bash
comet-score -s source.txt -r reference.txt -t hypothesis.txt \
    --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json
```

**Step 3: View the results**

```bash
cat results.json
```

### Score Interpretation

- **Scores range from 0 to 1**: higher scores indicate better translation quality
- **System score**: average quality across all translations
- **Segment scores**: individual quality scores for each translation pair
- **Threshold guidance**: scores above 0.8 typically indicate high-quality translations

## Training Details

### Data

- 4,323 human-annotated Kinyarwanda-English translation pairs
- Annotations collected from 15 linguistics students
- Direct Assessment scoring following WMT standards
- Split: 80% train (3,497) / 10% validation (404) / 10% test (422)
- Domains: education and tourism

### Model Architecture

- **Base Models**: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
- **Framework**: COMET quality estimation framework

### Training Configuration

- **Methodology**: COMET framework with Direct Assessment supervision
- **Evaluation Metrics**: Kendall's τ and Spearman ρ correlation with human DA scores

### MT System Benchmarking Results

We evaluated several production MT systems using KinyCOMET:

| MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
|-----------|:-------------------:|:-------------------:|:-------:|
| **GPT-4o** | **93.10%** ± 7.77 | 87.83% ± 11.15 | 90.69% ± 9.82 |
| **GPT-4.1** | 93.08% ± 6.62 | **87.92%** ± 10.38 | 90.75% ± 8.90 |
| **Gemini Flash 2.0** | 91.46% ± 11.39 | 90.02% ± 8.92 | **90.80%** ± 10.35 |
| **Claude 3.7** | 92.48% ± 8.32 | 85.75% ± 11.28 | 89.43% ± 10.33 |
| **NLLB-1.3B** | 89.42% ± 12.04 | 83.96% ± 16.31 | 86.78% ± 14.52 |
| **NLLB-600M** | 88.87% ± 12.11 | 75.46% ± 28.49 | 82.71% ± 22.27 |

**Key Findings:**

- LLM-based systems significantly outperform traditional neural MT
- All systems translate Kinyarwanda→English better than English→Kinyarwanda

## Dataset Access

The training dataset is available separately. See the [KinyCOMET Dataset Card](https://huggingface.co/datasets/chrismazii/kinycomet_dataset) for details on accessing the human-annotated quality estimation data.

## Citation & Research

If you use KinyCOMET in your research, please cite:

```bibtex
@misc{kinycomet2025,
  title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
  author={Prince Chris Mazimpaka and Jan Nehring},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
}
```

## License

This model is released under the Apache 2.0 License.
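The score-interpretation guidance above suggests 0.8 as a rough threshold for high-quality output. A minimal sketch of applying that threshold to segment scores, using illustrative numbers rather than real model output (the helper name `flag_low_quality` is ours, not part of the COMET API):

```python
# Sketch: flag segments that fall below the 0.8 quality threshold
# suggested in the score-interpretation guidance. The scores here are
# illustrative placeholders, not real KinyCOMET output.

def flag_low_quality(scores, threshold=0.8):
    """Return indices of segments scoring below the threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]

segment_scores = [0.9899, 0.8813, 0.41]
print(flag_low_quality(segment_scores))  # -> [2]
```

Segments flagged this way can then be routed to human review or post-editing.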
## Acknowledgments

- **COMET Framework**: Built on the excellent [COMET quality estimation framework](https://unbabel.github.io/COMET/html/index.html)
- **Base Models**: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
- **African NLP Community**: Inspired by ongoing efforts to advance African language technologies
- **Contributors**: Thanks to the 15 linguistics students and all researchers who made this work possible

---

**Resources:**

- [COMET Documentation](https://unbabel.github.io/COMET/html/index.html)
- [Dataset Card](https://huggingface.co/datasets/chrismazii/kinycomet_dataset)
- [Model Files](https://huggingface.co/chrismazii/kinycomet_unbabel/tree/main)
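The mean ± standard deviation summaries in the MT benchmarking table above can be reproduced from any list of segment-level KinyCOMET scores. A small sketch with placeholder scores (the table does not state whether the reported figures use the sample or population standard deviation; this assumes the sample version):

```python
# Sketch: summarising segment-level KinyCOMET scores as "mean% ± std",
# the format used in the benchmarking table. Scores are placeholders,
# and stdev() here is the sample standard deviation (an assumption).
from statistics import mean, stdev

segment_scores = [0.92, 0.88, 0.95, 0.81]
summary = f"{mean(segment_scores) * 100:.2f}% ± {stdev(segment_scores) * 100:.2f}"
print(summary)  # -> 89.00% ± 6.06
```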