Vocabulary-Augmented Prompting for Sango — Production African Language AI Without a Parallel Corpus
TL;DR — Sango is the national language of the Central African Republic, spoken by 5 million people. In May 2026, Google Translate added Sango — a meaningful moment that brings the language into the mainstream AI conversation. What it also shows is where the deeper work lies: domain-specific vocabulary, grammar explanation, learning infrastructure, community-verified accuracy. We built SangoAI to address exactly that layer, using vocabulary-augmented prompting with a frontier LLM — no fine-tuning, no parallel corpus, no training compute. This post explains the method, the dataset (MEYNG/sango-vocabulary), the production lessons, and the transferable recipe for the ~2,000 African languages that still need this kind of specialized infrastructure.
Author: Michel WENEZOUI — Founder, MEYNG · sangoai.sbs · meyng.com
1. The problem — and why it's bigger than translation
On May 6, 2026, Google Translate added Sango to its supported language list. That is genuinely good news for the ecosystem — it brings Sango into the mainstream AI conversation and reduces the barrier for any organization that needs a first pass at Sango text.
It also brings into focus a structural reality of NMT for zero-resource languages. NMT quality is proportional to parallel corpus size. For Sango, the available digital corpus consists primarily of religious texts, a small set of NGO/UN documents, and Wikipedia stubs — perhaps a few megabytes of parallel data in total. NMT trained on that corpus naturally excels at the vocabulary those sources cover. The oral language of Bangui's daily markets, specialized medical terminology, humanitarian field protocols, and cultural register distinctions require dedicated vocabulary infrastructure that training data alone, at this scale, cannot yet provide. This is a data constraint, not an architectural one — and it is the same constraint that faces every zero-resource African language.
We built SangoAI to address the infrastructure layer above general-purpose translation: a platform that teaches the language, serves field organizations with domain-specific vocabulary, and is built by people rooted in Central Africa. The two approaches are complementary — general-purpose translation handles common vocabulary; specialized infrastructure handles the depth that field organizations and learners actually need.
The scale of what remains: 5 million Sango speakers. Every market in Bangui, every school, every hospital operates in Sango. Humanitarian organizations (MSF, UNICEF, WHO, WFP) operating in CAR need AI translation capability in the language their beneficiaries actually speak. Every digital service, from government portals to mobile banking, remains available only in French — a language that excludes the majority of CAR citizens.
This isn't an edge case. Sango is representative of roughly 2,000 African languages that still need this kind of specialized infrastructure — vocabulary curation, domain accuracy, learning platforms, audio, community verification. General-purpose translation is the starting line. The rest of the work is what this post is about.
2. Why low-resource NLP is hard
The classical neural machine translation (NMT) playbook is well-known:
- Collect a large parallel corpus of source ↔ target sentence pairs (≥10M pairs is the rough comfort zone)
- Train or fine-tune a sequence-to-sequence model on that corpus
- Evaluate with BLEU / chrF / COMET, deploy, iterate
For Sango, step 1 is a non-starter.
There is no Sango–French parallel corpus of usable size. There is no Sango Wikipedia of useful scale (fewer than 200 articles, most stubs). A few hundred scholarly papers contain Sango text — behind academic paywalls, some from the colonial period, with inconsistent orthography and OCR errors compounding each other. The total quantity of digitized Sango text available on the open internet is measured in megabytes, not gigabytes.
This breaks every assumption the field is built on. Meta's NLLB (No Language Left Behind) project is admirable and Sango is nominally covered, but output quality is research-grade, not production-grade. Masakhane, the African NLP community, has built excellent datasets for many African languages — but their focus is open research, not commercial APIs, and Sango specifically has limited Masakhane coverage.
Fine-tuning has its own problem: even if you could scrape together 50,000 clean Sango sentences (we couldn't), you would need expensive training compute, a model you control, ongoing MLOps infrastructure, and someone to maintain the training pipeline. For a solo-founder project, that multiplies engineering headcount and costs by 10×. It also locks you into one-language-at-a-time scaling: every new language is another training run, another evaluation cycle, another set of hyperparameters.
We chose a different path: vocabulary-augmented prompting with a frontier general-purpose LLM, grounded in a tightly curated vocabulary database and language-specific system prompts.
This approach trades theoretical elegance for production traction. It is not what gets published at ACL. But it turns out to be the only thing that actually works for languages at Sango's resource level today, and it scales to new low-resource languages with roughly 3–4 months of curated vocabulary work per language, rather than 6–12 months of data engineering plus compute costs per language.
Here is exactly how we did it.
3. Our approach — vocabulary-augmented prompting
The core insight: for a zero-resource language, the most valuable thing you can give a frontier LLM at inference time is a small, high-quality lexical grounding — not a fine-tune.
The specific contribution of this post: for a language at Sango's data level, a sub-1,000-entry native-speaker-verified lexicon plus a small number of hand-written grammar rules, injected into a frontier instruction-following LLM's context at inference time, yields production-quality in-domain translation — without fine-tuning, parallel corpus, or training compute. The recipe (curated lexicon + rule prompt + frontier LLM + per-word uncertainty marking → user-driven dataset growth loop) has not, to our knowledge, been published as a named methodology for zero-resource MT. We call it vocabulary-augmented prompting, and we think it generalizes to most low-resource African languages that share Sango's data-poor starting conditions. The rest of this section is how the pipeline is actually wired.
Our pipeline on each translation request:
- Lexical retrieval: the source text is tokenized (diacritic-aware) and matched against our curated Sango vocabulary (581 verified entries as of May 2026).
- Context assembly: matched entries — word, translation, part of speech, example sentence if present — are assembled into a compact few-shot block.
- Language-rule system prompt: a short, hand-written specification of Sango's relevant grammar (SVO word order, tonal diacritics â ê ë î ö ô û ü, copula usage, plural marking via the preposed prefix â- (e.g. zo "person" → âzo "people") rather than by suffix, borrowed Ngbandi structure for low-frequency constructions).
- Frontier LLM call: retrieval block + rules + user input go to a general-purpose LLM. Output returns directly to the user.
No embeddings. No vector DB. No fine-tuning. No training compute. The entire retrieval layer is a lexical inverted index over 581 production-verified entries — fits in memory, queries in microseconds.
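For concreteness, the retrieval layer can be sketched as follows. This is an illustrative minimal version, not our production code: the entry shape, the `norm` helper, and the `LexicalIndex` name are assumptions for the example.

```python
import unicodedata

def norm(token: str) -> str:
    # NFC-compose so tonal diacritics (â, ë, ...) are single code points,
    # then lowercase. We never strip the diacritics themselves.
    return unicodedata.normalize("NFC", token).lower()

class LexicalIndex:
    """In-memory inverted index over the verified vocabulary entries."""

    def __init__(self, entries: list[dict]):
        self._by_headword: dict[str, list[dict]] = {}
        for entry in entries:
            self._by_headword.setdefault(norm(entry["sango"]), []).append(entry)

    def lexical_match(self, text: str) -> list[dict]:
        # Diacritic-aware token lookup; punctuation handling omitted for brevity.
        matches, seen = [], set()
        for token in text.split():
            for entry in self._by_headword.get(norm(token), []):
                if id(entry) not in seen:
                    seen.add(id(entry))
                    matches.append(entry)
        return matches

index = LexicalIndex([
    {"sango": "âzo", "french": "personnes"},
    {"sango": "zo", "french": "personne"},
])
print(index.lexical_match("âzo ayeke"))  # matches only the entry for "âzo"
```

A dict lookup per token is what keeps queries in the microsecond range at this vocabulary size.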
Here's a simplified version of the system prompt assembly:
def build_sango_to_french_prompt(source_text: str, vocab_db) -> str:
    matches = vocab_db.lexical_match(source_text)  # O(n) over 581 entries
    examples = "\n".join(
        f"- {m.sango} → {m.french} ({m.part_of_speech}"
        + (f"; e.g. \"{m.example_sango}\" / \"{m.example_french}\"" if m.example_sango else "")
        + ")"
        for m in matches[:15]  # Cap context; LLM attention is precious
    )
    return f"""You are a Sango-French translator.

Sango language rules:
- Word order is SVO (subject-verb-object).
- Tonal diacritics (â, ê, ë, î, ö, ô, û, ü) carry meaning — preserve them exactly in both directions.
- Plural is marked by the preposed prefix `â-` (e.g. `âzo` = "people", from singular `zo` = "person"), not by suffix.
- Default to literal mapping; Sango grammar is closer to French than to English for most constructions.
- If a source word is not in the vocabulary below and not a proper noun, mark the translation with [uncertain:word].

Verified vocabulary matches for this sentence:
{examples}

Translate the following Sango text to French. Return only the French translation — no commentary.

Sango: {source_text}
French:"""
Three properties of this design matter:
It's model-agnostic. The same prompt structure works with any frontier instruction-following LLM. We've validated it across multiple foundation models with only small accuracy deltas — meaning the moat is the vocabulary + rules, not vendor lock-in.
It's language-agnostic. Adding Ewondo or Lingala is a vocabulary-curation project plus ~3 lines of grammar rules plus a new prompt template. No retraining. No new infrastructure.
It's honest about what it doesn't do. The [uncertain:word] marker surfaces gaps to the user and feeds directly into our dataset growth roadmap. We know exactly what we're missing because the users tell us.
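Harvesting those markers from model output is a one-regex job. A sketch; the helper names are illustrative, and only the `[uncertain:word]` convention itself comes from the prompt:

```python
import re

UNCERTAIN = re.compile(r"\[uncertain:([^\]]+)\]")

def extract_uncertain(translation: str) -> list[str]:
    """Words the model flagged as outside the verified vocabulary."""
    return UNCERTAIN.findall(translation)

def strip_markers(translation: str) -> str:
    """Clean text for the user; the flagged words go to the curation queue."""
    return UNCERTAIN.sub(lambda m: m.group(1), translation)

out = "Le [uncertain:ngunza] est prêt."
print(extract_uncertain(out))  # ['ngunza']
print(strip_markers(out))      # 'Le ngunza est prêt.'
```

The extracted words, aggregated across real traffic, become a ranked to-curate list for the next vocabulary batch.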
Qualitative feedback from native-speaker testers on CAR-context Sango sentences (greetings, basic commerce, health vocabulary, primary-school curriculum) has been consistently positive since launch — but we haven't yet built a reproducible evaluation harness, and we flag that honestly here. Building a shared, adversarial, native-speaker-rated eval set for Sango is one of the collaborations we'd most like to start: see section 6. What we can say is that most major translation platforms currently return no output for Sango — the language simply isn't in their training data at useful scale. Google's May 2026 addition is a meaningful step for the ecosystem. Our focus is the specialized layer above foundational translation: domain-specific terminology, grammar explanation, and the learning infrastructure that NGOs and learners working in the field actually need. The two are complementary: the better general-purpose Sango translation gets, the more useful specialized vocabulary infrastructure becomes for the cases that matter most.
What this method does not do. It doesn't handle long-form literary translation — attention quality degrades past roughly three sentences with a 15-entry lexical block. It doesn't do speech (we have no native-speaker audio corpus yet, which is why pronunciation practice is currently disabled in production). It doesn't beat a properly fine-tuned NLLB checkpoint on BLEU for the languages where NLLB has real training data to work with; we are competitive specifically in the zero-data regime where fine-tuning isn't available as an option. And "consistently positive qualitative feedback" is not the same as "competitive on a standardized benchmark" — we would welcome collaboration on an adversarial Sango evaluation set, which is the single most useful thing the research community could contribute to this methodology's credibility.
4. Technical gotchas we hit
Three lessons from six months of production traffic:
1. UTF-8 double-encoding corrupts your diacritics in silence. Somewhere in a typical managed-cloud-function pipeline, Sango text with tonal diacritics (makâko, köndöngö, Bêafrîka) gets interpreted as Latin-1, then re-encoded as UTF-8 — producing makÃ¢ko instead of makâko. This is invisible in Python logs (terminals render mojibake as if it were correct) and surfaces only when a user sees it in the UI. We originally applied a fix at the component level, then audited six months later and found multiple components still leaked mojibake. The real fix is to sanitize at the data layer:
// frontend/src/utils/text.ts — the one canonical fixer
export function fixMojibake(input: string): string {
  if (!input || !looksLikeMojibake(input)) return input;
  try {
    // Round-trip: interpret characters as Latin-1 bytes, re-decode as UTF-8.
    const bytes = Uint8Array.from(input, (c) => c.charCodeAt(0) & 0xff);
    return new TextDecoder("utf-8").decode(bytes);
  } catch {
    return replacementFallback(input); // character-map fallback
  }
}
Then we moved every API response through fixMojibake at a single choke point — the API client — so no component could forget. A CI lint rule now flags any new component that fetches .sango / .french fields without going through the canonical fix. If you find the same bug class twice, move the fix one layer deeper.
2. Tokenizers disagree on diacritics. Off-the-shelf BPE tokenizers trained on English/French-dominant corpora treat â as a byte-boundary and split makâko into three or four awkward subwords. This inflates your prompt token count (sometimes 2×) and — more subtly — changes the model's character-level reasoning about the word's morphology. We now pre-normalize Sango text in a way that preserves diacritics as single grapheme units for display and prompting, and we publish this normalization logic alongside the dataset so researchers can replicate exactly.
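The composition step at the heart of that pre-normalization is standard Unicode NFC. A minimal sketch; the published normalization logic covers more cases than this:

```python
import unicodedata

def normalize_sango(text: str) -> str:
    # NFC composes base letter + combining mark into one code point
    # (a + U+0302 -> â), so each diacritic travels as a single grapheme unit.
    return unicodedata.normalize("NFC", text)

decomposed = "maka\u0302ko"            # 'a' followed by a combining circumflex: 7 code points
composed = normalize_sango(decomposed)  # "makâko": 6 code points
print(len(decomposed), len(composed))   # 7 6
```

Composing before tokenization keeps display, matching, and prompt assembly all operating on the same character sequence.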
3. Soft-delete consistency bit us hard. Our admin UI sets status=deleted AND deleted_at when a vocabulary entry is removed. But ad-hoc cleanup scripts (run over eight months by multiple contributors) only set status=deleted. The public API filtered on deleted_at IS NULL only. Result: soft-deleted entries leaked into the public dataset. When we finally audited the full table, the claimed count was 998 rows, and it took two passes to get to ground truth: a soft-delete filter fix, then a deduplication sweep (spelling variants and cross-batch duplicates had accumulated). Net result: 998 claimed → 740 in the table → 611 → 581 production-verified and active today. The fix for each pass was cheap; finding the compound problem took weeks. Lesson: for any soft-delete scheme, all deletion paths must set all markers, all filter queries must check all markers, and you should run an explicit deduplication audit before you publish a word-count claim anywhere. Document the canonical protocol once and make violations lint-catchable.
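The invariant reduces to two small predicates. Field names follow the description above; the audit helper is an illustrative sketch, not our migration code:

```python
def is_active(row: dict) -> bool:
    # Canonical liveness check: a row is dead if EITHER marker says so.
    return row.get("status") != "deleted" and row.get("deleted_at") is None

def audit_markers(rows: list[dict]) -> list[dict]:
    # Rows where the two deletion markers disagree -- each one is a leak risk.
    return [
        r for r in rows
        if (r.get("status") == "deleted") != (r.get("deleted_at") is not None)
    ]

rows = [
    {"id": 1, "status": "verified", "deleted_at": None},
    {"id": 2, "status": "deleted", "deleted_at": "2026-01-04"},
    {"id": 3, "status": "deleted", "deleted_at": None},  # the bug class
]
print([r["id"] for r in rows if is_active(r)])  # [1]
print([r["id"] for r in audit_markers(rows)])   # [3]
```

Running the audit query on a schedule (not once) is what catches the scripts that only set one marker.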
5. The dataset — MEYNG/sango-vocabulary
581 Sango entries (production-verified and active in the live system), published as a HuggingFace dataset under a permissive license for research and commercial reuse.
from datasets import load_dataset

ds = load_dataset("MEYNG/sango-vocabulary")
# DatasetDict({
#     'train': Dataset({
#         features: ['sango', 'french', 'english', 'category', 'difficulty',
#                    'example_sango', 'example_french', 'pronunciation', 'source'],
#         num_rows: 581
#     })
# })
How it was built:
Textbook extraction. The primary source is the Kîrîndönî primer — the CAR primary-school Sango curriculum, 50 pages, covering core vocabulary, greetings, grammar, numbers, body parts, animals, foods, verbs of daily life. Four batches of photo-captured pages OCR'd and cleaned into structured CSVs.
Native-speaker review. Every entry cross-referenced with at least one native Sango speaker. Approximately 15% of textbook OCR extractions were corrected at this stage: typos in the source, OCR artifacts on tonal diacritics, and a handful of substantive corrections where the textbook reflected an older orthographic convention than current usage (sawa → sêwa for "family," for example).
Deduplication. Merged across batches; verified orthographic consistency against the standardized CAR Sango orthography developed by linguist Marcel Diki-Kidiri and adopted by CAR's Ministry of Education in the 1980s (seven phonemic vowels a e ɛ i o ɔ u, usually written a e ë i o ö u, plus the tonal diacritics on vowels).
Promotion. Entries pass through pending → review → verified. Only verified entries are served by the production API and included in the public dataset. The three-state workflow means community contributions can be accepted without compromising the published quality bar.
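The promotion gate is cheap to enforce in code. A minimal sketch, assuming entries advance one step at a time; the function names are illustrative, not our review tooling:

```python
WORKFLOW = ["pending", "review", "verified"]

def can_promote(current: str, target: str) -> bool:
    # Entries advance exactly one step at a time; no skipping review.
    i, j = WORKFLOW.index(current), WORKFLOW.index(target)
    return j == i + 1

def servable(entry: dict) -> bool:
    # Only verified entries reach the production API and the public dataset.
    return entry.get("status") == "verified"

print(can_promote("pending", "review"))    # True
print(can_promote("pending", "verified"))  # False
```

Keeping the serving filter (`servable`) separate from the transition check means a contribution can sit in review indefinitely without ever risking the published quality bar.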
What the schema is not: this is not a parallel sentence corpus. It's a lexical database with illustrative example sentences. For the prompt-augmentation method described in section 3, a lexical database is what you need — parallel sentences are a different tool for a different job.
Growth roadmap. We target 2,000 verified entries by end of 2026 and 5,000 by end of 2027. If you are a native Sango speaker, a linguist working on Ubangian languages, or a researcher interested in contributing, the dataset accepts PRs through HuggingFace — and we specifically welcome domain-specific additions (medical, legal, agricultural, financial vocabulary).
6. What's next
Sango is language one. Our roadmap is geographic rather than popularity-driven, because clustering by region compounds better:
- Ewondo (Cameroon, ~2M speakers, Beti cluster) — Q3 2026. Personal connection to Cameroon means faster native-speaker access; Ewondo also unlocks a Central Africa cluster story for partners operating across CAR + Cameroon (Orange, MSF, UNICEF).
- Lingala (DRC / Republic of Congo, ~25M speakers) — H1 2027. Completes the Central Africa language triad.
- Wolof (Senegal / Gambia, ~10M speakers) — H2 2027.
- Bambara (Mali / Burkina Faso, ~14M speakers) — Q2 2028.
- Kirundi (Burundi, ~12M speakers) — Q3 2028.
The engineering pattern is identical for every language: curate vocabulary, write grammar rules, fit to the prompt template, ship. No retraining, no new infrastructure, no MLOps complexity per language. That's what makes the methodology scale to 2,000+ languages — and also what makes it worth publishing, because any researcher or NGO can apply it to their language of interest without building our stack.
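One way to make that per-language pattern concrete is a small configuration object. This is an illustrative sketch, not shipped code; the dataclass and its fields are assumptions, and the rules shown are the Sango rules from section 3:

```python
from dataclasses import dataclass, field

@dataclass
class LanguageConfig:
    """Everything the pipeline needs per language -- no retraining involved."""
    name: str
    grammar_rules: list[str]                          # the short hand-written spec
    vocab: list[dict] = field(default_factory=list)   # curated, verified entries

    def rules_block(self) -> str:
        # Rendered straight into the system prompt for this language.
        header = f"{self.name} language rules:"
        return "\n".join([header] + [f"- {r}" for r in self.grammar_rules])

sango = LanguageConfig(
    name="Sango",
    grammar_rules=[
        "Word order is SVO (subject-verb-object).",
        "Tonal diacritics carry meaning; preserve them exactly.",
        "Plural is marked by the preposed prefix `â-`, not by suffix.",
    ],
)
print(sango.rules_block())
# Adding a new language is one more LanguageConfig plus its curated vocabulary.
```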
We are actively looking for native-speaker contributors and academic partners for each of the roadmap languages. If you'd like to collaborate on one of them, reach out.
Sango is language one of roughly 2,000 that AI currently cannot read. The method in this post isn't the final answer — it's the minimum viable answer, shipped to production for one language, and written down so others can apply it to the next. If you're working on language 2, 17, or 1,847, the dataset is open, the prompt template is above, and the door is open. The sooner this stops being novel, the better.
7. Try it, use it, cite it
Live translation — sangoai.sbs — free web app, no account required for Sango ↔ French ↔ English translation.
WhatsApp learning bot — send APPRENDRE to +237 658 763 678. Published through Meta's WhatsApp Cloud API; any WhatsApp user can reach it.
Dataset — MEYNG/sango-vocabulary on HuggingFace.
NPM package — @meyng/sango-nlp — tokenizer, language detection, stemmer. Diacritic-aware by design.
Cite as:
@misc{wenezoui2026sangoai,
  author       = {Wenezoui, Michel},
  title        = {SangoAI: Vocabulary-Augmented Prompting for Zero-Resource
                  African Language Translation},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/blog/MEYNG/sangoai}},
  note         = {Dataset: \url{https://huggingface.co/datasets/MEYNG/sango-vocabulary}}
}
Acknowledgments
This work would not exist without the native Sango speakers who patiently corrected our textbook extractions, the CAR primary-school curriculum that provided a foundational lexicon, and the open-source HuggingFace community that makes publishing low-resource language datasets feasible for solo developers. Special thanks to the Masakhane community for setting the standard for African NLP contribution workflows, and to the linguistics faculty at INALCO (Paris) and Université de Yaoundé I for informal guidance on Ubangian language structure.
Michel WENEZOUI is the founder of MEYNG, an African language AI infrastructure company. He is a native Sango speaker from the Central African Republic with family ties to Cameroon. If you're building AI for an under-resourced language — or you want to — his LinkedIn DMs are open: linkedin.com/in/mwenezoui.