# 🏔️ Kashmiri Tokenizer Suite
The first systematic tokenization infrastructure for Kashmiri (ISO 639-3: `kas`): five tokenizer architectures trained on the KS-LIT-3M corpus (3.1M words).
| Folder | Architecture | Vocab | Best For |
|---|---|---|---|
| `kashmiri_char_tokenizer/` | Character-Level | ~120 | ASR, OCR, char-level LM |
| `kashmiri_word_tokenizer/` | Word-Level | ~120K | Lookup, bag-of-words |
| `kashmiri_wordpiece_tokenizer/` | WordPiece (BERT) | 32K | NER, POS, classification |
| `kashmiri_bpe_tokenizer/` | BPE (GPT) | 32K | Text generation, MT |
| `kashmiri_unigram_tokenizer/` | Unigram (SentencePiece) | 32K | Multilingual, NMT |
## Companion Standalone Tokenizers

All five KashTok tokenizers are also available as standalone repositories for direct comparison:
- Kashmiri_Char_Tokenizer
- Kashmiri_Word_Tokenizer
- Kashmiri_WordPiece_Tokenizer
- Kashmiri_BPE_Tokenizer
- Kashmiri_Unigram_Tokenizer
## Loading the KashTok Kashmiri Tokenizers
All five tokenizers from the KashTok study are available on the Hugging Face Hub as standard AutoTokenizer-compatible repositories. Each can be loaded in a single line and used immediately with the transformers library.
### Installation

```shell
pip install transformers torch
```

For the Unigram (SentencePiece) tokenizer you may also need:

```shell
pip install sentencepiece
```
### WordPiece — Recommended for BERT-style Encoders

Best DPS (0.9997). Use for BERT pre-training, NER, POS tagging, and classification.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_WordPiece_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
```

Expected output:

```python
['کٲشِر', 'زَبان', 'چھِ', '##یہٕ', 'خٲص', 'زَبان']
```
### Unigram — Recommended for Multilingual and Morphology-Aware Tasks

Best CQS (0.8848) and best MBA (0.3867). Use for machine translation, multilingual transfer, and morphology-sensitive tasks.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Unigram_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
```

Expected output:

```python
['▁کٲشِر', '▁زَبان', '▁چھِیہٕ', '▁خٲص', '▁زَبان']
```
### BPE — Recommended for GPT-style Generation

Use for autoregressive generation and machine translation.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_BPE_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
```

Expected output:

```python
['کٲشِر</w>', 'زَبان</w>', 'چھِ', 'یہٕ</w>', 'خٲ', 'ص</w>', 'زَبان</w>']
```
### Character — Recommended for ASR/OCR and Character-Level Models

Zero out-of-vocabulary by construction, with full coverage of the Kashmiri character inventory. Use for ASR/OCR post-correction, error analysis, and pure character-level models.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Char_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
```

Expected output (each Unicode codepoint becomes one token):

```python
['ک', 'ٲ', 'ش', 'ِ', 'ر', ' ', 'ز', 'َ', 'ب', 'ا', 'ن', ...]
```
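Because the vocabulary is the codepoint inventory itself, the character tokenizer cannot produce OOVs. The core behavior (minus special tokens and ID mapping, which the hub tokenizer handles) can be approximated in plain Python, as a rough sketch:

```python
# Codepoint-level tokenization sketch: one Unicode codepoint per token.
# This only illustrates why OOV is impossible by construction; the real
# tokenizer additionally maps tokens to IDs and inserts special tokens.
def char_tokenize(text: str) -> list[str]:
    return list(text)

# Base letters and diacritics are separate codepoints, so this word
# yields 5 tokens: ک, ٲ, ش, the kasra diacritic, ر
print(char_tokenize("کٲشِر"))
```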
### Word — Whole-Word Lookup Baseline

Not recommended for production: 5.73% test-set OOV. Provided as a baseline for comparison only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)
```

Expected output (note the `[UNK]` tokens for out-of-vocabulary forms):

```python
['کٲشِر', 'زَبان', '[UNK]', '[UNK]', 'زَبان']
```
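The OOV behavior above can be quantified directly from a token list. This small helper is not part of the released suite, just an illustration of how to measure the `[UNK]` rate on your own data:

```python
def unk_rate(tokens: list[str], unk_token: str = "[UNK]") -> float:
    """Fraction of tokens mapped to the unknown symbol."""
    if not tokens:
        return 0.0
    return tokens.count(unk_token) / len(tokens)

# For the example output above: 2 of 5 tokens are [UNK]
print(unk_rate(["کٲشِر", "زَبان", "[UNK]", "[UNK]", "زَبان"]))  # 0.4
```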
### Loading All Five at Once

For comparative work or evaluation pipelines:

```python
from transformers import AutoTokenizer

tokenizers = {
    "Character": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Char_Tokenizer"),
    "Word": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer"),
    "WordPiece": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_WordPiece_Tokenizer"),
    "BPE": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_BPE_Tokenizer"),
    "Unigram": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Unigram_Tokenizer"),
}

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
for name, tok in tokenizers.items():
    print(f"{name:<10}: {tok.tokenize(text)}")
```
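A common comparative statistic in such pipelines is fertility, the average number of tokens per whitespace-delimited word. It is not one of the KashTok metrics, but it only needs a tokenize callable, so it works with any of the five tokenizers:

```python
from typing import Callable

def fertility(tokenize: Callable[[str], list[str]], text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    return len(tokenize(text)) / max(len(words), 1)

# With the tokenizers dict above (values depend on the models):
# for name, tok in tokenizers.items():
#     print(f"{name:<10}: fertility = {fertility(tok.tokenize, text):.2f}")

# Self-contained check with a plain whitespace tokenizer:
print(fertility(str.split, "کٲشِر زَبان"))  # 1.0
```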
### Quick Reference Table

| Tokenizer | Repository | Vocab | Best for |
|---|---|---|---|
| Character | `Omarrran/Kashmiri_Char_Tokenizer` | 133 | ASR/OCR, character-level models |
| Word | `Omarrran/Kashmiri_Word_Tokenizer` | 50,000 | Lookup baselines only |
| WordPiece | `Omarrran/Kashmiri_WordPiece_Tokenizer` | 16,000 | BERT-style encoders, NER |
| BPE | `Omarrran/Kashmiri_BPE_Tokenizer` | 16,000 | GPT-style generation, MT |
| Unigram | `Omarrran/Kashmiri_Unigram_Tokenizer` | 16,000 | Multilingual, morphology-aware |
## Novel Metrics Introduced

- **Diacritic Preservation Score (DPS)** — OOV-penalized; measures diacritic attachment
- **Morphological Boundary Alignment (MBA)** — IoU against gold morpheme boundaries
- **Composite Quality Score (CQS)** — weighted ranking across all metrics
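The exact metric definitions live in the paper. As a rough sketch of the MBA idea only, boundaries can be represented as sets of character offsets and scored by intersection-over-union; the offset encoding here is an assumption for illustration, not the paper's formulation:

```python
def boundary_iou(pred: set[int], gold: set[int]) -> float:
    """IoU between predicted and gold boundary positions (character offsets).

    Illustrative sketch only; the paper's MBA definition may differ in
    how boundaries are encoded and how edge cases are handled.
    """
    union = pred | gold
    if not union:
        return 1.0  # both segmentations are trivial: treat as perfect agreement
    return len(pred & gold) / len(union)

# Gold morpheme boundaries at offsets {3, 6}; the tokenizer split at {3, 8}:
# one shared boundary out of three distinct ones.
print(boundary_iou({3, 8}, {3, 6}))  # ≈ 0.333
```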
## Citation

```bibtex
@article{malik2026kashtok,
  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
            Diacritic- and Morphology-Aware Metrics},
  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
  year   = {2026}
}
```
Part of the **Kashmiri NLP Infrastructure Project**
- This tokenizer suite