🏔️ Kashmiri Tokenizer Suite

First systematic tokenization infrastructure for Kashmiri (ISO 639-3: kas). Five tokenizer architectures trained on the 3.1M-word KS-LIT-3M corpus.

Folder                          Architecture             Vocab    Best for
kashmiri_char_tokenizer/        Character-Level          ~120     ASR, OCR, char-level LM
kashmiri_word_tokenizer/        Word-Level               ~120K    Lookup, bag-of-words
kashmiri_wordpiece_tokenizer/   WordPiece (BERT)         32K      NER, POS, classification
kashmiri_bpe_tokenizer/         BPE (GPT)                32K      Text generation, MT
kashmiri_unigram_tokenizer/     Unigram (SentencePiece)  32K      Multilingual, NMT

Companion Standalone Tokenizers

The other four KashTok tokenizers are published as standalone repositories so results can be compared directly; the Quick Reference Table below lists all five Hub IDs.

Loading the KashTok Kashmiri Tokenizers

All five tokenizers from the KashTok study are available on the Hugging Face Hub as standard AutoTokenizer-compatible repositories. Each can be loaded in a single line and used immediately with the transformers library.

Installation

pip install transformers torch

For Unigram (SentencePiece) you may also need:

pip install sentencepiece

WordPiece — Recommended for BERT-style Encoders

Best DPS (0.9997). Use for BERT pre-training, NER, POS tagging, and classification.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_WordPiece_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)

Expected output:

['کٲشِر', 'زَبان', 'چھِ', '##یہٕ', 'خٲص', 'زَبان']
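The "##" prefix marks a word-internal continuation. A minimal illustrative sketch of how those pieces are glued back into words (in practice, the tokenizer's own convert_tokens_to_string does this for you):

```python
def merge_wordpieces(tokens):
    """Undo WordPiece's '##' continuation marker, rejoining pieces into words."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue continuation onto the previous piece
        else:
            words.append(tok)
    return words

pieces = ['کٲشِر', 'زَبان', 'چھِ', '##یہٕ', 'خٲص', 'زَبان']
print(merge_wordpieces(pieces))
# → ['کٲشِر', 'زَبان', 'چھِیہٕ', 'خٲص', 'زَبان']
```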

Unigram — Recommended for Multilingual and Morphology-Aware Tasks

Best CQS (0.8848) and best MBA (0.3867). Use for machine translation, multilingual transfer, and morphology-sensitive tasks.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Unigram_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)

Expected output:

['▁کٲشِر', '▁زَبان', '▁چھِیہٕ', '▁خٲص', '▁زَبان']
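The "▁" (U+2581) prefix marks a word start in SentencePiece's convention. A minimal sketch, independent of the repo, of how detokenization undoes it:

```python
def sp_detokenize(tokens):
    """Reverse the SentencePiece '▁' word-start marker into plain spaces."""
    return "".join(tokens).replace("\u2581", " ").strip()

pieces = ['▁کٲشِر', '▁زَبان', '▁چھِیہٕ', '▁خٲص', '▁زَبان']
print(sp_detokenize(pieces) == "کٲشِر زَبان چھِیہٕ خٲص زَبان")  # → True
```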

BPE — Recommended for GPT-style Generation

Use for autoregressive generation and machine translation.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_BPE_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)

Expected output:

['کٲشِر</w>', 'زَبان</w>', 'چھِ', 'یہٕ</w>', 'خٲ', 'ص</w>', 'زَبان</w>']
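Here "</w>" marks a word-final piece. A small illustrative sketch of how that marker lets the original text be restored without any whitespace tokens:

```python
def bpe_detokenize(tokens):
    """Concatenate BPE pieces, turning the '</w>' end-of-word marker into spaces."""
    text = "".join(tokens)  # e.g. 'کٲشِر</w>زَبان</w>...'
    return text.replace("</w>", " ").strip()

pieces = ['کٲشِر</w>', 'زَبان</w>', 'چھِ', 'یہٕ</w>', 'خٲ', 'ص</w>', 'زَبان</w>']
print(bpe_detokenize(pieces) == "کٲشِر زَبان چھِیہٕ خٲص زَبان")  # → True
```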

Character — Recommended for ASR/OCR and Character-Level Models

Zero out-of-vocabulary by construction. Full Kashmiri inventory coverage. Use for ASR/OCR post-correction, error analysis, and pure character-level models.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Char_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)

Expected output (each Unicode codepoint becomes one token):

['ک', 'ٲ', 'ش', 'ِ', 'ر', ' ', 'ز', 'َ', 'ب', 'ا', 'ن', ...]
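Character tokenization reduces to iterating over Unicode codepoints, which is why OOV is zero by construction. A trivial sketch of that property:

```python
def char_tokenize(text):
    return list(text)  # one token per Unicode codepoint

text = "کٲشِر زَبان"
tokens = char_tokenize(text)
assert "".join(tokens) == text  # lossless round-trip by construction
print(len(tokens), "tokens")
```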

Word — Whole-Word Lookup Baseline

Not recommended for production. 5.73% test-set OOV. Provided as a baseline for comparison only.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer")

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
encoding = tokenizer(text, return_tensors="pt")
print(tokenizer.tokenize(text))
print(encoding.input_ids)

Expected output (note the [UNK] tokens for out-of-vocabulary forms):

['کٲشِر', 'زَبان', '[UNK]', '[UNK]', 'زَبان']
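To see why whole-word lookup produces those [UNK] tokens, a toy sketch with a tiny hypothetical vocabulary (not the real vocabulary) reproduces the behaviour and computes the OOV rate:

```python
# Hypothetical two-word vocabulary, for illustration only.
vocab = {"کٲشِر", "زَبان"}

def word_tokenize(words, vocab):
    """Map each whole word to itself if known, else to [UNK]."""
    return [w if w in vocab else "[UNK]" for w in words]

words = "کٲشِر زَبان چھِیہٕ خٲص زَبان".split()
tokens = word_tokenize(words, vocab)
oov_rate = tokens.count("[UNK]") / len(tokens)
print(tokens)
print(f"OOV rate: {oov_rate:.0%}")  # → OOV rate: 40%
```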

Loading All Five at Once

For comparative work or evaluation pipelines:

from transformers import AutoTokenizer

tokenizers = {
    "Character": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Char_Tokenizer"),
    "Word":      AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Word_Tokenizer"),
    "WordPiece": AutoTokenizer.from_pretrained("Omarrran/Kashmiri_WordPiece_Tokenizer"),
    "BPE":       AutoTokenizer.from_pretrained("Omarrran/Kashmiri_BPE_Tokenizer"),
    "Unigram":   AutoTokenizer.from_pretrained("Omarrran/Kashmiri_Unigram_Tokenizer"),
}

text = "کٲشِر زَبان چھِیہٕ خٲص زَبان"
for name, tok in tokenizers.items():
    print(f"{name:<10}: {tok.tokenize(text)}")
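One simple comparison such a pipeline can produce is fertility (tokens per whitespace word), a common granularity measure. The sketch below works offline from the expected outputs quoted on this page, not from recomputed tokenizations:

```python
# Token lists copied from the expected outputs above (not recomputed).
outputs = {
    "WordPiece": ['کٲشِر', 'زَبان', 'چھِ', '##یہٕ', 'خٲص', 'زَبان'],
    "Unigram":   ['▁کٲشِر', '▁زَبان', '▁چھِیہٕ', '▁خٲص', '▁زَبان'],
    "BPE":       ['کٲشِر</w>', 'زَبان</w>', 'چھِ', 'یہٕ</w>', 'خٲ', 'ص</w>', 'زَبان</w>'],
}
n_words = 5  # whitespace words in the sample sentence
for name, toks in outputs.items():
    print(f"{name:<10}: fertility = {len(toks) / n_words:.2f}")
```

Lower fertility means coarser segmentation; Unigram's 1.00 here reflects the whole-word segmentation shown above.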

Quick Reference Table

Tokenizer   Repository                              Vocab    Best for
Character   Omarrran/Kashmiri_Char_Tokenizer        133      ASR/OCR, character-level models
Word        Omarrran/Kashmiri_Word_Tokenizer        50,000   Lookup baselines only
WordPiece   Omarrran/Kashmiri_WordPiece_Tokenizer   16,000   BERT-style encoders, NER
BPE         Omarrran/Kashmiri_BPE_Tokenizer         16,000   GPT-style generation, MT
Unigram     Omarrran/Kashmiri_Unigram_Tokenizer     16,000   Multilingual, morphology-aware

Novel Metrics Introduced

  • Diacritic Preservation Score (DPS) — OOV-penalized; measures diacritic attachment
  • Morphological Boundary Alignment (MBA) — IoU vs gold morpheme boundaries
  • Composite Quality Score (CQS) — weighted ranking across all metrics
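The MBA metric above can be sketched as a set IoU over split positions. The formulation below is an assumption for illustration (hypothetical boundary sets), not the paper's exact implementation:

```python
def boundary_iou(pred, gold):
    """IoU between predicted and gold sets of intra-word split positions."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both unsegmented: perfect agreement
    return len(pred & gold) / len(pred | gold)

# Hypothetical character offsets where a word is split:
gold_boundaries = {3, 6}  # gold morpheme boundaries
pred_boundaries = {3, 5}  # tokenizer-induced boundaries
print(boundary_iou(pred_boundaries, gold_boundaries))  # one shared boundary of three distinct → 1/3
```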

Citation

@article{malik2026kashtok,
  title  = {KashTok: Tokenizing Kashmiri at Scale with Novel
            Diacritic- and Morphology-Aware Metrics},
  author = {Malik, Haq Nawaz and Nissar, Nahfid and others},
  year   = {2026}
}

Part of the Kashmiri NLP Infrastructure Project

