Description
This tokenizer is of type Unigram and supports both English and Vietnamese.
Along with tokenization, it also performs diacritics normalization for Vietnamese, for example: hóa → hoá, hủy → huỷ.
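The effect of this normalization can be observed directly through the tokenizer. Below is a minimal sketch; it loads the tokenizer the same way as in the Usage section further down, and the exact subword splits depend on the learned vocabulary.
from transformers import AutoTokenizer

# Load the slow tokenizer (see Usage below for why use_fast=False is required).
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)

# The output pieces reflect the normalized spellings "hoá" and "huỷ"
# rather than the input forms "hóa" and "hủy".
print(tokenizer.tokenize("hóa"), tokenizer.tokenize("hủy"))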
Details
Library used for training
https://github.com/google/sentencepiece
Training Data
https://huggingface.co/datasets/levuloihust/vien-corpus-for-tokenizer
Training script
./spm_train \
--input=vien-corpus.txt \
--model_prefix=vien \
--vocab_size=64000 \
--user_defined_symbols_file=user_defined_symbols.txt \
--required_chars_file=required_chars.txt \
--unk_surface="<unk>" \
--byte_fallback=false \
--split_by_unicode_script=true \
--split_by_number=true \
--split_digits=true \
--normalization_rule_tsv=nmt_nfkc_vidiacritic.tsv
spm_train is the executable built by following the installation guide at https://github.com/google/sentencepiece. The other files (user_defined_symbols.txt, required_chars.txt, and nmt_nfkc_vidiacritic.tsv) are provided in this repo.
The training script should be run on a machine with 64GB of RAM. After training, it produces two files: vien.model and vien.vocab.
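Before converting to a HuggingFace tokenizer, the trained model can be sanity-checked with the sentencepiece Python package. This is an optional sketch, assuming sentencepiece is installed (pip install sentencepiece) and vien.model is in the current directory.
import sentencepiece as spm

# Load the trained Unigram model produced by spm_train.
sp = spm.SentencePieceProcessor(model_file="vien.model")

# The piece count should match --vocab_size=64000 from the training command.
print(sp.get_piece_size())

# Tokenize a sample sentence into subword pieces.
print(sp.encode_as_pieces("Thời tiết hôm nay đẹp"))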
Convert SPM model to HuggingFace tokenizer
Run the following Python script to convert the SPM model to a HuggingFace tokenizer.
from transformers import DebertaV2Tokenizer
tokenizer = DebertaV2Tokenizer(
    vocab_file="assets/spm/vien.model",
    do_lower_case=False,
    split_by_punct=False,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="<sep>",
    pad_token="<pad>",
    cls_token="<cls>",
    mask_token="<mask>",
)
tokenizer.save_pretrained("assets/hf-tokenizer")
Replace assets/spm/vien.model and assets/hf-tokenizer with the correct paths on your local machine.
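To verify the conversion, the saved tokenizer can be reloaded from disk. A minimal check, assuming the output directory used above (assets/hf-tokenizer):
from transformers import DebertaV2Tokenizer

# Reload the converted (slow) tokenizer from the directory written by save_pretrained.
tokenizer = DebertaV2Tokenizer.from_pretrained("assets/hf-tokenizer")

# Tokenize a sample sentence into subword pieces.
print(tokenizer.tokenize("Thời tiết hôm nay đẹp"))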
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("levuloihust/vien-unigram-tokenizer", use_fast=False)
tokens = tokenizer.tokenize("How are you? Thời tiết hôm nay đẹp wóa trời lun =))")
print(tokens)
# ['▁How', '▁are', '▁you', '?', '▁Thời', '▁tiết', '▁hôm', '▁nay', '▁đẹp', '▁wo', 'á', '▁trời', '▁lun', '▁=))']
Note that you must set use_fast=False for the tokenizer to function properly. With use_fast=True (the default), the tokenizer cannot perform normalization (note that in the usage example above, wóa was normalized to woá).
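Beyond tokenize, the tokenizer can also be called directly to produce model inputs and decode them back. A short sketch continuing from the Usage snippet above (the exact ids depend on the vocabulary):
# Encode a sentence into input ids, with the configured special tokens added.
encoded = tokenizer("Thời tiết hôm nay đẹp wóa trời lun")
print(encoded["input_ids"])

# Decoding reconstructs the normalized text: "wóa" comes back as "woá".
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))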
Contact information
For personal communication related to this project, please contact Loi Le Vu ([email protected]).