---
inference: false
language:
- bg
license: mit
datasets:
- oscar
- chitanka
- wikipedia
tags:
- torch
---

# GPT-2

A model pretrained on Bulgarian text using a causal language modeling (CLM) objective. The GPT-2 architecture was introduced in
[this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
and first released at [this page](https://openai.com/blog/better-language-models/).
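
For concreteness, the CLM objective trains the model to predict each token from the tokens before it. The sketch below is illustrative only; it uses the generic `gpt2` checkpoint as a stand-in rather than this model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM illustrates the objective; "gpt2" is a stand-in here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
# With labels=input_ids, the model returns the next-token cross-entropy
# loss, i.e. exactly the pretraining (CLM) objective.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))
```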

## Model description

This is the **SMALL** version.

The training data is Bulgarian text from [OSCAR](https://oscar-corpus.com/post/oscar-2019/), [Chitanka](https://chitanka.info/) and [Wikipedia](https://bg.wikipedia.org/).

## Intended uses & limitations

You can use the raw model for:
- text generation
- auto-complete
- spelling correction

Or fine-tune it on a downstream task; a minimal fine-tuning sketch follows.
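
This sketch is illustrative only: `train_texts` is a hypothetical placeholder for your own Bulgarian task data, and it assumes the repository's custom model class accepts `labels` the way standard `transformers` causal LMs do (verify against the remote code before relying on it):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "rmihaylov/gpt2-small-bg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Placeholder corpus: replace with your own Bulgarian task data.
train_texts = ["Примерен текст за дообучение."]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    for text in train_texts:
        batch = tokenizer(text, return_tensors="pt")
        # Standard CLM fine-tuning: the input ids double as labels, so the
        # model learns to predict each next token of the new data.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```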

### How to use

Here is how to use this model in PyTorch:

```python
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> # The repository ships custom model code, hence trust_remote_code=True.
>>> model_id = "rmihaylov/gpt2-small-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
...     "Здравей,",
...     add_special_tokens=False,
...     return_tensors='pt')
>>>
>>> # Nucleus sampling: top_k=0 disables top-k filtering, so only top_p applies.
>>> output_ids = model.generate(
...     input_ids,
...     do_sample=True,
...     max_length=50,
...     top_p=0.92,
...     pad_token_id=2,
...     top_k=0)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> # Strip the tokenizer's special tokens and the SentencePiece word marker.
>>> output = output.replace('<|endoftext|>', '\n\n\n')
>>> output = output.replace('<|unknown|>', '')
>>> output = output.replace('▁', ' ')
>>> output = output.replace('<|n|>', '\n')
>>>
>>> print(output)

Здравей, Ани! Не е ли прекрасно?
Нещото се засмя. Зъбите му блеснаха.
— Ще те разведа насам-натам!
Ани се замисли, когато той си тръгна. Може би не искаше да го е
```

### Limitations and bias

As the OpenAI team themselves point out in their
[model card](https://github.com/openai/gpt-2/blob/master/model_card.md#out-of-scope-use-cases):

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases
> that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do
> not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a
> study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race,
> and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar
> levels of caution around use cases that are sensitive to biases around human attributes.