---
license: apache-2.0
language:
- ne
metrics:
- wer
base_model:
- openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- openslr
- generated_from_trainer
- speech
model-index:
- name: Whisper Small Nepali (OpenSLR)
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: OpenSLR 54 (Nepali Speech Corpus)
      type: openslr
    metrics:
    - type: wer
      value: 26.69
      name: Wer
---

# Whisper Small Fine-tuned on Nepali (OpenSLR 54)

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the **OpenSLR 54 (Nepali Speech Corpus)** dataset. Trained on roughly 154 hours of read Nepali speech, it reaches a word error rate (WER) of 26.69% on the held-out test split.

## Model Details

### Model Description

- **Model architecture:** Whisper Small (244M parameters)
- **Language:** Nepali (ne)
- **Task:** Automatic Speech Recognition (transcription)
- **Dataset:** OpenSLR 54 (~157,000 utterances)
- **Fine-tuning hardware:** NVIDIA A100 80GB

## Usage

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="fnawaraj/whisper-small-nepali-openslr")

# Transcribe an audio file
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])
```
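
Whisper operates on 30-second windows, so longer recordings need chunking. A sketch of long-form use, assuming a recent Transformers version (`chunk_length_s` and `generate_kwargs` are generic pipeline options, not specific to this model, and the file path is a placeholder):

```python
from transformers import pipeline

# chunk_length_s lets the pipeline transcribe audio longer than
# Whisper's native 30-second window by splitting it into chunks.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="fnawaraj/whisper-small-nepali-openslr",
    chunk_length_s=30,
)

# Forcing language/task avoids language misdetection on short or noisy clips.
transcription = transcriber(
    "path_to_long_nepali_audio.mp3",  # placeholder path
    generate_kwargs={"language": "nepali", "task": "transcribe"},
)
print(transcription["text"])
```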

## Training Data

The model was trained on the OpenSLR 54 (Nepali Speech Corpus).

- Total audio duration: ~154 hours
- Total utterances: 157,905
- Sampling rate: 16 kHz

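The 16 kHz requirement is usually handled when the data is loaded. A minimal sketch, assuming the community `openslr` loading script with its SLR54 config (OpenSLR 54 itself is distributed as raw archives, so in practice the dataset may need to be assembled from the downloaded files):

```python
from datasets import Audio, load_dataset

# Assumption: the community "openslr" loading script exposes SLR54 (Nepali).
ds = load_dataset("openslr", "SLR54", split="train")

# Whisper's feature extractor expects 16 kHz audio; casting the column
# resamples each clip lazily when it is accessed.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```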

### Training Procedure

#### Training Hyperparameters

The following hyperparameters were used during training:

- Learning rate: 1e-05
- Train batch size: 8
- Eval batch size: 8
- Gradient accumulation steps: 4 (effective batch size: 32)
- Optimizer: AdamW
- LR scheduler: linear decay with 500 warmup steps
- Training steps: 10,000
- Mixed precision: FP16

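For context, these settings map naturally onto the 🤗 `Seq2SeqTrainingArguments` API. A minimal sketch under that assumption (AdamW is the Trainer default optimizer; `output_dir` and `predict_with_generate` are illustrative, not reported details of this run):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-nepali-openslr",  # illustrative
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size 32
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=10_000,
    fp16=True,
    predict_with_generate=True,  # generate during eval so WER can be computed
)
```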
## Evaluation Results

The model was evaluated on the unseen test split of the OpenSLR dataset (1,580 samples).

| Metric | Score |
|---|---|
| Word Error Rate (WER) | 26.69% |
| Validation Loss | 0.210 |

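A WER score like the one above is typically computed with the `evaluate` library; a minimal sketch with placeholder transcripts:

```python
import evaluate

# Word error rate: word-level edit distance divided by reference word count.
wer_metric = evaluate.load("wer")

predictions = ["placeholder model transcript"]     # decoded model outputs
references = ["placeholder reference transcript"]  # ground-truth transcripts
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```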
## Limitations

- The model performs best on high-quality read speech, matching the training data.
- It may struggle with very fast conversational speech or heavy background noise compared to models trained on more acoustically diverse data.
- Spelling variants that sound identical in spoken Nepali (e.g., short vs. long vowels) may be transcribed inconsistently.