---
license: apache-2.0
language:
- ne
metrics:
- wer
base_model:
- openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- openslr
- generated_from_trainer
- speech
model-index:
- name: Whisper Small Nepali (OpenSLR)
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: OpenSLR 54 (Nepali Speech Corpus)
      type: openslr
    metrics:
    - type: wer
      value: 26.69
      name: Wer
---
# Whisper Small Fine-tuned on Nepali (OpenSLR 54)
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the **OpenSLR 54 (Nepali Speech Corpus)** dataset. It was trained on ~154 hours of Nepali read speech and reaches a word error rate (WER) of 26.69% on the held-out test split.
## Model Details
### Model Description
- **Model architecture:** Whisper Small (244M Parameters)
- **Language:** Nepali (ne)
- **Task:** Automatic Speech Recognition (Transcription)
- **Dataset:** OpenSLR 54 (~157,000 utterances)
- **Fine-tuning Hardware:** NVIDIA A100 80GB
## Usage
```python
from transformers import pipeline

# Load the fine-tuned model as an ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="fnawaraj/whisper-small-nepali-openslr",
)

# Transcribe an audio file (the pipeline resamples the input to 16 kHz as needed).
# For audio longer than 30 seconds, pass chunk_length_s=30 to enable chunked inference.
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])
```
## Training Data
The model was trained on the OpenSLR 54 (Nepali Speech Corpus).
- **Total audio duration:** ~154 hours
- **Total utterances:** 157,905
- **Sampling rate:** 16 kHz
### Training Procedure
#### Training Hyperparameters
The following hyperparameters were used during training:
- **Learning rate:** 1e-05
- **Train batch size:** 8
- **Eval batch size:** 8
- **Gradient accumulation steps:** 4 (effective batch size: 32)
- **Optimizer:** AdamW
- **LR scheduler:** Linear decay with warmup (500 steps)
- **Training steps:** 10,000
- **Mixed precision:** FP16
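The hyperparameters above can be summarized in code; this is an illustrative plain-Python sketch of the configuration (the actual run used the Hugging Face `Seq2SeqTrainer`, whose exact arguments are not reproduced here), showing how the effective batch size follows from per-device batch size and gradient accumulation:

```python
# Training hyperparameters from the run above, as a plain dict (illustrative only)
config = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 500,
    "max_steps": 10_000,
    "fp16": True,
}

# Effective batch size = per-device batch size * gradient accumulation steps
effective_batch = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # 32
```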
## Evaluation Results
The model was evaluated on the unseen test split of the OpenSLR dataset (1,580 samples).
| Metric                | Score  |
|-----------------------|--------|
| Word Error Rate (WER) | 26.69% |
| Validation loss       | 0.210  |
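For reference, WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal pure-Python sketch of the metric (not the evaluation script used for this model, which typically relies on a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table of edit distances between prefixes of ref and hyp
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Works on any whitespace-tokenised text, including Devanagari
print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion over 4 words -> 0.5
```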
## Limitations
- The model performs best on clean, read speech, matching the recording conditions of OpenSLR 54.
- It may struggle with fast conversational speech or heavy background noise compared to models trained on more acoustically diverse data.
- Spelling variation may occur for phonetically identical forms (e.g., short vs. long vowels), since these sound the same in spoken Nepali.