---
license: apache-2.0
language:
  - ne
metrics:
  - wer
base_model:
  - openai/whisper-small
tags:
  - automatic-speech-recognition
  - whisper
  - openslr
  - generated_from_trainer
  - speech
model-index:
  - name: Whisper Small Nepali (OpenSLR)
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: OpenSLR 54 (Nepali Speech Corpus)
          type: openslr
        metrics:
          - type: wer
            value: 26.69
            name: Wer
---

Whisper Small Fine-tuned on Nepali (OpenSLR 54)

This model is a fine-tuned version of openai/whisper-small on the OpenSLR 54 (Nepali Speech Corpus) dataset. Trained on roughly 154 hours of Nepali read speech, it achieves a word error rate (WER) of 26.69% on the held-out test split.

Model Details

Model Description

  • Model architecture: Whisper Small (244M Parameters)
  • Language: Nepali (ne)
  • Task: Automatic Speech Recognition (Transcription)
  • Dataset: OpenSLR 54 (~157,000 utterances)
  • Fine-tuning Hardware: NVIDIA A100 80GB

Usage

from transformers import pipeline

# Load the fine-tuned checkpoint; forcing the language and task prevents
# Whisper from auto-detecting the language on short or noisy clips.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="fnawaraj/whisper-small-nepali-openslr",
    generate_kwargs={"language": "nepali", "task": "transcribe"},
)

# Transcribe an audio file (any format ffmpeg can decode)
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])

Training Data

The model was trained on the OpenSLR 54 (Nepali Speech Corpus).

  • Total Audio Duration: ~154 hours
  • Total Utterances: 157,905
  • Sampling Rate: 16 kHz
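As a quick sanity check on the corpus statistics above, ~154 hours spread over 157,905 utterances implies an average clip length of about 3.5 seconds, which is typical for a read-speech corpus of short prompts:

```python
# Average utterance length implied by the corpus statistics above.
total_seconds = 154 * 3600        # ~154 hours of audio
num_utterances = 157_905
avg_clip_seconds = total_seconds / num_utterances
print(f"Average utterance length: {avg_clip_seconds:.2f} s")  # ≈ 3.51 s
```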

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 1e-05
  • Train Batch Size: 8
  • Eval Batch Size: 8
  • Gradient Accumulation Steps: 4 (effective batch size: 32)
  • Optimizer: AdamW
  • LR Scheduler: Linear decay with 500 warmup steps
  • Training Steps: 10,000
  • Mixed Precision: FP16
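A sketch of how these hyperparameters map onto a training configuration. The key names below follow the shape of transformers' Seq2SeqTrainingArguments, but this is a plain dict for illustration, not the exact configuration used for this run:

```python
# Hyperparameters from this card, expressed as a config dict
# (key names roughly follow transformers' Seq2SeqTrainingArguments).
training_config = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 500,
    "max_steps": 10_000,
    "lr_scheduler_type": "linear",
    "fp16": True,
}

# Effective batch size = per-device batch size x accumulation steps.
effective_batch = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
print(effective_batch)  # 32
```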

Evaluation Results

The model was evaluated on the unseen test split of the OpenSLR dataset (1,580 samples).

  • Word Error Rate (WER): 26.69%
  • Validation Loss: 0.210
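For reference, WER is the word-level Levenshtein distance between the reference and the hypothesis, divided by the number of reference words. A minimal self-contained implementation (libraries such as `evaluate` or `jiwer` compute the same metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words:
print(wer("मेरो नाम राम हो", "मेरो नाम श्याम हो"))  # 0.25
```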

Limitations

The model performs best on read speech (high quality).

It may struggle with extremely fast conversational speech or heavy background noise compared to models trained on diverse noise datasets.

Spelling variations may occur for phonetically identical forms (e.g., short vs. long vowels, which sound the same in spoken Nepali).