---
license: apache-2.0
language:
- ne
metrics:
- wer
base_model:
- openai/whisper-small
tags:
- automatic-speech-recognition
- whisper
- openslr
- generated_from_trainer
- speech
model-index:
- name: Whisper Small Nepali (OpenSLR)
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: OpenSLR 54 (Nepali Speech Corpus)
      type: openslr
    metrics:
    - type: wer
      value: 26.69
      name: Wer
---
# Whisper Small Fine-tuned on Nepali (OpenSLR 54)
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the **OpenSLR 54 (Nepali Speech Corpus)** dataset. It was trained on ~154 hours of Nepali read speech and reaches a word error rate (WER) of 26.69% on the held-out test split.
## Model Details
### Model Description
- **Model architecture:** Whisper Small (244M Parameters)
- **Language:** Nepali (ne)
- **Task:** Automatic Speech Recognition (Transcription)
- **Dataset:** OpenSLR 54 (~157,000 utterances)
- **Fine-tuning Hardware:** NVIDIA A100 80GB
## Usage
```python
from transformers import pipeline

# Load the fine-tuned model as an ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="fnawaraj/whisper-small-nepali-openslr",
)

# Transcribe an audio file (the pipeline resamples the input to 16 kHz as needed).
# For audio longer than 30 seconds, pass chunk_length_s=30 to enable chunked inference.
transcription = transcriber("path_to_nepali_audio.mp3")
print(transcription["text"])
```
## Training Data
The model was trained on the OpenSLR 54 (Nepali Speech Corpus).
- **Total audio duration:** ~154 hours
- **Total utterances:** 157,905
- **Sampling rate:** 16 kHz
### Training Procedure
#### Training Hyperparameters
The following hyperparameters were used during training:
- **Learning rate:** 1e-05
- **Train batch size:** 8
- **Eval batch size:** 8
- **Gradient accumulation steps:** 4 (effective batch size: 32)
- **Optimizer:** AdamW
- **LR scheduler:** Linear decay with warmup (500 steps)
- **Training steps:** 10,000
- **Mixed precision:** FP16
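The hyperparameters above can be summarized in code; this is an illustrative plain-Python sketch of the configuration (the actual run used the Hugging Face `Seq2SeqTrainer`, whose exact arguments are not reproduced here), showing how the effective batch size follows from per-device batch size and gradient accumulation:

```python
# Training hyperparameters from the run above, as a plain dict (illustrative only)
config = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 500,
    "max_steps": 10_000,
    "fp16": True,
}

# Effective batch size = per-device batch size * gradient accumulation steps
effective_batch = (
    config["per_device_train_batch_size"] * config["gradient_accumulation_steps"]
)
print(effective_batch)  # 32
```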
## Evaluation Results
The model was evaluated on the unseen test split of the OpenSLR dataset (1,580 samples).
| Metric                | Score  |
|-----------------------|--------|
| Word Error Rate (WER) | 26.69% |
| Validation loss       | 0.210  |
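For reference, WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal pure-Python sketch of the metric (not the evaluation script used for this model, which typically relies on a library such as `jiwer` or `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table of edit distances between prefixes of ref and hyp
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Works on any whitespace-tokenised text, including Devanagari
print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion over 4 words -> 0.5
```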
## Limitations
- The model performs best on clean, read speech, matching the recording conditions of OpenSLR 54.
- It may struggle with fast conversational speech or heavy background noise compared to models trained on more acoustically diverse data.
- Spelling variation may occur for phonetically identical forms (e.g., short vs. long vowels), since these sound the same in spoken Nepali.