Whisper-Small Hindi ASR (LoRA Fine-tuned)
This is a fine-tuned version of openai/whisper-small for Hindi automatic speech recognition, using LoRA (Low-Rank Adaptation) for parameter-efficient training.
Model Details
Model Description
This model adapts OpenAI's Whisper-small to Hindi speech recognition. It was trained with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that trains only a small set of adapter parameters while keeping the base model frozen.
- Developed by: Swayam Singal
- Model type: Speech Recognition (ASR) with LoRA adapters
- Language: Hindi
- License: MIT
- Finetuned from model: openai/whisper-small
Model Sources
- Repository: TaskJosh on Hugging Face
- Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Uses
Direct Use
This model can be used for:
- Transcribing Hindi audio to text
- Speech recognition applications in Hindi
- Building voice assistants or transcription services for Hindi speakers
Downstream Use
This model can be fine-tuned further for specific domains or use cases requiring Hindi speech recognition.
Out-of-Scope Use
This model is specifically trained for Hindi and may not perform well on other languages. It is intended for general Hindi speech recognition and may not be suitable for specialized domains without additional fine-tuning.
Bias, Risks, and Limitations
As with all speech recognition models, this model may have biases related to:
- Speaker demographics (age, gender, regional accents)
- Audio quality variations
- Domain-specific vocabulary
Recommendations
Users should be aware of these limitations and test the model on their specific use cases. For production use, additional evaluation and possibly domain-specific fine-tuning may be required.
How to Get Started with the Model
Using Transformers Library
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch
import librosa

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="hindi", task="transcribe")

# Load LoRA adapters on top of the frozen base model
model = PeftModel.from_pretrained(base_model, "swayam8264/whisper-small-hi-lora")

# Load audio and resample it to the 16 kHz rate Whisper expects
audio, sampling_rate = librosa.load("path/to/hindi/audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Generate transcription without tracking gradients
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode token IDs back to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
Training Details
Training Data
The model was trained on approximately 10 hours of Hindi audio data from various sources. The training dataset includes:
- Diverse Hindi speakers with different accents and dialects
- Various recording conditions and audio qualities
- Multiple domains and topics
Training Procedure
Preprocessing
- Audio files were resampled to 16kHz
- Text transcriptions were normalized
- Audio segments were filtered to 1-30 seconds in duration
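The duration filter and text normalization above can be sketched as follows. This is a minimal illustration, not the author's pipeline: the card does not say which normalization rules were applied, so `normalize_text` assumes Unicode NFC plus whitespace collapsing, and all function names are hypothetical.

```python
import re
import unicodedata

TARGET_SR = 16_000    # sampling rate stated in the card
MIN_SECONDS = 1.0     # lower duration bound stated in the card
MAX_SECONDS = 30.0    # upper duration bound stated in the card


def duration_seconds(num_samples: int, sampling_rate: int = TARGET_SR) -> float:
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sampling_rate


def keep_segment(num_samples: int, sampling_rate: int = TARGET_SR) -> bool:
    """True if the clip falls inside the 1-30 second window."""
    seconds = duration_seconds(num_samples, sampling_rate)
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def normalize_text(text: str) -> str:
    """Assumed normalization: Unicode NFC, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, a clip of 8,000 samples at 16 kHz lasts 0.5 seconds and would be dropped, while a 160,000-sample clip (10 seconds) would be kept.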
Training Hyperparameters
- Training regime: float32 (full precision)
- Batch size: 4 per device with gradient accumulation steps of 8
- Learning rate: 5e-6 with warmup
- Optimizer: AdamW
- LoRA configuration:
  - Rank (r): 8
  - Alpha: 32
  - Dropout: 0.1
  - Target modules: q_proj, v_proj
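Two derived quantities follow from the hyperparameters above. The sketch below assumes a single training device and the standard LoRA scaling of alpha divided by rank; the variable names are illustrative. In PEFT, the listed values would typically map to the `r`, `lora_alpha`, `lora_dropout`, and `target_modules` arguments of `LoraConfig`.

```python
# Values taken from the hyperparameter list above.
per_device_batch_size = 4
gradient_accumulation_steps = 8

# Examples contributing to each optimizer step (assuming one device).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

lora_rank = 8
lora_alpha = 32

# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_rank

print(effective_batch_size)  # 32
print(lora_scaling)          # 4.0
```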
Speeds, Sizes, Times
- Hardware Type: Apple Silicon (M1/M2) or CPU
- Training time: 2-8 hours depending on hardware
- Checkpoint size: ~3.4 MB (LoRA adapters only)
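The checkpoint size can be sanity-checked from the LoRA configuration. The sketch below assumes the whisper-small dimensions (hidden size 768, 12 encoder and 12 decoder layers), that `q_proj` and `v_proj` are adapted in every self-attention and cross-attention block, and that adapter weights are stored in float32. Each adapted projection adds two low-rank matrices of shapes 768 x 8 and 8 x 768.

```python
D_MODEL = 768               # whisper-small hidden size (assumed)
ENCODER_LAYERS = 12         # whisper-small encoder depth (assumed)
DECODER_LAYERS = 12         # whisper-small decoder depth (assumed)
RANK = 8                    # LoRA rank from the card
TARGETS_PER_ATTENTION = 2   # q_proj and v_proj
BYTES_PER_PARAM = 4         # float32 storage (assumed)

# Encoder layers have one attention block (self-attention);
# decoder layers have two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

# Each adapter holds A (RANK x D_MODEL) and B (D_MODEL x RANK).
params_per_module = RANK * (D_MODEL + D_MODEL)
total_params = adapted_modules * params_per_module
size_mib = total_params * BYTES_PER_PARAM / 1024**2

print(adapted_modules)  # 72
print(total_params)     # 884736
print(size_mib)         # 3.375
```

Under these assumptions the adapters hold about 0.88 million parameters, or roughly 3.4 MiB, which is in line with the reported checkpoint size.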
Evaluation
Testing Data
The model was evaluated on the FLEURS Hindi test set, which provides a standardized benchmark for multilingual speech recognition.
Metrics
- Word Error Rate (WER): Primary metric for evaluating ASR performance
- Lower WER indicates better performance
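WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The function below is a minimal single-utterance sketch; it is not the evaluation script used for this card, and it splits on whitespace without any text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # 0.5
```

In the example, one substitution and one deletion against four reference words give a WER of 0.5. Corpus-level WER is usually computed by summing edits and reference words over all utterances before dividing, rather than averaging per-utterance scores.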
Results
| Model | WER (%) | Absolute improvement (percentage points) |
|---|---|---|
| Baseline (Pretrained) | 74.14 | - |
| Fine-tuned (LoRA) | 68.12 | 6.02 |
Summary
The fine-tuned model reduces WER by 6.02 percentage points (absolute), from 74.14% to 68.12%, compared with the pretrained baseline on the FLEURS Hindi test set.
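Absolute and relative WER changes are easy to confuse, so the arithmetic is worth spelling out. The short sketch below uses only the two WER values from the results table; the variable names are illustrative.

```python
baseline_wer = 74.14   # pretrained whisper-small, WER in percent
finetuned_wer = 68.12  # LoRA fine-tuned model, WER in percent

# Absolute reduction, in percentage points.
absolute_reduction = baseline_wer - finetuned_wer

# Relative reduction, as a percentage of the baseline WER.
relative_reduction = absolute_reduction / baseline_wer * 100

print(round(absolute_reduction, 2))  # 6.02
print(round(relative_reduction, 2))  # 8.12
```

The fine-tuned model therefore removes about 8% of the baseline's word errors.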
Environmental Impact
- Hardware Type: Apple Silicon M-series or Intel Mac
- Hours used: 2-8 hours (depending on hardware)
- Carbon Emitted: Minimal (personal computing device usage)
Technical Specifications
Model Architecture and Objective
This model uses the Whisper architecture with LoRA adapters for efficient fine-tuning. The objective is to minimize the cross-entropy loss between predicted and actual token sequences.
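As a rough illustration of the objective, token-level cross-entropy is the mean negative log-probability the model assigns to each correct target token. The sketch below takes those probabilities as a plain list; it is a simplification for intuition (the function name is hypothetical), not the training code, which operates on logits over the full vocabulary.

```python
import math


def token_cross_entropy(target_probs):
    """Mean negative log-probability assigned to the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


# A model that gives each correct token probability 0.5 has a loss of ln(2).
loss = token_cross_entropy([0.5, 0.5, 0.5])
```

The loss approaches zero as the probabilities of the correct tokens approach 1.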
Compute Infrastructure
Hardware
- Apple Silicon (M1/M2/M3) with MPS acceleration
- Intel Mac CPU fallback
- Minimum 8GB RAM recommended
Software
- Python 3.8+
- PyTorch
- Hugging Face Transformers
- PEFT library
- Datasets library
Citation
BibTeX:
@article{radford2022robust,
title={Robust speech recognition via large-scale weak supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
APA: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Model Card Authors
Swayam Singal
Model Card Contact
For questions or issues, please contact swayam8264 on Hugging Face.
Framework versions
- PEFT 0.18.0
- Transformers 4.x
- PyTorch 1.x