Whisper-Small Hindi ASR (LoRA Fine-tuned)

This is a fine-tuned version of openai/whisper-small for Hindi automatic speech recognition, using LoRA (Low-Rank Adaptation) for parameter-efficient training.

Model Details

Model Description

This model adapts OpenAI's Whisper-small to Hindi speech recognition. It was trained with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that trains only a small set of adapter parameters while keeping the base model frozen.
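For readers unfamiliar with LoRA, the update to a single adapted projection is usually written as below. The symbols are chosen here for illustration and do not come from this card: W_0 is the frozen pretrained weight, A and B are the trainable low-rank matrices, r is the rank, and alpha is the scaling constant.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},
\quad A \in \mathbb{R}^{r \times d_{\text{in}}},
\quad r \ll \min(d_{\text{in}}, d_{\text{out}})
```

Only A and B receive gradients. With the rank and alpha listed under Training Hyperparameters (r = 8, alpha = 32), the scaling factor alpha / r works out to 4.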

Developed by: Swayam Singal
Model type: Speech Recognition (ASR) with LoRA adapters
Language: Hindi
License: MIT
Finetuned from model: openai/whisper-small

Model Sources

Repository: TaskJosh on Hugging Face
Paper: Robust Speech Recognition via Large-Scale Weak Supervision

Uses

Direct Use

This model can be used for:

Transcribing Hindi audio to text
Speech recognition applications in Hindi
Building voice assistants or transcription services for Hindi speakers

Downstream Use

This model can be fine-tuned further for specific domains or use cases requiring Hindi speech recognition.

Out-of-Scope Use

This model is specifically trained for Hindi and may not perform well on other languages. It is intended for general Hindi speech recognition and may not be suitable for specialized domains without additional fine-tuning.

Bias, Risks, and Limitations

As with all speech recognition models, this model may have biases related to:

Speaker demographics (age, gender, regional accents)
Audio quality variations
Domain-specific vocabulary

Recommendations

Users should be aware of these limitations and test the model on their specific use cases. For production use, additional evaluation and possibly domain-specific fine-tuning may be required.

How to Get Started with the Model

Using Transformers Library

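from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch
import librosa

# Load the frozen base model and its processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="hindi", task="transcribe")

# Attach the LoRA adapter weights and switch to inference mode.
# The repo ID is as given in this card; the Hub page lists the model under
# "swayam8624/whisper-small-hi-lora", so verify the ID before use.
model = PeftModel.from_pretrained(base_model, "swayam8264/whisper-small-hi-lora")
model.eval()

# Load the audio as mono and resample it to the 16 kHz rate Whisper expects
audio, sampling_rate = librosa.load("path/to/hindi/audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

# Generate token IDs. Pass input_features by keyword (some PEFT wrappers accept
# keyword arguments only) and force Hindi transcription instead of relying on
# automatic language detection.
with torch.no_grad():
    predicted_ids = model.generate(input_features=input_features, language="hi", task="transcribe")

# Decode the token IDs back to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])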

Training Details

Training Data

The model was trained on approximately 10 hours of Hindi audio data from various sources. The training dataset includes:

Diverse Hindi speakers with different accents and dialects
Various recording conditions and audio qualities
Multiple domains and topics

Training Procedure

Preprocessing

Audio files were resampled to 16kHz
Text transcriptions were normalized
Audio segments were filtered to 1-30 seconds in duration
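The duration filter can be sketched as follows. This is an illustration only, not the preprocessing script used for this model: the function names are hypothetical, and treating both the 1-second and 30-second bounds as inclusive is an assumption.

```python
TARGET_SAMPLE_RATE = 16_000  # Hz, the resampling rate stated above
MIN_SECONDS = 1.0            # lower bound from the card (inclusive by assumption)
MAX_SECONDS = 30.0           # upper bound from the card (inclusive by assumption)


def duration_seconds(num_samples: int, sample_rate: int = TARGET_SAMPLE_RATE) -> float:
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sample_rate


def keep_segment(num_samples: int, sample_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """True if the clip falls inside the 1-30 second window."""
    return MIN_SECONDS <= duration_seconds(num_samples, sample_rate) <= MAX_SECONDS


# Hypothetical clips, described only by their sample counts at 16 kHz
segments = {"too_short": 8_000, "ok": 80_000, "too_long": 500_000}
kept = [name for name, n in segments.items() if keep_segment(n)]
print(kept)
```

The 30-second upper bound matches the fixed input window that Whisper's feature extractor pads or truncates to.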

Training Hyperparameters

Training regime: float32 precision
Batch size: 4 per device with gradient accumulation steps of 8
Learning rate: 5e-6 with warmup
Optimizer: AdamW
LoRA configuration:
- Rank (r): 8
- Alpha: 32
- Dropout: 0.1
- Target modules: q_proj, v_proj
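To make the derived quantities explicit, the sketch below collects the listed hyperparameters and computes the effective batch size and the LoRA scaling factor. The `TrainingSetup` class is hypothetical, and `num_devices = 1` is an assumption (the card only mentions a single personal machine). The keys returned by `lora_kwargs` are the argument names `peft.LoraConfig` uses, but this is not the author's actual configuration code.

```python
from dataclasses import dataclass


@dataclass
class TrainingSetup:
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 8
    learning_rate: float = 5e-6
    num_devices: int = 1  # assumption: single-device training
    lora_r: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    target_modules: tuple = ("q_proj", "v_proj")

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step
        return self.per_device_batch_size * self.gradient_accumulation_steps * self.num_devices

    @property
    def lora_scaling(self) -> float:
        # LoRA scales the low-rank update by alpha / r
        return self.lora_alpha / self.lora_r

    def lora_kwargs(self) -> dict:
        # Could be passed as peft.LoraConfig(**setup.lora_kwargs())
        return {
            "r": self.lora_r,
            "lora_alpha": self.lora_alpha,
            "lora_dropout": self.lora_dropout,
            "target_modules": list(self.target_modules),
        }


setup = TrainingSetup()
print(setup.effective_batch_size, setup.lora_scaling)
```

Under these assumptions each optimizer step sees 32 samples, and the adapter output is scaled by 4.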

Speeds, Sizes, Times

Hardware Type: Apple Silicon (M1/M2) or CPU
Training time: 2-8 hours depending on hardware
Checkpoint size: ~3.4 MB (LoRA adapters only)
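The adapter size is consistent with a back-of-envelope parameter count. The sketch below assumes the standard Whisper-small dimensions (hidden size 768, 12 encoder and 12 decoder layers), that the q_proj and v_proj targets match encoder self-attention, decoder self-attention, and decoder cross-attention, and that weights are stored as float32. None of these details are stated in this card.

```python
D_MODEL = 768            # Whisper-small hidden size (assumption)
ENCODER_LAYERS = 12      # assumption
DECODER_LAYERS = 12      # assumption
RANK = 8                 # LoRA rank from the card
TARGETS_PER_BLOCK = 2    # q_proj and v_proj
BYTES_PER_PARAM = 4      # float32 storage (assumption)

# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_BLOCK

# Each adapted projection adds A (RANK x D_MODEL) and B (D_MODEL x RANK)
params_per_module = RANK * (D_MODEL + D_MODEL)
total_params = adapted_modules * params_per_module

size_bytes = total_params * BYTES_PER_PARAM
size_mib = size_bytes / 2**20
print(total_params, size_bytes, round(size_mib, 3))
```

This gives 884,736 trainable parameters and about 3.4 MiB of weights, in line with the checkpoint size reported above.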

Evaluation

Testing Data

The model was evaluated on the FLEURS Hindi test set, which provides a standardized benchmark for multilingual speech recognition.

Metrics

Word Error Rate (WER): Primary metric for evaluating ASR performance
Lower WER indicates better performance
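For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function below is a minimal sketch with whitespace tokenization and no text normalization. It is not the evaluation script used for the reported numbers, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words
print(word_error_rate("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.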

Results

Model	WER (%)	Absolute WER reduction (points)
Baseline (Pretrained)	74.14	n/a
Fine-tuned (LoRA)	68.12	6.02

Summary

The fine-tuned model reduces WER from 74.14% to 68.12% on the FLEURS Hindi test set, an absolute improvement of 6.02 percentage points over the pretrained baseline.
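The absolute and relative reductions can be checked directly from the two reported WER values. The variable names are illustrative.

```python
baseline_wer = 74.14   # pretrained openai/whisper-small, from the results table
finetuned_wer = 68.12  # LoRA fine-tuned model, from the results table

# Absolute reduction is measured in percentage points
absolute_reduction = baseline_wer - finetuned_wer

# Relative reduction expresses the gain as a share of the baseline error
relative_reduction = absolute_reduction / baseline_wer * 100

print(f"{absolute_reduction:.2f} points absolute, {relative_reduction:.2f}% relative")
```

The relative reduction is about 8.1%, which is the more useful figure when comparing against results reported on other baselines.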

Environmental Impact

Hardware Type: Apple Silicon M-series or Intel Mac
Hours used: 2-8 hours (depending on hardware)
Carbon Emitted: Minimal (personal computing device usage)

Technical Specifications

Model Architecture and Objective

This model uses the Whisper architecture with LoRA adapters for efficient fine-tuning. The objective is to minimize the cross-entropy loss between predicted and actual token sequences.
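Written out, the token-level cross-entropy objective takes the usual sequence-to-sequence form. The notation is introduced here for illustration: x is the input audio (log-Mel features), y_1 ... y_T are the target transcript tokens, and theta denotes the trainable parameters, which under LoRA are only the adapter weights.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, x\right)
```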

Compute Infrastructure

Hardware

Apple Silicon (M1/M2/M3) with MPS acceleration
Intel Mac CPU fallback
Minimum 8GB RAM recommended
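A device-selection helper matching this hardware list might look like the sketch below. The `pick_device` function is hypothetical and is not taken from the training code. It prefers the MPS backend when PyTorch reports it as available and otherwise falls back to the CPU.

```python
import torch


def pick_device() -> torch.device:
    """Prefer Apple's MPS backend when available, otherwise use the CPU."""
    # Older PyTorch builds may not ship torch.backends.mps at all
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print(device)
```

The model and the input features would then both be moved to this device with `.to(device)` before calling `generate`.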

Software

Python 3.8+
PyTorch
Hugging Face Transformers
PEFT library
Datasets library

Citation

BibTeX:

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

APA: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Model Card Authors

Swayam Singal

Model Card Contact

For questions or issues, please contact swayam8264 on Hugging Face.

Framework versions

PEFT 0.18.0
Transformers 4.x
PyTorch 1.x
