Whisper-Small Hindi ASR (LoRA Fine-tuned)

This is a fine-tuned version of openai/whisper-small for Hindi automatic speech recognition, using LoRA (Low-Rank Adaptation) for parameter-efficient training.

Model Details

Model Description

This model adapts OpenAI's Whisper-small to Hindi speech recognition. It was trained with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that trains only a small set of adapter parameters while the base model stays frozen.
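As a rough illustration of what "training only adapter parameters" means, the standard LoRA formulation can be written as below. This is the general form from the LoRA literature and the PEFT default scaling, not something stated explicitly in this card. The symbols $W_0$, $A$, $B$, $d_{\text{in}}$ and $d_{\text{out}}$ are introduced here for illustration, while $r = 8$ and $\alpha = 32$ are the values listed under Training Hyperparameters.

```latex
% Frozen base weight W_0 plus a trainable low-rank update B A
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
r \ll \min(d_{\text{in}}, d_{\text{out}})

% With the values from this card (assuming the default alpha / r scaling):
\frac{\alpha}{r} = \frac{32}{8} = 4
```

Only $A$ and $B$ receive gradients. $W_0$ stays fixed, which is why the saved adapter is much smaller than the base model.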

  • Developed by: Swayam Singal
  • Model type: Speech Recognition (ASR) with LoRA adapters
  • Language: Hindi
  • License: MIT
  • Finetuned from model: openai/whisper-small

Uses

Direct Use

This model can be used for:

  • Transcribing Hindi audio to text
  • Speech recognition applications in Hindi
  • Building voice assistants or transcription services for Hindi speakers

Downstream Use

This model can be fine-tuned further for specific domains or use cases requiring Hindi speech recognition.

Out-of-Scope Use

This model is specifically trained for Hindi and may not perform well on other languages. It is intended for general Hindi speech recognition and may not be suitable for specialized domains without additional fine-tuning.

Bias, Risks, and Limitations

As with all speech recognition models, this model may have biases related to:

  • Speaker demographics (age, gender, regional accents)
  • Audio quality variations
  • Domain-specific vocabulary

Recommendations

Users should be aware of these limitations and test the model on their specific use cases. For production use, additional evaluation and possibly domain-specific fine-tuning may be required.

How to Get Started with the Model

Using Transformers Library

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch
import librosa

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="hindi", task="transcribe")

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "swayam8264/whisper-small-hi-lora")

# Load and preprocess audio
audio, sampling_rate = librosa.load("path/to/hindi/audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

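# Generate transcription; inputs are passed by keyword, and language/task are
# set explicitly so the decoder prompt is forced to Hindi transcription
model.eval()
with torch.no_grad():
    predicted_ids = model.generate(input_features=input_features, language="hindi", task="transcribe")
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)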

print(transcription[0])

Training Details

Training Data

The model was trained on approximately 10 hours of Hindi audio data from various sources. The training dataset includes:

  • Diverse Hindi speakers with different accents and dialects
  • Various recording conditions and audio qualities
  • Multiple domains and topics

Training Procedure

Preprocessing

  • Audio files were resampled to 16kHz
  • Text transcriptions were normalized
  • Audio segments were filtered to 1-30 seconds in duration
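The duration filter above could be sketched as follows. This is a minimal illustration, not the author's actual preprocessing code: the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
TARGET_SAMPLE_RATE = 16_000  # Hz, the resampling rate listed above
MIN_SECONDS = 1.0            # assumed inclusive lower bound
MAX_SECONDS = 30.0           # assumed inclusive upper bound


def duration_seconds(num_samples: int, sample_rate: int = TARGET_SAMPLE_RATE) -> float:
    """Return the clip length in seconds for a mono waveform."""
    return num_samples / sample_rate


def keep_segment(num_samples: int, sample_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """True if the clip falls inside the 1-30 second window."""
    return MIN_SECONDS <= duration_seconds(num_samples, sample_rate) <= MAX_SECONDS


# Hypothetical clip lengths, in samples at 16 kHz (0.5 s, 5 s, 31.25 s).
clips = {"too_short": 8_000, "ok": 80_000, "too_long": 500_000}
kept = [name for name, n in clips.items() if keep_segment(n)]
print(kept)
```

The 30-second upper bound matches the fixed input window Whisper operates on, so longer clips would be truncated by the feature extractor anyway.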

Training Hyperparameters

  • Training regime: float32 (full precision)
  • Batch size: 4 per device with gradient accumulation steps of 8
  • Learning rate: 5e-6 with warmup
  • Optimizer: AdamW
  • LoRA configuration:
    • Rank (r): 8
    • Alpha: 32
    • Dropout: 0.1
    • Target modules: q_proj, v_proj
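To make the batch settings concrete, the number of examples contributing to each optimizer step can be worked out as below. This is a small sketch that assumes a single training device, which the card does not state explicitly; the helper name is hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Examples contributing to one optimizer step."""
    return per_device * grad_accum_steps * num_devices


# Values from the hyperparameter list; num_devices=1 is an assumption.
print(effective_batch_size(per_device=4, grad_accum_steps=8))
```

Gradient accumulation lets a memory-limited machine approximate a larger batch by summing gradients over several small forward/backward passes before each weight update.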

Speeds, Sizes, Times

  • Hardware Type: Apple Silicon (M1/M2) or CPU
  • Training time: 2-8 hours depending on hardware
  • Checkpoint size: ~3.4 MB (LoRA adapters only)
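The checkpoint size can be sanity-checked from the LoRA settings. The estimate below is a back-of-the-envelope sketch under stated assumptions: whisper-small has a hidden size of 768 with 12 encoder and 12 decoder layers, the q_proj/v_proj targets match every attention block (encoder self-attention, decoder self-attention, and decoder cross-attention), and the adapter weights are stored in float32.

```python
D_MODEL = 768                 # hidden size of whisper-small
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 8                      # LoRA rank from the hyperparameter list
TARGET_MODULES = ("q_proj", "v_proj")
BYTES_PER_PARAM = 4           # float32 storage (assumption)


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """A is rank x d_in and B is d_out x rank, so rank * (d_in + d_out) in total."""
    return rank * (d_in + d_out)


# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * len(TARGET_MODULES)
total_params = adapted_modules * lora_params_per_module(D_MODEL, D_MODEL, RANK)
size_mib = total_params * BYTES_PER_PARAM / 2**20

print(f"{total_params:,} trainable parameters, about {size_mib:.2f} MiB")
```

Under these assumptions the estimate comes to roughly 3.4 MiB, which is consistent with the checkpoint size reported above.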

Evaluation

Testing Data

The model was evaluated on the FLEURS Hindi test set, which provides a standardized benchmark for multilingual speech recognition.

Metrics

  • Word Error Rate (WER): Primary metric for evaluating ASR performance
  • Lower WER indicates better performance
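For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function below is a minimal, dependency-free sketch of that definition. It is not the evaluation script used for this model, and it applies no text normalization, which real evaluations usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("मैं घर जा रहा हूँ", "मैं घर जा रहे हूँ"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.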

Results

Model                    WER (%)    Absolute improvement (points)
Baseline (pretrained)    74.14      -
Fine-tuned (LoRA)        68.12      6.02

Summary

On the FLEURS Hindi test set, the fine-tuned model lowers WER from 74.14% to 68.12%, an absolute improvement of 6.02 percentage points (about an 8.1% relative reduction) over the pretrained baseline.
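The difference between an absolute and a relative WER improvement can be checked with a few lines of arithmetic, using only the two WER figures reported in the results:

```python
baseline_wer = 74.14   # pretrained whisper-small, WER in percent
finetuned_wer = 68.12  # LoRA fine-tuned model, WER in percent

# Absolute improvement: difference in percentage points.
absolute_gain = baseline_wer - finetuned_wer

# Relative improvement: the same difference as a share of the baseline.
relative_gain = absolute_gain / baseline_wer * 100

print(f"absolute: {absolute_gain:.2f} points")
print(f"relative: {relative_gain:.2f}%")
```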

Environmental Impact

  • Hardware Type: Apple Silicon M-series or Intel Mac
  • Hours used: 2-8 hours (depending on hardware)
  • Carbon Emitted: Minimal (personal computing device usage)

Technical Specifications

Model Architecture and Objective

This model uses the Whisper architecture with LoRA adapters for efficient fine-tuning. The objective is to minimize the cross-entropy loss between predicted and actual token sequences.
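Written out, the training objective takes the usual sequence-to-sequence form shown below. The notation is introduced here for illustration and is an assumption about the setup: $x$ stands for the input audio features, $y_1, \dots, y_T$ for the reference transcript tokens, and $\theta$ for the trainable LoRA parameters, with all base weights frozen.

```latex
% Token-level cross-entropy with teacher forcing
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
```

Each term penalizes the model for assigning low probability to the correct next token given the audio and the preceding reference tokens.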

Compute Infrastructure

Hardware

  • Apple Silicon (M1/M2/M3) with MPS acceleration
  • Intel Mac CPU fallback
  • Minimum 8GB RAM recommended

Software

  • Python 3.8+
  • PyTorch
  • Hugging Face Transformers
  • PEFT library
  • Datasets library

Citation

BibTeX:

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

APA: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Model Card Authors

Swayam Singal

Model Card Contact

For questions or issues, please contact swayam8264 on Hugging Face.

Framework versions

  • PEFT 0.18.0
  • Transformers 4.x
  • PyTorch 1.x