---
title: VoiceAPI
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
---

🎙️ VoiceAPI - Multi-lingual Indian Language TTS

A multi-speaker, multilingual text-to-speech (TTS) synthesizer supporting 11 languages (10 Indian languages plus English) with 21 voice options.

Live API: https://huggingface.co/proxy/harshil748-voiceapi.hf.space

🌟 Features

  • 11 Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, and English
  • 21 Voice Options: Male and female voices across the supported languages (see the Supported Languages table for availability)
  • High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
  • REST API: Simple GET/POST endpoints for easy integration
  • Real-time Synthesis: Fast inference on CPU/GPU

🗣️ Supported Languages

Language Code Female Male Script
Hindi hi देवनागरी
Bengali bn বাংলা
Marathi mr देवनागरी
Telugu te తెలుగు
Kannada kn ಕನ್ನಡ
Gujarati gu - ગુજરાતી
Bhojpuri bho देवनागरी
Chhattisgarhi hne देवनागरी
Maithili mai देवनागरी
Magahi mag देवनागरी
English en Latin

📡 API Usage

Endpoint

```
GET/POST /Get_Inference
```

Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |

Example (Python)

```python
import requests

base_url = 'https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as a multipart upload alongside the query parameters.
with open(wav_path, 'rb') as audio_file:
    response = requests.get(base_url, params=params, files={'speaker_wav': audio_file.read()})

# On success the response body is the synthesized audio.
if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

Example (cURL)

```bash
curl -X POST "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```
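Both examples pass `text` and `lang` in the query string, so non-ASCII text such as Devanagari has to be percent-encoded when the URL is built by hand. The sketch below shows one way to do that with the standard library. The helper name `build_inference_url` is hypothetical, and lowercasing the language name is an assumption based on the lowercase values shown in the parameter table.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples in this README.
BASE_URL = "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return the endpoint URL with `text` and `lang` percent-encoded.

    Hypothetical client-side helper, not part of the VoiceAPI code base.
    """
    # Assumption: the server expects lowercase language names ("hindi", "english").
    lang = lang.strip().lower()
    # The parameter table asks for lowercase text when the language is English.
    if lang == "english":
        text = text.lower()
    # urlencode encodes non-ASCII characters as UTF-8 percent escapes.
    return f"{base_url}?{urlencode({'text': text, 'lang': lang})}"


print(build_inference_url("Hello World", "English"))
print(build_inference_url("नमस्ते, आप कैसे हैं?", "hindi"))
```

The resulting URL can be used with `curl` or any HTTP client. `requests` performs the same encoding automatically when the values are passed through `params=`.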

🏗️ Model Architecture

  • Base Model: VITS (Variational Inference with adversarial learning for Text-to-Speech)
  • Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
  • Decoder: HiFi-GAN neural vocoder
  • Duration Predictor: Stochastic duration predictor for natural prosody
  • Sample Rate: 22050 Hz (16000 Hz for Gujarati)
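A client can check that returned audio matches the sample rates listed above by reading the WAV header. The following is a minimal sketch that assumes the API returns uncompressed PCM WAV data. The helper names are hypothetical, and `silent_wav` only stands in for a real API response.

```python
import io
import wave


def wav_info(wav_bytes: bytes) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for PCM WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def expected_sample_rate(lang: str) -> int:
    """Sample rate stated in this README: 16000 Hz for Gujarati, else 22050 Hz."""
    return 16000 if lang.strip().lower() == "gujarati" else 22050


def silent_wav(sample_rate: int, seconds: float = 1.0) -> bytes:
    """Build a mono 16-bit silent WAV in memory (stand-in for an API response)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


rate, duration = wav_info(silent_wav(expected_sample_rate("gujarati")))
print(rate, duration)  # 16000 1.0
```

In real use, `wav_info(response.content)` would replace the `silent_wav` call, and the result would be compared against `expected_sample_rate(lang)`.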

📊 Training

Datasets Used

| Dataset | Languages | Hours | Source | License |
|---------|-----------|-------|--------|---------|
| OpenSLR-103 | Hindi | 24h | OpenSLR | CC BY 4.0 |
| OpenSLR-37 | Bengali | 22h | OpenSLR | CC BY 4.0 |
| OpenSLR-64 | Marathi | 30h | OpenSLR | CC BY 4.0 |
| OpenSLR-66 | Telugu | 28h | OpenSLR | CC BY 4.0 |
| OpenSLR-79 | Kannada | 26h | OpenSLR | CC BY 4.0 |
| OpenSLR-78 | Gujarati | 25h | OpenSLR | CC BY 4.0 |
| Common Voice | Hindi, Bengali | 50h+ | Mozilla | CC0 |
| IndicTTS | Multiple | 100h+ | IIT Madras | Research |
| Indic-Voices | Multiple | 200h+ | AI4Bharat | CC BY 4.0 |

Training Configuration

  • Epochs: 1000
  • Batch Size: 32
  • Learning Rate: 2e-4
  • Optimizer: AdamW
  • FP16 Training: Enabled
  • Hardware: NVIDIA V100/A100 GPUs
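The settings above can be captured in a small, serializable config object. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the files under `training/configs/`, and the utterance count in the usage line is made up to show the arithmetic.

```python
import json
import math
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this README (field names are hypothetical)."""

    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def steps_per_epoch(self, num_utterances: int) -> int:
        """Optimizer steps per epoch, assuming the last partial batch is kept."""
        return math.ceil(num_utterances / self.batch_size)


config = TrainConfig()
print(json.dumps(asdict(config), indent=2))

# Example with a made-up dataset size of 10,000 utterances.
print(config.steps_per_epoch(10_000))  # 313
```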

Training Pipeline

  1. Data Preparation (`training/prepare_dataset.py`)

    • Download audio datasets
    • Resample audio to 22050 Hz
    • Generate text transcriptions
    • Create train/val splits
  2. Model Training (`training/train_vits.py`)

    • Train VITS model with character-level tokenization
    • Multi-speaker training with speaker embeddings
    • Mixed precision training for efficiency
  3. Model Export (`training/export_model.py`)

    • Export trained models to JIT format
    • Generate vocabulary files (chars.txt)
    • Package for inference
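The "create train/val splits" step can be sketched as a seeded shuffle followed by a cut. The function below is an assumption about how such a split could be done and is not the code in `training/prepare_dataset.py`. The `path|text` manifest format, the 5% validation fraction, and the seed are all hypothetical.

```python
import random


def train_val_split(items, val_fraction=0.05, seed=1234):
    """Shuffle deterministically and return (train_items, val_items)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation item, even for tiny datasets.
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical manifest lines in "audio_path|transcription" form.
manifest = [f"wavs/{i:04d}.wav|sample text {i}" for i in range(100)]
train, val = train_val_split(manifest)
print(len(train), len(val))  # 95 5
```

Fixing the seed means the same utterances land in the validation set on every run, which keeps validation metrics comparable between training runs.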

See `training/` directory for full training scripts and configurations.
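Character-level tokenization, as used in the training step, maps each Unicode character in the transcript to an integer ID drawn from a vocabulary built over the training text. The sketch below illustrates the idea only. It is not the project's `src/tokenizer.py`, the `<pad>` symbol at index 0 is an assumption, and the actual layout of `chars.txt` is not specified here.

```python
def build_vocab(transcripts):
    """Collect every distinct character, sorted by code point."""
    chars = sorted({ch for line in transcripts for ch in line})
    # Assumption: index 0 is reserved for a padding symbol.
    return ["<pad>"] + chars


def encode(text, vocab):
    """Map text to a list of IDs, skipping characters missing from the vocab."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in text if ch in index]


vocab = build_vocab(["ab", "ba c"])
print(vocab)                 # ['<pad>', ' ', 'a', 'b', 'c']
print(encode("cab", vocab))  # [4, 2, 3]

# Devanagari vowel signs and the virama are separate code points,
# so each one becomes its own token.
hindi_vocab = build_vocab(["नमस्ते"])
print(encode("नमस्ते", hindi_vocab))
```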

📁 Project Structure

```
VoiceAPI/
├── app.py                  # Application entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   ├── tokenizer.py        # Text tokenization
│   └── model_loader.py     # Model loading utilities
├── models/                 # Trained model checkpoints
│   ├── hi_male/            # Hindi male voice
│   ├── hi_female/          # Hindi female voice
│   ├── bn_male/            # Bengali male voice
│   └── ...                 # Other voices
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```
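The `models/` layout follows a `<language code>_<gender>` naming pattern. The sketch below shows how a language name from the API could be mapped to such a directory, using the codes from the Supported Languages table. The `voice_dir` helper is hypothetical and is not the project's `src/config.py`. It does not check whether a given voice actually exists.

```python
from pathlib import PurePosixPath

# Language names and codes from the Supported Languages table.
LANG_CODES = {
    "hindi": "hi", "bengali": "bn", "marathi": "mr", "telugu": "te",
    "kannada": "kn", "gujarati": "gu", "bhojpuri": "bho",
    "chhattisgarhi": "hne", "maithili": "mai", "magahi": "mag",
    "english": "en",
}


def voice_dir(lang: str, gender: str, models_root: str = "models") -> PurePosixPath:
    """Return the checkpoint directory for a voice, e.g. models/hi_male."""
    lang, gender = lang.strip().lower(), gender.strip().lower()
    if lang not in LANG_CODES:
        raise ValueError(f"Unsupported language: {lang!r}")
    if gender not in ("male", "female"):
        raise ValueError(f"Unsupported gender: {gender!r}")
    return PurePosixPath(models_root) / f"{LANG_CODES[lang]}_{gender}"


print(voice_dir("Hindi", "male"))      # models/hi_male
print(voice_dir("bengali", "female"))  # models/bn_female
```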

📜 License

  • Code: MIT License
  • Models: CC BY 4.0
  • Datasets: Individual licenses (see training/datasets.csv)

🙏 Acknowledgments

📧 Contact

Built for the Voice Tech for All Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.