---
title: VoiceAPI
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
---

🎙️ VoiceAPI - Multi-lingual Indian Language TTS

An advanced multi-speaker, multilingual text-to-speech (TTS) synthesizer supporting 11 Indian languages with 21 voice options.

Live API: https://huggingface.co/proxy/harshil748-voiceapi.hf.space

🌟 Features

11 Indian Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
21 Voice Options: Male and female voices for every language except Gujarati, which has a female voice only
High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
REST API: Simple GET/POST endpoints for easy integration
Real-time Synthesis: Fast inference on CPU/GPU

🗣️ Supported Languages

Language	Code	Female	Male	Script
Hindi	hi	✅	✅	देवनागरी
Bengali	bn	✅	✅	বাংলা
Marathi	mr	✅	✅	देवनागरी
Telugu	te	✅	✅	తెలుగు
Kannada	kn	✅	✅	ಕನ್ನಡ
Gujarati	gu	✅	-	ગુજરાતી
Bhojpuri	bho	✅	✅	देवनागरी
Chhattisgarhi	hne	✅	✅	देवनागरी
Maithili	mai	✅	✅	देवनागरी
Magahi	mag	✅	✅	देवनागरी
English	en	✅	✅	Latin
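
The table above can be captured as a small lookup, which also makes the count of 21 voices easy to verify. This is a minimal sketch: `VOICES`, `voice_ids`, and `has_voice` are hypothetical names, not the contents of the project's `src/config.py`, and the `<code>_<gender>` identifier format is assumed from the `models/` directory names such as `hi_male`.

```python
# Hypothetical lookup mirroring the Supported Languages table;
# this is an illustration, not the project's actual configuration.
VOICES = {
    "hi": ("female", "male"),
    "bn": ("female", "male"),
    "mr": ("female", "male"),
    "te": ("female", "male"),
    "kn": ("female", "male"),
    "gu": ("female",),  # the table lists no male voice for Gujarati
    "bho": ("female", "male"),
    "hne": ("female", "male"),
    "mai": ("female", "male"),
    "mag": ("female", "male"),
    "en": ("female", "male"),
}


def voice_ids(voices=VOICES):
    """Return identifiers such as 'hi_male' (assumed naming, see models/)."""
    return [f"{code}_{gender}" for code, genders in voices.items() for gender in genders]


def has_voice(code, gender, voices=VOICES):
    """Report whether a language code offers the requested gender."""
    return gender in voices.get(code, ())
```

Ten languages with two voices each plus one Gujarati voice gives the 21 options listed under Features.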

📡 API Usage

Endpoint

```
GET/POST /Get_Inference
```

Parameters

Parameter	Type	Required	Description
`text`	string	Yes	Text to synthesize (lowercase for English)
`lang`	string	Yes	Language name (hindi, bengali, etc.)
`speaker_wav`	file	Yes	Reference WAV file (for API compatibility)
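
The `text` and `lang` parameters travel in the query string, so non-Latin text must be percent-encoded. A minimal sketch of building the request URL with the standard library follows; `build_inference_url` is a hypothetical helper, and lowercasing `lang` is an assumption based on the lowercase examples in the table.

```python
from urllib.parse import urlencode

BASE_URL = "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference"


def build_inference_url(text, lang, base_url=BASE_URL):
    """Build the /Get_Inference URL with a percent-encoded query string."""
    if not text:
        raise ValueError("text must be non-empty")
    lang = lang.lower()  # assumption: the API expects lowercase language names
    if lang == "english":
        text = text.lower()  # the parameter table asks for lowercase English text
    # urlencode applies UTF-8 percent-encoding, which Devanagari and other scripts need.
    return f"{base_url}?{urlencode({'text': text, 'lang': lang})}"
```

The `speaker_wav` file is not part of the URL; it still has to be sent as multipart form data with the request.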

Example (Python)

```python
import requests

base_url = 'https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Send the reference WAV as multipart form data. POST is used here because
# a request body on GET is not reliably forwarded by proxies.
with open(wav_path, 'rb') as audio_file:
    response = requests.post(base_url, params=params, files={'speaker_wav': audio_file})

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

Example (cURL)

```bash
curl -X POST "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

🏗️ Model Architecture

Base Model: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
Decoder: HiFi-GAN neural vocoder
Duration Predictor: Stochastic duration predictor for natural prosody
Sample Rate: 22050 Hz (16000 Hz for Gujarati)
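
Because the sample rate differs for Gujarati, it can be useful to check the rate of a synthesized file before playback or further processing. The sketch below uses only the standard `wave` module and assumes the API returns PCM-encoded WAV data; `wav_info`, `expected_rate`, and `make_silence` are hypothetical helpers, and `make_silence` only stands in for real API output.

```python
import io
import wave


def wav_info(data):
    """Return (sample_rate_hz, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def expected_rate(lang):
    """Sample rate per the list above: 16000 Hz for Gujarati, 22050 Hz otherwise."""
    return 16000 if lang.lower() == "gujarati" else 22050


def make_silence(rate, seconds=1.0):
    """Build a mono 16-bit silent WAV in memory (a stand-in for API output)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()
```

With a real response, `wav_info(response.content)[0] == expected_rate(lang)` would confirm the rate matches the language.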

📊 Training

Datasets Used

Dataset	Languages	Hours	Source	License
OpenSLR-103	Hindi	24h	OpenSLR	CC BY 4.0
OpenSLR-37	Bengali	22h	OpenSLR	CC BY 4.0
OpenSLR-64	Marathi	30h	OpenSLR	CC BY 4.0
OpenSLR-66	Telugu	28h	OpenSLR	CC BY 4.0
OpenSLR-79	Kannada	26h	OpenSLR	CC BY 4.0
OpenSLR-78	Gujarati	25h	OpenSLR	CC BY 4.0
Common Voice	Hindi, Bengali	50h+	Mozilla	CC0
IndicTTS	Multiple	100h+	IIT Madras	Research
Indic-Voices	Multiple	200h+	AI4Bharat	CC BY 4.0

Training Configuration

Epochs: 1000
Batch Size: 32
Learning Rate: 2e-4
Optimizer: AdamW
FP16 Training: Enabled
Hardware: NVIDIA V100/A100 GPUs
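
The settings above can be grouped into one configuration object, which also makes it easy to estimate how many optimizer steps a run takes. This is a sketch only: `TrainConfig` is a hypothetical class, not the format used in `training/configs/`, and it assumes the final partial batch of each epoch is kept rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def steps_per_epoch(self, num_utterances):
        # Assumes the last, smaller batch is still used.
        return math.ceil(num_utterances / self.batch_size)

    def total_steps(self, num_utterances):
        return self.epochs * self.steps_per_epoch(num_utterances)
```

For example, a hypothetical corpus of 10,000 utterances gives 313 steps per epoch and 313,000 steps over the full run.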

Training Pipeline

1. Data Preparation (`training/prepare_dataset.py`)
   - Download audio datasets
   - Normalize audio to 22050 Hz
   - Generate text transcriptions
   - Create train/val splits
2. Model Training (`training/train_vits.py`)
   - Train VITS model with character-level tokenization
   - Multi-speaker training with speaker embeddings
   - Mixed precision training for efficiency
3. Model Export (`training/export_model.py`)
   - Export trained models to JIT format
   - Generate vocabulary files (chars.txt)
   - Package for inference
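
Two of the steps above, the train/val split and character-level tokenization with a generated vocabulary, can be sketched with the standard library. These functions are hypothetical illustrations and are not taken from the scripts in `training/`; the 10% validation fraction and the sorted-character vocabulary order are assumptions.

```python
import random


def build_vocab(transcripts):
    """Collect the sorted set of characters (code points) seen in the transcripts."""
    return sorted({ch for line in transcripts for ch in line})


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


def encode(text, vocab):
    """Character-level tokenization: map each character to its vocabulary index."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in text]
```

Note that for Devanagari and other Indic scripts, iterating over a Python string yields code points, so vowel signs and other combining marks become separate vocabulary entries.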

See `training/` directory for full training scripts and configurations.

Project Structure

```
VoiceAPI/
├── app.py                  # Application entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   ├── tokenizer.py        # Text tokenization
│   └── model_loader.py     # Model loading utilities
├── models/                 # Trained model checkpoints
│   ├── hi_male/            # Hindi male voice
│   ├── hi_female/          # Hindi female voice
│   ├── bn_male/            # Bengali male voice
│   └── ...                 # Other voices
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```

📜 License

Code: MIT License
Models: CC BY 4.0
Datasets: Individual licenses (see training/datasets.csv)

🙏 Acknowledgments

SYSPIN IISc SPIRE Lab for Indian language speech research
Facebook MMS for multilingual TTS
Coqui TTS for the TTS library
AI4Bharat for Indian language resources
OpenSLR for speech datasets

📧 Contact

Built for the Voice Tech for All Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.