---
title: VoiceAPI
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
---

# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

A **multi-speaker, multilingual text-to-speech (TTS) synthesizer** covering 11 languages (10 Indian languages plus English) with 21 voice options.

**Live API**: [https://huggingface.co/proxy/harshil748-voiceapi.hf.space](https://huggingface.co/proxy/harshil748-voiceapi.hf.space)

## 🌟 Features

- **11 Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
- **21 Voice Options**: Male and female voices for every language except Gujarati, which has a female voice only
- **High-Quality Audio**: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
- **REST API**: Simple GET/POST endpoints for easy integration
- **Real-time Synthesis**: Fast inference on CPU/GPU

## 🗣️ Supported Languages

| Language | Code | Female | Male | Script |
|----------|------|--------|------|--------|
| Hindi | hi | ✅ | ✅ | देवनागरी |
| Bengali | bn | ✅ | ✅ | বাংলা |
| Marathi | mr | ✅ | ✅ | देवनागरी |
| Telugu | te | ✅ | ✅ | తెలుగు |
| Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
| Gujarati | gu | ✅ | - | ગુજરાતી |
| Bhojpuri | bho | ✅ | ✅ | देवनागरी |
| Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
| Maithili | mai | ✅ | ✅ | देवनागरी |
| Magahi | mag | ✅ | ✅ | देवनागरी |
| English | en | ✅ | ✅ | Latin |
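
The table above can be captured as a small lookup. The sketch below is illustrative only: the `VOICES` mapping, the `voice_id` helper, and the `<code>_<gender>` identifier format are assumptions modeled on the directory names under `models/` (for example `hi_male`), not the actual contents of `src/config.py`.

```python
# Hypothetical voice catalog built from the Supported Languages table.
# Language name -> (language code, available genders).
VOICES = {
    "hindi": ("hi", ("female", "male")),
    "bengali": ("bn", ("female", "male")),
    "marathi": ("mr", ("female", "male")),
    "telugu": ("te", ("female", "male")),
    "kannada": ("kn", ("female", "male")),
    "gujarati": ("gu", ("female",)),  # no male voice listed in the table
    "bhojpuri": ("bho", ("female", "male")),
    "chhattisgarhi": ("hne", ("female", "male")),
    "maithili": ("mai", ("female", "male")),
    "magahi": ("mag", ("female", "male")),
    "english": ("en", ("female", "male")),
}


def voice_id(lang: str, gender: str) -> str:
    """Return an identifier such as 'hi_male' (assumed naming scheme)."""
    key = lang.strip().lower()
    if key not in VOICES:
        raise ValueError(f"Unsupported language: {lang!r}")
    code, genders = VOICES[key]
    if gender not in genders:
        raise ValueError(f"No {gender} voice for {lang}")
    return f"{code}_{gender}"


total_voices = sum(len(genders) for _, genders in VOICES.values())
print(len(VOICES), total_voices)  # 11 21
```

Counting the entries this way reproduces the 11 languages and 21 voices stated in the feature list.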

## 📡 API Usage

### Endpoint

```
GET/POST /Get_Inference
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
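
As a rough sketch of how `text` and `lang` could be placed in the query string, the helper below builds the request URL with the standard library. `build_inference_url` is a hypothetical name, and lowercasing English input on the client side is an assumption based on the note in the table. The `speaker_wav` file is not part of the URL and would be uploaded separately as form data.

```python
from urllib.parse import urlencode

# Base URL as given in this README.
BASE_URL = "https://huggingface.co/proxy/harshil748-voiceapi.hf.space"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Build the /Get_Inference URL with `text` and `lang` as query parameters."""
    if not text.strip():
        raise ValueError("text must not be empty")
    lang = lang.strip().lower()
    if lang == "english":
        # The parameter table asks for lowercase English text.
        text = text.lower()
    # urlencode percent-encodes non-ASCII text and turns spaces into '+'.
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"


print(build_inference_url("Hello World", "english"))
```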

### Example (Python)

```python
import requests

base_url = 'https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

# `text` and `lang` are sent as query parameters.
params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# The reference WAV is uploaded as multipart form data, so the request uses POST.
with open(wav_path, 'rb') as audio_file:
    response = requests.post(base_url, params=params,
                             files={'speaker_wav': audio_file}, timeout=60)

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

### Example (cURL)

```bash
curl -X POST "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

## 🏗️ Model Architecture

- **Base Model**: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
- **Decoder**: HiFi-GAN-based generator that produces the waveform directly
- **Duration Predictor**: Stochastic duration predictor for natural prosody
- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati)
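
Because the sample rate differs for Gujarati, a client may want to confirm what it received. The sketch below reads the rate and duration from a WAV header using the standard `wave` module. It assumes the API returns uncompressed PCM WAV data, which this README does not state explicitly, and the helper names are illustrative. A generated silent clip stands in for real API output.

```python
import io
import wave

# Sample rates stated in the architecture list above.
DEFAULT_SAMPLE_RATE = 22050
GUJARATI_SAMPLE_RATE = 16000


def expected_sample_rate(lang: str) -> int:
    """Return the documented sample rate for a language name."""
    if lang.strip().lower() == "gujarati":
        return GUJARATI_SAMPLE_RATE
    return DEFAULT_SAMPLE_RATE


def wav_info(wav_bytes: bytes) -> tuple[int, float]:
    """Return (sample_rate, duration_seconds) read from a PCM WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def make_silence(sample_rate: int, seconds: float) -> bytes:
    """Create a silent 16-bit mono WAV as a stand-in for API output."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


rate, duration = wav_info(make_silence(expected_sample_rate("hindi"), 0.5))
print(rate, duration)  # 22050 0.5
```

With a real response, `wav_info(response.content)` could be compared against `expected_sample_rate(lang)`.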

## 📊 Training

### Datasets Used

| Dataset | Languages | Hours | Source | License |
|---------|-----------|-------|--------|---------|
| OpenSLR-103 | Hindi | 24h | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
| OpenSLR-37 | Bengali | 22h | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
| OpenSLR-64 | Marathi | 30h | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
| OpenSLR-66 | Telugu | 28h | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
| OpenSLR-79 | Kannada | 26h | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
| OpenSLR-78 | Gujarati | 25h | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
| Common Voice | Hindi, Bengali | 50h+ | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
| IndicTTS | Multiple | 100h+ | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
| Indic-Voices | Multiple | 200h+ | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |

### Training Configuration

- **Epochs**: 1000
- **Batch Size**: 32
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW
- **FP16 Training**: Enabled
- **Hardware**: NVIDIA V100/A100 GPUs
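
For illustration, the settings above could be grouped in a single configuration object. This is only a sketch: the `TrainingConfig` class and its field names are hypothetical and do not reflect the actual files under `training/configs/`. The utterance count in the usage line is an arbitrary figure, not a dataset size.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters from the list above; field names are hypothetical."""

    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True


def steps_per_epoch(num_utterances: int, batch_size: int) -> int:
    """Optimizer steps per epoch, counting a final partial batch."""
    return -(-num_utterances // batch_size)  # ceiling division


config = TrainingConfig()
print(json.dumps(asdict(config), indent=2))
# 10,000 utterances is an illustrative number only.
print(steps_per_epoch(10_000, config.batch_size))  # 313
```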

### Training Pipeline

1. **Data Preparation** (`training/prepare_dataset.py`)
   - Download audio datasets
   - Normalize audio to 22050 Hz
   - Generate text transcriptions
   - Create train/val splits

2. **Model Training** (`training/train_vits.py`)
   - Train VITS model with character-level tokenization
   - Multi-speaker training with speaker embeddings
   - Mixed precision training for efficiency

3. **Model Export** (`training/export_model.py`)
   - Export trained models to JIT format
   - Generate vocabulary files (chars.txt)
   - Package for inference

See the `training/` directory for the full training scripts and configurations.
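
The pipeline mentions train/val splits, character-level tokenization, and a vocabulary file. The sketch below shows one minimal way these steps might look. It is not taken from the training scripts: the function names, the 10% default validation fraction, and the choice to skip unknown characters are all assumptions, and the format of `chars.txt` is not specified here.

```python
import random


def build_vocab(transcripts: list[str]) -> list[str]:
    """Collect the sorted set of characters (code points) in the transcripts."""
    return sorted({ch for line in transcripts for ch in line})


def encode(text: str, vocab: list[str]) -> list[int]:
    """Map each character to its vocabulary index, skipping unknown characters."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in text if ch in index]


def train_val_split(items: list[str], val_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and hold out a validation fraction."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


transcripts = ["नमस्ते", "आप कैसे हैं", "hello"]
vocab = build_vocab(transcripts)
print(encode("hello", vocab))  # [2, 1, 3, 3, 4]
train, val = train_val_split(transcripts, val_fraction=0.34)
print(len(train), len(val))  # 2 1
```

Note that this treats every Unicode code point as a token, so Devanagari vowel signs and the virama become separate vocabulary entries.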

## 📁 Project Structure

```
VoiceAPI/
├── app.py                 # Application entry point
├── Dockerfile             # Docker configuration
├── requirements.txt       # Python dependencies
├── src/
│   ├── api.py             # FastAPI REST server
│   ├── engine.py          # TTS inference engine
│   ├── config.py          # Voice configurations
│   ├── tokenizer.py       # Text tokenization
│   └── model_loader.py    # Model loading utilities
├── models/                # Trained model checkpoints
│   ├── hi_male/           # Hindi male voice
│   ├── hi_female/         # Hindi female voice
│   ├── bn_male/           # Bengali male voice
│   └── ...                # Other voices
└── training/
    ├── train_vits.py      # VITS training script
    ├── prepare_dataset.py # Data preparation
    ├── export_model.py    # Model export
    ├── datasets.csv       # Dataset links
    └── configs/           # Training configs
```
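
Given the `models/<code>_<gender>/` layout shown in the tree, locating a voice on disk might look like the sketch below. The `list_voices` and `voice_dir` helpers are hypothetical and are not the API of `src/model_loader.py`. The demonstration runs against a temporary directory that mimics the three voices named above.

```python
import tempfile
from pathlib import Path


def list_voices(models_dir: Path) -> list[str]:
    """Return the sorted names of voice directories found under models/."""
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


def voice_dir(models_dir: Path, code: str, gender: str) -> Path:
    """Resolve e.g. models/hi_male, raising if the directory is missing."""
    path = models_dir / f"{code}_{gender}"
    if not path.is_dir():
        raise FileNotFoundError(f"No model directory for voice {path.name!r}")
    return path


# Temporary layout mimicking the tree above; not the real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models"
    for name in ("hi_male", "hi_female", "bn_male"):
        (models / name).mkdir(parents=True)
    found = list_voices(models)
    resolved_name = voice_dir(models, "hi", "male").name

print(found)  # ['bn_male', 'hi_female', 'hi_male']
```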

## 📜 License

- **Code**: MIT License
- **Models**: CC BY 4.0
- **Datasets**: Individual licenses (see training/datasets.csv)

## 🙏 Acknowledgments

- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for Indian language speech research
- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for multilingual TTS
- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources
- [OpenSLR](https://www.openslr.org/) for speech datasets

## 📧 Contact

Built for the **Voice Tech for All** Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.