---
title: VoiceAPI
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
---
# VoiceAPI - Multi-lingual Indian Language TTS

A **multi-speaker, multilingual text-to-speech (TTS) synthesizer** covering 11 languages (10 Indian languages plus English) with 21 voice options.

**Live API**: [https://huggingface.co/proxy/harshil748-voiceapi.hf.space](https://huggingface.co/proxy/harshil748-voiceapi.hf.space)
## Features

- **11 Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
- **21 Voice Options**: Male and female voices for every language except Gujarati, which has a female voice only
- **High-Quality Audio**: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
- **REST API**: Simple GET/POST endpoints for easy integration
- **Real-time Synthesis**: Fast inference on CPU/GPU
## 🗣️ Supported Languages

| Language | Code | Female | Male | Script |
|----------|------|--------|------|--------|
| Hindi | hi | ✓ | ✓ | देवनागरी |
| Bengali | bn | ✓ | ✓ | বাংলা |
| Marathi | mr | ✓ | ✓ | देवनागरी |
| Telugu | te | ✓ | ✓ | తెలుగు |
| Kannada | kn | ✓ | ✓ | ಕನ್ನಡ |
| Gujarati | gu | ✓ | - | ગુજરાતી |
| Bhojpuri | bho | ✓ | ✓ | देवनागरी |
| Chhattisgarhi | hne | ✓ | ✓ | देवनागरी |
| Maithili | mai | ✓ | ✓ | देवनागरी |
| Magahi | mag | ✓ | ✓ | देवनागरी |
| English | en | ✓ | ✓ | Latin |
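
The table above can be transcribed into a small lookup to enumerate the available voices. This is only a sketch: `VOICES`, `list_voices`, and `voice_id` are hypothetical names, not the contents of `src/config.py`, and the `<code>_<gender>` identifier pattern is an assumption extrapolated from the `hi_male`, `hi_female`, and `bn_male` directories shown in the project structure.

```python
# Voice availability per language code, transcribed from the table above.
VOICES = {
    "hi": ("female", "male"),
    "bn": ("female", "male"),
    "mr": ("female", "male"),
    "te": ("female", "male"),
    "kn": ("female", "male"),
    "gu": ("female",),  # the table lists no male Gujarati voice
    "bho": ("female", "male"),
    "hne": ("female", "male"),
    "mai": ("female", "male"),
    "mag": ("female", "male"),
    "en": ("female", "male"),
}


def list_voices(voices=VOICES):
    """Return (language_code, gender) pairs for every available voice."""
    return [(code, gender) for code, genders in voices.items() for gender in genders]


def voice_id(code, gender):
    """Build an identifier such as 'hi_male' (assumed naming pattern)."""
    if gender not in VOICES.get(code, ()):
        raise ValueError(f"No {gender} voice for language code {code!r}")
    return f"{code}_{gender}"


print(len(list_voices()))  # 21
```

Counting the pairs reproduces the 21 voice options quoted in the feature list: 11 languages with two voices each, minus the missing male Gujarati voice.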
## 📡 API Usage

### Endpoint

```
GET/POST /Get_Inference
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |
### Example (Python)

```python
import requests

base_url = 'https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'  # reference WAV, required for API compatibility

# text and lang travel in the query string
params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Upload the reference audio as multipart form data (POST, as in the cURL example)
with open(wav_path, 'rb') as audio_file:
    response = requests.post(base_url, params=params, files={'speaker_wav': audio_file})

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```
### Example (cURL)

```bash
curl -X POST "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```
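
The parameter rules from the table above (a required `text`, a language name for `lang`, lowercase text for English) can be checked on the client before a request is sent. The helper below is a minimal sketch under stated assumptions: `build_params` and `SUPPORTED_LANGUAGES` are hypothetical names, and only `hindi`, `bengali`, and `english` are spelled out in this README; the remaining language names are assumed to follow the same lowercase pattern.

```python
# Language names assumed to be accepted by the `lang` parameter.
# Only "hindi", "bengali", and "english" appear verbatim in this README.
SUPPORTED_LANGUAGES = {
    "hindi", "bengali", "marathi", "telugu", "kannada", "gujarati",
    "bhojpuri", "chhattisgarhi", "maithili", "magahi", "english",
}


def build_params(text: str, lang: str) -> dict:
    """Validate and normalize the query parameters for /Get_Inference."""
    lang = lang.strip().lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {lang!r}")
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if lang == "english":
        # The parameter table asks for lowercase English text.
        text = text.lower()
    return {"text": text, "lang": lang}


print(build_params("Hello World", "English"))
```

The returned dictionary can be passed as `params=` to `requests`, with the reference WAV still sent separately as `speaker_wav`.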
## Model Architecture

- **Base Model**: VITS (Variational Inference with adversarial learning for Text-to-Speech)
- **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
- **Decoder**: HiFi-GAN neural vocoder
- **Duration Predictor**: Stochastic duration predictor for natural prosody
- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati)
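
Because the sample rate depends on the voice, it can be worth confirming what a synthesized file actually contains. The following sketch reads the header with Python's standard `wave` module; `wav_info` and `EXPECTED_RATES` are hypothetical names, and the code assumes the API returns uncompressed PCM WAV, which this README does not state explicitly.

```python
import wave

# Sample rates documented above: 22050 Hz, or 16000 Hz for Gujarati.
EXPECTED_RATES = {22050, 16000}


def wav_info(path):
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
        "rate_is_expected": rate in EXPECTED_RATES,
    }
```

For example, `wav_info("output.wav")` on a file saved by the Python example reports its sample rate and length in seconds.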
## Training

### Datasets Used

| Dataset | Languages | Hours | Source | License |
|---------|-----------|-------|--------|---------|
| OpenSLR-103 | Hindi | 24h | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
| OpenSLR-37 | Bengali | 22h | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
| OpenSLR-64 | Marathi | 30h | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
| OpenSLR-66 | Telugu | 28h | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
| OpenSLR-79 | Kannada | 26h | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
| OpenSLR-78 | Gujarati | 25h | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
| Common Voice | Hindi, Bengali | 50h+ | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
| IndicTTS | Multiple | 100h+ | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
| Indic-Voices | Multiple | 200h+ | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |
### Training Configuration

- **Epochs**: 1000
- **Batch Size**: 32
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW
- **FP16 Training**: Enabled
- **Hardware**: NVIDIA V100/A100 GPUs
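
The hyperparameters above can be captured in a small configuration object, which also makes the step-count arithmetic explicit. This is an illustrative sketch only: `TrainConfig` and its field names are hypothetical and are not taken from `training/configs/`, the dataset size in the example is invented, and the step count assumes the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in this README (field names are hypothetical)."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def steps_per_epoch(self, num_utterances: int) -> int:
        # Assumes the last, smaller batch is not dropped.
        return math.ceil(num_utterances / self.batch_size)

    def total_steps(self, num_utterances: int) -> int:
        return self.epochs * self.steps_per_epoch(num_utterances)


config = TrainConfig()
# Hypothetical corpus of 10,000 utterances: ceil(10000 / 32) = 313 steps per epoch.
print(config.steps_per_epoch(10_000))  # 313
print(config.total_steps(10_000))      # 313000
```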
### Training Pipeline

1. **Data Preparation** (`training/prepare_dataset.py`)
   - Download audio datasets
   - Resample audio to 22050 Hz
   - Generate text transcriptions
   - Create train/val splits
2. **Model Training** (`training/train_vits.py`)
   - Train the VITS model with character-level tokenization
   - Multi-speaker training with speaker embeddings
   - Mixed-precision training for efficiency
3. **Model Export** (`training/export_model.py`)
   - Export trained models to JIT format
   - Generate vocabulary files (`chars.txt`)
   - Package for inference

See the `training/` directory for the full training scripts and configurations.
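
Two of the pipeline steps, the train/val split and character-level tokenization, are simple enough to sketch with the standard library. The code below is not taken from `training/prepare_dataset.py` or `src/tokenizer.py`; the function names, the 10% validation fraction, and the sorted vocabulary order are all assumptions made for illustration.

```python
import random


def build_vocab(transcripts):
    """Map every distinct character in the transcripts to an integer ID."""
    chars = sorted({ch for line in transcripts for ch in line})
    return {ch: idx for idx, ch in enumerate(chars)}


def encode(text, vocab):
    """Convert text to a list of character IDs, skipping unknown characters."""
    return [vocab[ch] for ch in text if ch in vocab]


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


vocab = build_vocab(["ab", "ba c"])
print(encode("cab", vocab))  # [3, 1, 2]
```

Python iterates strings by Unicode code point, so in Devanagari and the other Indic scripts each consonant, vowel sign, and virama becomes its own token under this scheme.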
## Project Structure

```
VoiceAPI/
├── app.py                  # Application entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   ├── tokenizer.py        # Text tokenization
│   └── model_loader.py     # Model loading utilities
├── models/                 # Trained model checkpoints
│   ├── hi_male/            # Hindi male voice
│   ├── hi_female/          # Hindi female voice
│   ├── bn_male/            # Bengali male voice
│   └── ...                 # Other voices
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```
## License

- **Code**: MIT License
- **Models**: CC BY 4.0
- **Datasets**: Individual licenses (see `training/datasets.csv`)
## Acknowledgments

- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for Indian language speech research
- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for multilingual TTS
- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources
- [OpenSLR](https://www.openslr.org/) for speech datasets
## 📧 Contact

Built for the **Voice Tech for All** Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.