	---
	title: VoiceAPI
	emoji: 🎙️
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	license: mit
	tags:
	- tts
	- text-to-speech
	- indian-languages
	- vits
	- multilingual
	- speech-synthesis
	---

	# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

	A multi-speaker, multilingual text-to-speech (TTS) synthesizer supporting 11 Indian languages with 21 voice options.

	Live API: [https://huggingface.co/proxy/harshil748-voiceapi.hf.space](https://huggingface.co/proxy/harshil748-voiceapi.hf.space)

	## 🌟 Features

	- 11 Indian Languages: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
	- 21 Voice Options: Male and female voices for every language except Gujarati (female only)
	- High-Quality Audio: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
	- REST API: Simple GET/POST endpoints for easy integration
	- Real-time Synthesis: Fast inference on CPU/GPU

	## 🗣️ Supported Languages

	\| Language \| Code \| Female \| Male \| Script \|
	\|----------\|------\|--------\|------\|--------\|
	\| Hindi \| hi \| ✅ \| ✅ \| देवनागरी \|
	\| Bengali \| bn \| ✅ \| ✅ \| বাংলা \|
	\| Marathi \| mr \| ✅ \| ✅ \| देवनागरी \|
	\| Telugu \| te \| ✅ \| ✅ \| తెలుగు \|
	\| Kannada \| kn \| ✅ \| ✅ \| ಕನ್ನಡ \|
	\| Gujarati \| gu \| ✅ \| - \| ગુજરાતી \|
	\| Bhojpuri \| bho \| ✅ \| ✅ \| देवनागरी \|
	\| Chhattisgarhi \| hne \| ✅ \| ✅ \| देवनागरी \|
	\| Maithili \| mai \| ✅ \| ✅ \| देवनागरी \|
	\| Magahi \| mag \| ✅ \| ✅ \| देवनागरी \|
	\| English \| en \| ✅ \| ✅ \| Latin \|
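The table above can be mirrored on the client side to reject an unsupported language before a request is sent. The following is a minimal sketch: the `VOICES` dictionary and the `available_voices` helper are hypothetical names introduced for illustration, and the lowercase language names are assumed from the `lang` examples in the parameter table rather than taken from the VoiceAPI source.

```python
# Hypothetical client-side mirror of the Supported Languages table.
# Keys are assumed `lang` values; values are (language code, listed voices).
VOICES = {
    "hindi": ("hi", ("female", "male")),
    "bengali": ("bn", ("female", "male")),
    "marathi": ("mr", ("female", "male")),
    "telugu": ("te", ("female", "male")),
    "kannada": ("kn", ("female", "male")),
    "gujarati": ("gu", ("female",)),  # no male voice is listed
    "bhojpuri": ("bho", ("female", "male")),
    "chhattisgarhi": ("hne", ("female", "male")),
    "maithili": ("mai", ("female", "male")),
    "magahi": ("mag", ("female", "male")),
    "english": ("en", ("female", "male")),
}


def available_voices(lang: str) -> tuple[str, ...]:
    """Return the voices listed for a language name (case-insensitive)."""
    try:
        return VOICES[lang.strip().lower()][1]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang!r}") from None


# 10 languages with two voices plus Gujarati with one.
total_voices = sum(len(voices) for _, voices in VOICES.values())
print(total_voices)  # 21
```

Keeping the table in one dictionary means the voice count and the validation logic cannot drift apart.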

	## 📡 API Usage

	### Endpoint

	\`\`\`
	GET/POST /Get_Inference
	\`\`\`

	### Parameters

	\| Parameter \| Type \| Required \| Description \|
	\|-----------\|------\|----------\|-------------\|
	\| \`text\` \| string \| Yes \| Text to synthesize (lowercase for English) \|
	\| \`lang\` \| string \| Yes \| Language name (hindi, bengali, etc.) \|
	\| \`speaker_wav\` \| file \| Yes \| Reference WAV file (for API compatibility) \|

	### Example (Python)

	\`\`\`python
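	import requests

	base_url = 'https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference'
	wav_path = 'reference.wav'

	params = {
	    'text': 'नमस्ते, आप कैसे हैं?',
	    'lang': 'hindi',
	}

	# POST is used because the request carries a multipart file body;
	# the endpoint is documented as accepting both GET and POST.
	with open(wav_path, 'rb') as audio_file:
	    response = requests.post(
	        base_url,
	        params=params,
	        files={'speaker_wav': audio_file},
	        timeout=120,
	    )

	if response.status_code == 200:
	    # The response body is the synthesized audio.
	    with open('output.wav', 'wb') as f:
	        f.write(response.content)
	    print("Audio saved as 'output.wav'")
	else:
	    print(f"Request failed with status {response.status_code}")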
	\`\`\`

	### Example (cURL)

	\`\`\`bash
	curl -X POST "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \\
	-F "speaker_wav=@reference.wav" \\
	-o output.wav
	\`\`\`
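Passing `text` directly in the query string works for plain ASCII, but spaces, punctuation, and text in Indic scripts need percent-encoding. The sketch below builds the URL with the standard library only; `build_inference_url` is a hypothetical helper name, and lowercasing `lang` is an assumption based on the parameter examples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://huggingface.co/proxy/harshil748-voiceapi.hf.space/Get_Inference"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return the endpoint URL with `text` and `lang` percent-encoded as UTF-8."""
    if not text.strip():
        raise ValueError("text must not be empty")
    query = urlencode({"text": text, "lang": lang.lower()})
    return f"{base_url}?{query}"


url = build_inference_url("नमस्ते, आप कैसे हैं?", "hindi")

# Decoding the query string recovers the original values unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["lang"][0])  # hindi
```

The resulting URL can be passed to cURL in place of the hand-written one, with the `-F` and `-o` options unchanged.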

	## 🏗️ Model Architecture

	- Base Model: VITS (Variational Inference with adversarial learning for Text-to-Speech)
	- Encoder: Transformer-based text encoder (6 layers, 192 hidden channels)
	- Decoder: HiFi-GAN neural vocoder
	- Duration Predictor: Stochastic duration predictor for natural prosody
	- Sample Rate: 22050 Hz (16000 Hz for Gujarati)
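Because the sample rate differs for Gujarati, a client may want to check the header of the returned audio before playing or concatenating it. The following is a minimal sketch using the standard `wave` module, assuming the API returns uncompressed PCM WAV data; the helper names are hypothetical, and a silent clip generated in memory stands in for a real response.

```python
import io
import wave

# Sample rates as stated in the Model Architecture list.
DEFAULT_SAMPLE_RATE = 22050
SAMPLE_RATE_OVERRIDES = {"gujarati": 16000}


def expected_sample_rate(lang: str) -> int:
    """Return the sample rate this README states for a language."""
    return SAMPLE_RATE_OVERRIDES.get(lang.lower(), DEFAULT_SAMPLE_RATE)


def wav_info(wav_bytes: bytes) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) read from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


def make_silence(rate: int, seconds: float) -> bytes:
    """Build a mono 16-bit silent WAV in memory (stand-in for API output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


rate, duration = wav_info(make_silence(expected_sample_rate("gujarati"), 0.5))
print(rate, duration)  # 16000 0.5
```

In real use, `response.content` from the API call would replace the `make_silence(...)` argument.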

	## 📊 Training

	### Datasets Used

	\| Dataset \| Languages \| Hours \| Source \| License \|
	\|---------\|-----------\|-------\|--------\|---------\|
	\| OpenSLR-103 \| Hindi \| 24h \| [OpenSLR](https://www.openslr.org/103/) \| CC BY 4.0 \|
	\| OpenSLR-37 \| Bengali \| 22h \| [OpenSLR](https://www.openslr.org/37/) \| CC BY 4.0 \|
	\| OpenSLR-64 \| Marathi \| 30h \| [OpenSLR](https://www.openslr.org/64/) \| CC BY 4.0 \|
	\| OpenSLR-66 \| Telugu \| 28h \| [OpenSLR](https://www.openslr.org/66/) \| CC BY 4.0 \|
	\| OpenSLR-79 \| Kannada \| 26h \| [OpenSLR](https://www.openslr.org/79/) \| CC BY 4.0 \|
	\| OpenSLR-78 \| Gujarati \| 25h \| [OpenSLR](https://www.openslr.org/78/) \| CC BY 4.0 \|
	\| Common Voice \| Hindi, Bengali \| 50h+ \| [Mozilla](https://commonvoice.mozilla.org/) \| CC0 \|
	\| IndicTTS \| Multiple \| 100h+ \| [IIT Madras](https://www.iitm.ac.in/donlab/tts/) \| Research \|
	\| Indic-Voices \| Multiple \| 200h+ \| [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) \| CC BY 4.0 \|

	### Training Configuration

	- Epochs: 1000
	- Batch Size: 32
	- Learning Rate: 2e-4
	- Optimizer: AdamW
	- FP16 Training: Enabled
	- Hardware: NVIDIA V100/A100 GPUs
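To make the scale of these settings concrete, the sketch below captures them in a small dataclass and derives the number of optimizer steps. The class and field names are illustrative and are not taken from `training/configs/`, and the corpus size of 10,000 utterances is a made-up figure used only for the arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed above; field names are illustrative."""

    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def steps_per_epoch(self, num_utterances: int) -> int:
        """Optimizer steps per epoch, counting a final partial batch."""
        return math.ceil(num_utterances / self.batch_size)

    def total_steps(self, num_utterances: int) -> int:
        """Optimizer steps over the whole run."""
        return self.epochs * self.steps_per_epoch(num_utterances)


config = TrainingConfig()
# 10,000 utterances is a hypothetical corpus size, not a figure from this project.
print(config.steps_per_epoch(10_000))  # 313
print(config.total_steps(10_000))      # 313000
```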

	### Training Pipeline

	1. Data Preparation (\`training/prepare_dataset.py\`)
	   - Download audio datasets
	   - Normalize audio to 22050 Hz
	   - Generate text transcriptions
	   - Create train/val splits

	2. Model Training (\`training/train_vits.py\`)
	   - Train VITS model with character-level tokenization
	   - Multi-speaker training with speaker embeddings
	   - Mixed precision training for efficiency

	3. Model Export (\`training/export_model.py\`)
	   - Export trained models to JIT format
	   - Generate vocabulary files (chars.txt)
	   - Package for inference

	See \`training/\` directory for full training scripts and configurations.
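Two of the pipeline steps, creating train/val splits and building a character-level vocabulary, can be sketched in plain Python. This is an illustration under stated assumptions and not the code in `training/`: the function names, the validation fraction, the seed, and the sample transcripts are all hypothetical. The layout of the exported `chars.txt` is not described here, so the sketch stops at the in-memory vocabulary.

```python
import random


def split_train_val(items, val_fraction=0.05, seed=1234):
    """Shuffle deterministically and split into (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def build_vocab(transcripts):
    """Collect the sorted set of characters seen in the transcripts."""
    return sorted({ch for text in transcripts for ch in text})


def encode(text, vocab):
    """Map each character to its index in the vocabulary."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in text]


# Made-up (audio path, transcript) pairs standing in for a real dataset.
phrases = ["नमस्ते", "आप कैसे हैं", "धन्यवाद", "शुभ रात्रि"] * 10
samples = [(f"wavs/{i:04d}.wav", text) for i, text in enumerate(phrases)]

train, val = split_train_val(samples, val_fraction=0.1)
vocab = build_vocab(text for _, text in samples)
token_ids = encode("नमस्ते", vocab)

print(len(train), len(val))  # 36 4
```

A fixed seed keeps the split reproducible across runs, so validation utterances never leak into training when the script is rerun.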

	## 📁 Project Structure

	\`\`\`
	VoiceAPI/
	├── app.py                  # Application entry point
	├── Dockerfile              # Docker configuration
	├── requirements.txt        # Python dependencies
	├── src/
	│   ├── api.py              # FastAPI REST server
	│   ├── engine.py           # TTS inference engine
	│   ├── config.py           # Voice configurations
	│   ├── tokenizer.py        # Text tokenization
	│   └── model_loader.py     # Model loading utilities
	├── models/                 # Trained model checkpoints
	│   ├── hi_male/            # Hindi male voice
	│   ├── hi_female/          # Hindi female voice
	│   ├── bn_male/            # Bengali male voice
	│   └── ...                 # Other voices
	└── training/
	    ├── train_vits.py       # VITS training script
	    ├── prepare_dataset.py  # Data preparation
	    ├── export_model.py     # Model export
	    ├── datasets.csv        # Dataset links
	    └── configs/            # Training configs
	\`\`\`

	## 📜 License

	- Code: MIT License
	- Models: CC BY 4.0
	- Datasets: Individual licenses (see training/datasets.csv)

	## 🙏 Acknowledgments

	- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for Indian language speech research
	- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for multilingual TTS
	- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
	- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources
	- [OpenSLR](https://www.openslr.org/) for speech datasets

	## 📧 Contact

	Built for the Voice Tech for All Hackathon: multilingual TTS for healthcare assistants serving low-income communities.