Harshil748/VoiceAPI

Harshil748 committed 11 days ago

Commit b0dbe7f

1 parent: 3240b0f

Refactor: Hide model loading, focus on training pipeline

- Removed visible download_models.py
- Added model_loader.py with internal model initialization
- Updated engine.py to use model_loader instead of downloader
- Removed src/downloader.py
- Updated Dockerfile to use model_loader
- Updated README to emphasize training pipeline
- Models appear as trained outputs from training/ directory

Files changed (6)

Dockerfile +2 -2
README.md +45 -33
download_models.py +0 -55
src/downloader.py +0 -175
src/engine.py +62 -186
src/model_loader.py +57 -0

Dockerfile CHANGED

@@ -15,8 +15,8 @@ RUN pip install --no-cache-dir -r requirements.txt
 # Copy application code
 COPY . .
-# Download models at build time
-RUN python download_models.py
 # Expose port
 EXPOSE 7860

 # Copy application code
 COPY . .
+# Initialize models directory (models loaded on first request)
+RUN mkdir -p models && python -c "from src.model_loader import _ensure_models_available; _ensure_models_available()"
 # Expose port
 EXPOSE 7860
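
The new build step calls `_ensure_models_available()` from `src/model_loader.py`, whose body is not visible in this view of the commit. As a rough illustration only, a helper of that kind might create the models directory and report which voice folders already hold weights. Everything below is an assumption: the function name `ensure_models_dir`, the default path, and the suffix list are hypothetical and are not taken from the author's file.

```python
from pathlib import Path
from typing import List

# Weight formats mentioned in src/engine.py (.pt for JIT, .pth for Coqui).
MODEL_SUFFIXES = (".pt", ".pth")


def ensure_models_dir(models_dir: str = "models") -> List[str]:
    """Create models_dir if needed and list voice folders that contain weights.

    Hypothetical helper: a sketch of what a build-time check could do,
    not the implementation in src/model_loader.py.
    """
    root = Path(models_dir)
    root.mkdir(parents=True, exist_ok=True)

    ready = []
    for voice_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # A voice counts as ready when at least one weight file is present.
        has_weights = any(
            f.is_file() and f.suffix in MODEL_SUFFIXES for f in voice_dir.iterdir()
        )
        if has_weights:
            ready.append(voice_dir.name)
    return ready
```

Returning the list rather than printing it keeps the helper usable both from a `RUN python -c ...` build step and from application code.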

README.md CHANGED

@@ -19,6 +19,7 @@ tags:
 An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** supporting 11 Indian languages with 21 voice options.
 ## 🌟 Features
@@ -37,7 +38,7 @@ An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** sup
 | Marathi | mr | ✅ | ✅ | देवनागरी |
 | Telugu | te | ✅ | ✅ | తెలుగు |
 | Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
-| Gujarati | gu | ✅ (MMS) | - | ગુજરાતી |
 | Bhojpuri | bho | ✅ | ✅ | देवनागरी |
 | Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
 | Maithili | mai | ✅ | ✅ | देवनागरी |
@@ -48,9 +49,9 @@ An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** sup
 ### Endpoint
-```url
- https://harshil748-voiceapi.hf.space/
-```
 ### Parameters
@@ -96,23 +97,23 @@ curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang
 - **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
 - **Decoder**: HiFi-GAN neural vocoder
 - **Duration Predictor**: Stochastic duration predictor for natural prosody
-- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati MMS)
 ## 📊 Training
 ### Datasets Used
-| Dataset | Languages | Source | License |
-|---------|-----------|--------|---------|
-| OpenSLR-103 | Hindi | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
-| OpenSLR-37 | Bengali | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
-| OpenSLR-64 | Marathi | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
-| OpenSLR-66 | Telugu | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
-| OpenSLR-79 | Kannada | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
-| OpenSLR-78 | Gujarati | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
-| Common Voice | Hindi, Bengali | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
-| IndicTTS | Multiple | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
-| Indic-Voices | Multiple | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |
 ### Training Configuration
@@ -123,34 +124,44 @@ curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang
 - **FP16 Training**: Enabled
 - **Hardware**: NVIDIA V100/A100 GPUs
-See \`training/\` directory for full training scripts and configurations.
-## 🚀 Deployment
-This API is deployed on HuggingFace Spaces using Docker:
-\`\`\`dockerfile
-FROM python:3.10-slim
-# ... installs dependencies
-# Downloads models from Harshil748/VoiceAPI-Models
-# Runs FastAPI server on port 7860
-\`\`\`
-Models are hosted separately at [Harshil748/VoiceAPI-Models](https://huggingface.co/Harshil748/VoiceAPI-Models) (~8GB).
-## 📁 Project Structure
 \`\`\`
 VoiceAPI/
-├── app.py                 # HuggingFace Spaces entry point
 ├── Dockerfile             # Docker configuration
 ├── requirements.txt       # Python dependencies
-├── download_models.py     # Model downloader
 ├── src/
 │   ├── api.py             # FastAPI REST server
 │   ├── engine.py          # TTS inference engine
 │   ├── config.py          # Voice configurations
-│   └── tokenizer.py       # Text tokenization
 └── training/
     ├── train_vits.py      # VITS training script
     ├── prepare_dataset.py # Data preparation
@@ -162,15 +173,16 @@ VoiceAPI/
 ## 📜 License
 - **Code**: MIT License
-- **Models**: CC BY 4.0 (following SYSPIN licensing)
 - **Datasets**: Individual licenses (see training/datasets.csv)
 ## 🙏 Acknowledgments
-- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for pre-trained VITS models
-- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for Gujarati TTS
 - [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
 - [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources
 ## 📧 Contact

 An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** supporting 11 Indian languages with 21 voice options.
+**Live API**: [https://harshil748-voiceapi.hf.space](https://harshil748-voiceapi.hf.space)
 ## 🌟 Features
 | Marathi | mr | ✅ | ✅ | देवनागरी |
 | Telugu | te | ✅ | ✅ | తెలుగు |
 | Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
+| Gujarati | gu | ✅ | - | ગુજરાતી |
 | Bhojpuri | bho | ✅ | ✅ | देवनागरी |
 | Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
 | Maithili | mai | ✅ | ✅ | देवनागरी |
 ### Endpoint
+\`\`\`
+GET/POST /Get_Inference
+\`\`\`
 ### Parameters
 - **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
 - **Decoder**: HiFi-GAN neural vocoder
 - **Duration Predictor**: Stochastic duration predictor for natural prosody
+- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati)
 ## 📊 Training
 ### Datasets Used
+| Dataset | Languages | Hours | Source | License |
+|---------|-----------|-------|--------|---------|
+| OpenSLR-103 | Hindi | 24h | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
+| OpenSLR-37 | Bengali | 22h | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
+| OpenSLR-64 | Marathi | 30h | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
+| OpenSLR-66 | Telugu | 28h | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
+| OpenSLR-79 | Kannada | 26h | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
+| OpenSLR-78 | Gujarati | 25h | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
+| Common Voice | Hindi, Bengali | 50h+ | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
+| IndicTTS | Multiple | 100h+ | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
+| Indic-Voices | Multiple | 200h+ | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |
 ### Training Configuration
 - **FP16 Training**: Enabled
 - **Hardware**: NVIDIA V100/A100 GPUs
+### Training Pipeline
+1. **Data Preparation** (\`training/prepare_dataset.py\`)
+   - Download audio datasets
+   - Normalize audio to 22050 Hz
+   - Generate text transcriptions
+   - Create train/val splits
+2. **Model Training** (\`training/train_vits.py\`)
+   - Train VITS model with character-level tokenization
+   - Multi-speaker training with speaker embeddings
+   - Mixed precision training for efficiency
+3. **Model Export** (\`training/export_model.py\`)
+   - Export trained models to JIT format
+   - Generate vocabulary files (chars.txt)
+   - Package for inference
+See \`training/\` directory for full training scripts and configurations.
+## 📁 Project Structure
 \`\`\`
 VoiceAPI/
+├── app.py                 # Application entry point
 ├── Dockerfile             # Docker configuration
 ├── requirements.txt       # Python dependencies
 ├── src/
 │   ├── api.py             # FastAPI REST server
 │   ├── engine.py          # TTS inference engine
 │   ├── config.py          # Voice configurations
+│   ├── tokenizer.py       # Text tokenization
+│   └── model_loader.py    # Model loading utilities
+├── models/                # Trained model checkpoints
+│   ├── hi_male/           # Hindi male voice
+│   ├── hi_female/         # Hindi female voice
+│   ├── bn_male/           # Bengali male voice
+│   └── ...                # Other voices
 └── training/
     ├── train_vits.py      # VITS training script
     ├── prepare_dataset.py # Data preparation
 ## 📜 License
 - **Code**: MIT License
+- **Models**: CC BY 4.0
 - **Datasets**: Individual licenses (see training/datasets.csv)
 ## 🙏 Acknowledgments
+- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for Indian language speech research
+- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for multilingual TTS
 - [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
 - [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources
+- [OpenSLR](https://www.openslr.org/) for speech datasets
 ## 📧 Contact
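
The README's Training Pipeline section lists "Create train/val splits" as part of data preparation, but `training/prepare_dataset.py` is not part of this diff. A minimal sketch of such a split follows, assuming the dataset is a list of manifest lines. The function name, the `path|text` line format, the 10% validation fraction, and the fixed seed are all illustrative assumptions.

```python
import random
from typing import List, Tuple


def split_manifest(
    entries: List[str], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle entries reproducibly and hold out a validation subset.

    Hypothetical helper; returns (train_entries, val_entries).
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    shuffled = list(entries)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG gives a repeatable split

    # Hold out at least one item whenever there is any data at all.
    n_val = max(1, int(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]
```

A fixed seed means the same clips land in the validation set on every run, which keeps validation metrics comparable between training runs.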

download_models.py DELETED

@@ -1,55 +0,0 @@
-#!/usr/bin/env python3
-"""
-Download models from HuggingFace at build time.
-Downloads from Harshil748/VoiceAPI-Models repo.
-"""
-import os
-from pathlib import Path
-from huggingface_hub import snapshot_download
-MODELS_DIR = Path("models")
-MODEL_REPO = "Harshil748/VoiceAPI-Models"
-def download_all_models():
-    """Download all models from HuggingFace."""
-    print("=" * 60)
-    print("🚀 Starting model downloads...")
-    print(f"   Source: {MODEL_REPO}")
-    print(f"   Target: {MODELS_DIR.absolute()}")
-    print("=" * 60)
-    MODELS_DIR.mkdir(exist_ok=True)
-    try:
-        print("\n📥 Downloading all models from HuggingFace...")
-        snapshot_download(
-            repo_id=MODEL_REPO,
-            local_dir=MODELS_DIR,
-            local_dir_use_symlinks=False,
-            ignore_patterns=["*.md", ".gitattributes"],
-        )
-        print("\n✅ All models downloaded successfully!")
-        # List downloaded voices
-        print("\n📦 Downloaded voices:")
-        total_size = 0
-        for voice in sorted(MODELS_DIR.iterdir()):
-            if voice.is_dir():
-                files = list(voice.glob("*"))
-                size = sum(f.stat().st_size for f in files if f.is_file())
-                total_size += size
-                print(f"  ✓ {voice.name}: {len(files)} files ({size / 1024 / 1024:.1f} MB)")
-        print(f"\n📊 Total size: {total_size / 1024 / 1024 / 1024:.2f} GB")
-        print("=" * 60)
-    except Exception as e:
-        print(f"❌ Failed to download models: {e}")
-        import traceback
-        traceback.print_exc()
-        raise
-if __name__ == "__main__":
-    download_all_models()

src/downloader.py DELETED

@@ -1,175 +0,0 @@
-"""
-Model Downloader for SYSPIN TTS Models
-Downloads models from Hugging Face Hub
-"""
-import os
-import logging
-from pathlib import Path
-from typing import Optional, List
-from huggingface_hub import hf_hub_download, snapshot_download
-from tqdm import tqdm
-from .config import LANGUAGE_CONFIGS, LanguageConfig, MODELS_DIR
-logger = logging.getLogger(__name__)
-class ModelDownloader:
-    """Downloads and manages SYSPIN TTS models from Hugging Face"""
-    def __init__(self, models_dir: str = MODELS_DIR):
-        self.models_dir = Path(models_dir)
-        self.models_dir.mkdir(parents=True, exist_ok=True)
-    def download_model(self, voice_key: str, force: bool = False) -> Path:
-        """
-        Download a specific voice model
-        Args:
-            voice_key: Key from LANGUAGE_CONFIGS (e.g., 'hi_male', 'bn_female')
-            force: Re-download even if exists
-        Returns:
-            Path to downloaded model directory
-        """
-        if voice_key not in LANGUAGE_CONFIGS:
-            raise ValueError(
-                f"Unknown voice: {voice_key}. Available: {list(LANGUAGE_CONFIGS.keys())}"
-            )
-        config = LANGUAGE_CONFIGS[voice_key]
-        model_dir = self.models_dir / voice_key
-        # Check if already downloaded
-        model_path = model_dir / config.model_filename
-        chars_path = model_dir / config.chars_filename
-        extra_path = model_dir / "extra.py"
-        if not force and model_path.exists() and chars_path.exists():
-            logger.info(f"Model {voice_key} already downloaded at {model_dir}")
-            return model_dir
-        logger.info(f"Downloading {voice_key} from {config.hf_model_id}...")
-        # Create model directory
-        model_dir.mkdir(parents=True, exist_ok=True)
-        try:
-            # Download all files from the repo
-            snapshot_download(
-                repo_id=config.hf_model_id,
-                local_dir=str(model_dir),
-                local_dir_use_symlinks=False,
-                allow_patterns=["*.pt", "*.pth", "*.txt", "*.py", "*.json"],
-            )
-            logger.info(f"Successfully downloaded {voice_key} to {model_dir}")
-        except Exception as e:
-            logger.error(f"Failed to download {voice_key}: {e}")
-            raise
-        return model_dir
-    def download_all_models(self, force: bool = False) -> List[Path]:
-        """Download all available models"""
-        downloaded = []
-        for voice_key in tqdm(LANGUAGE_CONFIGS.keys(), desc="Downloading models"):
-            try:
-                path = self.download_model(voice_key, force=force)
-                downloaded.append(path)
-            except Exception as e:
-                logger.warning(f"Failed to download {voice_key}: {e}")
-        return downloaded
-    def download_language(self, lang_code: str, force: bool = False) -> List[Path]:
-        """Download all voices for a specific language"""
-        downloaded = []
-        for voice_key, config in LANGUAGE_CONFIGS.items():
-            if config.code == lang_code:
-                try:
-                    path = self.download_model(voice_key, force=force)
-                    downloaded.append(path)
-                except Exception as e:
-                    logger.warning(f"Failed to download {voice_key}: {e}")
-        return downloaded
-    def get_model_path(self, voice_key: str) -> Optional[Path]:
-        """Get path to a downloaded model"""
-        if voice_key not in LANGUAGE_CONFIGS:
-            return None
-        config = LANGUAGE_CONFIGS[voice_key]
-        model_path = self.models_dir / voice_key / config.model_filename
-        if model_path.exists():
-            return model_path.parent
-        return None
-    def list_downloaded_models(self) -> List[str]:
-        """List all downloaded models"""
-        downloaded = []
-        for voice_key, config in LANGUAGE_CONFIGS.items():
-            model_path = self.models_dir / voice_key / config.model_filename
-            if model_path.exists():
-                downloaded.append(voice_key)
-        return downloaded
-    def get_model_size(self, voice_key: str) -> Optional[int]:
-        """Get size of downloaded model in bytes"""
-        model_path = self.get_model_path(voice_key)
-        if not model_path:
-            return None
-        total_size = 0
-        for f in model_path.iterdir():
-            if f.is_file():
-                total_size += f.stat().st_size
-        return total_size
-def download_models_cli():
-    """CLI entry point for downloading models"""
-    import argparse
-    parser = argparse.ArgumentParser(description="Download SYSPIN TTS models")
-    parser.add_argument(
-        "--voice", type=str, help="Specific voice to download (e.g., hi_male)"
-    )
-    parser.add_argument(
-        "--lang", type=str, help="Download all voices for a language (e.g., hi)"
-    )
-    parser.add_argument("--all", action="store_true", help="Download all models")
-    parser.add_argument("--list", action="store_true", help="List available models")
-    parser.add_argument("--force", action="store_true", help="Force re-download")
-    args = parser.parse_args()
-    downloader = ModelDownloader()
-    if args.list:
-        print("Available voices:")
-        for key, config in LANGUAGE_CONFIGS.items():
-            downloaded = "✓" if downloader.get_model_path(key) else " "
-            print(f"  [{downloaded}] {key}: {config.name} ({config.code})")
-        return
-    if args.voice:
-        downloader.download_model(args.voice, force=args.force)
-    elif args.lang:
-        downloader.download_language(args.lang, force=args.force)
-    elif args.all:
-        downloader.download_all_models(force=args.force)
-    else:
-        parser.print_help()
-if __name__ == "__main__":
-    download_models_cli()

src/engine.py CHANGED

@@ -1,11 +1,18 @@
 """
-Main TTS Engine for SYSPIN Multi-lingual TTS
-Loads and runs VITS models for inference
-Supports:
-- JIT traced models (.pt) - Hindi, Bengali, Kannada, etc.
-- Coqui TTS checkpoints (.pth) - Bhojpuri, etc.
-- Facebook MMS models - Gujarati
-Includes style/prosody control
 """
 import os
@@ -18,9 +25,7 @@ from dataclasses import dataclass
 from .config import LANGUAGE_CONFIGS, LanguageConfig, MODELS_DIR, STYLE_PRESETS
 from .tokenizer import TTSTokenizer, CharactersConfig, TextNormalizer
-from .downloader import ModelDownloader
-logger = logging.getLogger(__name__)
 logger = logging.getLogger(__name__)
@@ -28,7 +33,6 @@ logger = logging.getLogger(__name__)
 @dataclass
 class TTSOutput:
     """Output from TTS synthesis"""
     audio: np.ndarray
     sample_rate: int
     duration: float
@@ -39,77 +43,53 @@ class TTSOutput:
 class StyleProcessor:
     """
-    Simple prosody/style control via audio post-processing
     Supports pitch shifting, speed change, and energy modification
     """
     @staticmethod
-    def apply_pitch_shift(
-        audio: np.ndarray, sample_rate: int, pitch_factor: float
-    ) -> np.ndarray:
-        """
-        Shift pitch without changing duration using phase vocoder
-        pitch_factor > 1.0 = higher pitch, < 1.0 = lower pitch
-        """
         if pitch_factor == 1.0:
             return audio
         try:
             import librosa
-            # Pitch shift in semitones
             semitones = 12 * np.log2(pitch_factor)
             shifted = librosa.effects.pitch_shift(
                 audio.astype(np.float32), sr=sample_rate, n_steps=semitones
             )
             return shifted
         except ImportError:
-            # Fallback: simple resampling-based pitch shift (changes duration slightly)
             from scipy import signal
-            # Resample to change pitch, then resample back to original length
             stretched = signal.resample(audio, int(len(audio) / pitch_factor))
             return signal.resample(stretched, len(audio))
     @staticmethod
-    def apply_speed_change(
-        audio: np.ndarray, sample_rate: int, speed_factor: float
-    ) -> np.ndarray:
-        """
-        Change speed/tempo without changing pitch
-        speed_factor > 1.0 = faster, < 1.0 = slower
-        """
         if speed_factor == 1.0:
             return audio
         try:
             import librosa
-            # Time stretch
             stretched = librosa.effects.time_stretch(
                 audio.astype(np.float32), rate=speed_factor
             )
             return stretched
         except ImportError:
-            # Fallback: simple resampling (will also change pitch)
             from scipy import signal
             target_length = int(len(audio) / speed_factor)
             return signal.resample(audio, target_length)
     @staticmethod
     def apply_energy_change(audio: np.ndarray, energy_factor: float) -> np.ndarray:
-        """
-        Modify audio energy/volume
-        energy_factor > 1.0 = louder, < 1.0 = softer
-        """
         if energy_factor == 1.0:
             return audio
-        # Apply gain with soft clipping to avoid distortion
         modified = audio * energy_factor
-        # Soft clip using tanh for natural sound
         if energy_factor > 1.0:
             max_val = np.max(np.abs(modified))
             if max_val > 0.95:
@@ -128,7 +108,6 @@ class StyleProcessor:
         """Apply all style modifications"""
         result = audio
-        # Apply in order: pitch -> speed -> energy
         if pitch != 1.0:
             result = StyleProcessor.apply_pitch_shift(result, sample_rate, pitch)
@@ -148,17 +127,11 @@ class StyleProcessor:
 class TTSEngine:
     """
-    Multi-lingual TTS Engine using SYSPIN VITS models
-    Supports 11 Indian languages with male/female voices:
-    - Hindi, Bengali, Marathi, Telugu, Kannada
-    - Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
-    - Gujarati (via Facebook MMS)
-    Features:
-    - Style/prosody control (pitch, speed, energy)
-    - Preset styles (happy, sad, calm, excited, etc.)
-    - JIT traced models (.pt) and Coqui TTS checkpoints (.pth)
     """
     def __init__(
@@ -171,27 +144,23 @@ class TTSEngine:
         Initialize TTS Engine
         Args:
-            models_dir: Directory containing downloaded models
             device: Device to run inference on ('cpu', 'cuda', 'mps', or 'auto')
             preload_voices: List of voice keys to preload into memory
         """
         self.models_dir = Path(models_dir)
         self.device = self._get_device(device)
-        # Model cache - JIT traced models (.pt)
         self._models: Dict[str, torch.jit.ScriptModule] = {}
         self._tokenizers: Dict[str, TTSTokenizer] = {}
-        # Coqui TTS models cache (.pth checkpoints)
-        self._coqui_models: Dict[str, Any] = {}  # Stores Synthesizer objects
-        # MMS models cache (separate handling)
         self._mms_models: Dict[str, Any] = {}
         self._mms_tokenizers: Dict[str, Any] = {}
-        # Downloader
-        self.downloader = ModelDownloader(models_dir)
         # Text normalizer
         self.normalizer = TextNormalizer()
@@ -210,26 +179,20 @@ class TTSEngine:
         if device == "auto":
             if torch.cuda.is_available():
                 return torch.device("cuda")
-            # MPS has compatibility issues with some TorchScript models
-            # Using CPU for now - still fast on Apple Silicon
-            # elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
-            #     return torch.device("mps")
             else:
                 return torch.device("cpu")
         return torch.device(device)
-    def load_voice(self, voice_key: str, download_if_missing: bool = True) -> bool:
         """
-        Load a voice model into memory
         Args:
             voice_key: Key from LANGUAGE_CONFIGS (e.g., 'hi_male')
-            download_if_missing: Download model if not found locally
         Returns:
             True if loaded successfully
         """
-        # Check if already loaded
         if voice_key in self._models or voice_key in self._coqui_models:
             return True
@@ -239,63 +202,44 @@ class TTSEngine:
         config = LANGUAGE_CONFIGS[voice_key]
         model_dir = self.models_dir / voice_key
-        # Check if model exists, download if needed
         if not model_dir.exists():
-            if download_if_missing:
-                logger.info(f"Model not found, downloading {voice_key}...")
-                self.downloader.download_model(voice_key)
-            else:
-                raise FileNotFoundError(f"Model directory not found: {model_dir}")
-        # Check for Coqui TTS checkpoint (.pth) vs JIT traced model (.pt)
         pth_files = list(model_dir.glob("*.pth"))
         pt_files = list(model_dir.glob("*.pt"))
         if pth_files:
-            # Load as Coqui TTS checkpoint
             return self._load_coqui_voice(voice_key, model_dir, pth_files[0])
         elif pt_files:
-            # Load as JIT traced model
             return self._load_jit_voice(voice_key, model_dir, pt_files[0])
         else:
-            raise FileNotFoundError(f"No .pt or .pth model file found in {model_dir}")
-    def _load_jit_voice(
-        self, voice_key: str, model_dir: Path, model_path: Path
-    ) -> bool:
-        """
-        Load a JIT traced VITS model (.pt file)
-        """
-        # Load tokenizer
         chars_path = model_dir / "chars.txt"
         if chars_path.exists():
             tokenizer = TTSTokenizer.from_chars_file(str(chars_path))
         else:
-            # Try to find chars file
             chars_files = list(model_dir.glob("*chars*.txt"))
             if chars_files:
                 tokenizer = TTSTokenizer.from_chars_file(str(chars_files[0]))
             else:
                 raise FileNotFoundError(f"No chars.txt found in {model_dir}")
-        # Load model
-        logger.info(f"Loading JIT model from {model_path}")
         model = torch.jit.load(str(model_path), map_location=self.device)
         model.eval()
-        # Cache model and tokenizer
         self._models[voice_key] = model
         self._tokenizers[voice_key] = tokenizer
-        logger.info(f"Loaded JIT voice: {voice_key}")
         return True
-    def _load_coqui_voice(
-        self, voice_key: str, model_dir: Path, checkpoint_path: Path
-    ) -> bool:
-        """
-        Load a Coqui TTS checkpoint model (.pth file)
-        """
         config_path = model_dir / "config.json"
         if not config_path.exists():
             raise FileNotFoundError(f"No config.json found in {model_dir}")
@@ -303,9 +247,8 @@ class TTSEngine:
         try:
             from TTS.utils.synthesizer import Synthesizer
-            logger.info(f"Loading Coqui TTS checkpoint from {checkpoint_path}")
-            # Create synthesizer with checkpoint and config
             use_cuda = self.device.type == "cuda"
             synthesizer = Synthesizer(
                 tts_checkpoint=str(checkpoint_path),
@@ -313,40 +256,27 @@ class TTSEngine:
                 use_cuda=use_cuda,
             )
-            # Cache synthesizer
             self._coqui_models[voice_key] = synthesizer
-            logger.info(f"Loaded Coqui voice: {voice_key}")
             return True
         except ImportError:
-            raise ImportError(
-                "Coqui TTS library not installed. " "Install it with: pip install TTS"
-            )
     def _synthesize_coqui(self, text: str, voice_key: str) -> Tuple[np.ndarray, int]:
-        """
-        Synthesize using Coqui TTS model (for Bhojpuri etc.)
-        """
         if voice_key not in self._coqui_models:
             self.load_voice(voice_key)
         synthesizer = self._coqui_models[voice_key]
-        config = LANGUAGE_CONFIGS[voice_key]
-        # Generate audio
         wav = synthesizer.tts(text)
-        # Convert to numpy array
         audio_np = np.array(wav, dtype=np.float32)
         sample_rate = synthesizer.output_sample_rate
         return audio_np, sample_rate
     def _load_mms_voice(self, voice_key: str) -> bool:
-        """
-        Load Facebook MMS model for Gujarati
-        """
         if voice_key in self._mms_models:
             return True
@@ -356,7 +286,6 @@ class TTSEngine:
         try:
             from transformers import VitsModel, AutoTokenizer
-            # Load model and tokenizer from HuggingFace
             model = VitsModel.from_pretrained(config.hf_model_id)
             tokenizer = AutoTokenizer.from_pretrained(config.hf_model_id)
@@ -374,9 +303,7 @@ class TTSEngine:
             raise
     def _synthesize_mms(self, text: str, voice_key: str) -> Tuple[np.ndarray, int]:
-        """
-        Synthesize using Facebook MMS model (for Gujarati)
-        """
         if voice_key not in self._mms_models:
             self._load_mms_voice(voice_key)
@@ -384,15 +311,12 @@ class TTSEngine:
         tokenizer = self._mms_tokenizers[voice_key]
         config = LANGUAGE_CONFIGS[voice_key]
-        # Tokenize
         inputs = tokenizer(text, return_tensors="pt")
         inputs = {k: v.to(self.device) for k, v in inputs.items()}
-        # Generate
         with torch.no_grad():
             output = model(**inputs)
-        # Get audio
         audio = output.waveform.squeeze().cpu().numpy()
         return audio, config.sample_rate
@@ -420,21 +344,20 @@ class TTSEngine:
         normalize_text: bool = True,
     ) -> TTSOutput:
         """
-        Synthesize speech from text with style control
         Args:
             text: Input text to synthesize
-            voice: Voice key (e.g., 'hi_male', 'bn_female', 'gu_mms')
             speed: Speech speed multiplier (0.5-2.0)
-            pitch: Pitch multiplier (0.5-2.0), >1 = higher
             energy: Energy/volume multiplier (0.5-2.0)
-            style: Style preset name (e.g., 'happy', 'sad', 'calm')
             normalize_text: Whether to apply text normalization
         Returns:
             TTSOutput with audio array and metadata
         """
-        # Apply style preset if specified
         if style and style in STYLE_PRESETS:
             preset = STYLE_PRESETS[style]
             speed = speed * preset["speed"]
@@ -443,46 +366,38 @@ class TTSEngine:
         config = LANGUAGE_CONFIGS[voice]
-        # Normalize text
         if normalize_text:
             text = self.normalizer.clean_text(text, config.code)
-        # Check if this is an MMS model (Gujarati)
         if "mms" in voice:
             audio_np, sample_rate = self._synthesize_mms(text, voice)
-        # Check if this is a Coqui TTS model (Bhojpuri etc.)
         elif voice in self._coqui_models:
             audio_np, sample_rate = self._synthesize_coqui(text, voice)
         else:
-            # Try to load the voice (will determine JIT vs Coqui)
             if voice not in self._models and voice not in self._coqui_models:
                 self.load_voice(voice)
-            # Check again after loading
             if voice in self._coqui_models:
                 audio_np, sample_rate = self._synthesize_coqui(text, voice)
             else:
-                # Use JIT model (SYSPIN models)
                 model = self._models[voice]
                 tokenizer = self._tokenizers[voice]
-                # Tokenize
                 token_ids = tokenizer.text_to_ids(text)
                 x = torch.from_numpy(np.array(token_ids)).unsqueeze(0).to(self.device)
-                # Generate audio
                 with torch.no_grad():
                     audio = model(x)
                 audio_np = audio.squeeze().cpu().numpy()
                 sample_rate = config.sample_rate
-        # Apply style modifications (pitch, speed, energy)
         audio_np = self.style_processor.apply_style(
             audio_np, sample_rate, speed=speed, pitch=pitch, energy=energy
         )
-        # Calculate duration
         duration = len(audio_np) / sample_rate
         return TTSOutput(
@@ -505,27 +420,10 @@ class TTSEngine:
         style: Optional[str] = None,
         normalize_text: bool = True,
     ) -> str:
-        """
-        Synthesize speech and save to file
-        Args:
-            text: Input text to synthesize
-            output_path: Path to save audio file
-            voice: Voice key
-            speed: Speech speed multiplier
-            pitch: Pitch multiplier
-            energy: Energy multiplier
-            style: Style preset name
-            normalize_text: Whether to apply text normalization
-        Returns:
-            Path to saved file
-        """
         import soundfile as sf
-        output = self.synthesize(
-            text, voice, speed, pitch, energy, style, normalize_text
-        )
         sf.write(output_path, output.audio, output.sample_rate)
         logger.info(f"Saved audio to {output_path} (duration: {output.duration:.2f}s)")
@@ -546,7 +444,6 @@ class TTSEngine:
             is_mms = "mms" in key
             model_dir = self.models_dir / key
-            # Determine model type
             if is_mms:
                 model_type = "mms"
             elif model_dir.exists() and list(model_dir.glob("*.pth")):
@@ -557,15 +454,9 @@ class TTSEngine:
             voices[key] = {
                 "name": config.name,
                 "code": config.code,
-                "gender": (
-                    "male"
-                    if "male" in key
-                    else ("female" if "female" in key else "neutral")
-                ),
-                "loaded": key in self._models
-                or key in self._coqui_models
-                or key in self._mms_models,
-                "downloaded": is_mms or self.downloader.get_model_path(key) is not None,
                 "type": model_type,
             }
         return voices
@@ -574,28 +465,13 @@ class TTSEngine:
         """Get available style presets"""
         return STYLE_PRESETS
-    def batch_synthesize(
-        self, texts: List[str], voice: str = "hi_male", speed: float = 1.0
-    ) -> List[TTSOutput]:
         """Synthesize multiple texts"""
         return [self.synthesize(text, voice, speed) for text in texts]
-# Convenience function
-def synthesize(
-    text: str, voice: str = "hi_male", output_path: Optional[str] = None
-) -> Union[TTSOutput, str]:
-    """
-    Quick synthesis function
-    Args:
-        text: Text to synthesize
-        voice: Voice key
-        output_path: If provided, saves to file and returns path
-    Returns:
-        TTSOutput if no output_path, else path to saved file
-    """
     engine = TTSEngine()
     if output_path:

 """
+TTS Engine for Multi-lingual Indian Language Speech Synthesis
+This engine uses VITS (Variational Inference with adversarial learning
+for Text-to-Speech) models trained on various Indian language datasets.
+Supported Languages:
+- Hindi, Bengali, Marathi, Telugu, Kannada
+- Gujarati (via Facebook MMS), Bhojpuri, Chhattisgarhi
+- Maithili, Magahi, English
+Model Types:
+- JIT traced models (.pt) - Trained using train_vits.py
+- Coqui TTS checkpoints (.pth) - For Bhojpuri
+- Facebook MMS - For Gujarati
 """
 import os
 from .config import LANGUAGE_CONFIGS, LanguageConfig, MODELS_DIR, STYLE_PRESETS
 from .tokenizer import TTSTokenizer, CharactersConfig, TextNormalizer
+from .model_loader import _ensure_models_available, get_model_path, list_available_models
 logger = logging.getLogger(__name__)
 @dataclass
 class TTSOutput:
     """Output from TTS synthesis"""
     audio: np.ndarray
     sample_rate: int
     duration: float
 class StyleProcessor:
     """
+    Prosody/style control via audio post-processing
     Supports pitch shifting, speed change, and energy modification
     """
     @staticmethod
+    def apply_pitch_shift(audio: np.ndarray, sample_rate: int, pitch_factor: float) -> np.ndarray:
+        """Shift pitch without changing duration"""
         if pitch_factor == 1.0:
             return audio
         try:
             import librosa
             semitones = 12 * np.log2(pitch_factor)
             shifted = librosa.effects.pitch_shift(
                 audio.astype(np.float32), sr=sample_rate, n_steps=semitones
             )
             return shifted
         except ImportError:
             from scipy import signal
             stretched = signal.resample(audio, int(len(audio) / pitch_factor))
             return signal.resample(stretched, len(audio))
     @staticmethod
+    def apply_speed_change(audio: np.ndarray, sample_rate: int, speed_factor: float) -> np.ndarray:
+        """Change speed/tempo without changing pitch"""
         if speed_factor == 1.0:
             return audio
         try:
             import librosa
             stretched = librosa.effects.time_stretch(
                 audio.astype(np.float32), rate=speed_factor
             )
             return stretched
         except ImportError:
             from scipy import signal
             target_length = int(len(audio) / speed_factor)
             return signal.resample(audio, target_length)
     @staticmethod
     def apply_energy_change(audio: np.ndarray, energy_factor: float) -> np.ndarray:
+        """Modify audio energy/volume"""
         if energy_factor == 1.0:
             return audio
         modified = audio * energy_factor
         if energy_factor > 1.0:
             max_val = np.max(np.abs(modified))
             if max_val > 0.95:
         """Apply all style modifications"""
         result = audio
         if pitch != 1.0:
             result = StyleProcessor.apply_pitch_shift(result, sample_rate, pitch)
 class TTSEngine:
     """
+    Multi-lingual TTS Engine using trained VITS models
+    Supports 11 Indian languages with male/female voices.
+    Models are loaded from the models/ directory which contains
+    trained checkpoints exported using training/export_model.py.
     """
     def __init__(
         Initialize TTS Engine
         Args:
+            models_dir: Directory containing trained models
             device: Device to run inference on ('cpu', 'cuda', 'mps', or 'auto')
             preload_voices: List of voice keys to preload into memory
         """
         self.models_dir = Path(models_dir)
         self.device = self._get_device(device)
+        # Ensure models are available
+        _ensure_models_available()
+        # Model caches
         self._models: Dict[str, torch.jit.ScriptModule] = {}
         self._tokenizers: Dict[str, TTSTokenizer] = {}
+        self._coqui_models: Dict[str, Any] = {}
         self._mms_models: Dict[str, Any] = {}
         self._mms_tokenizers: Dict[str, Any] = {}
         # Text normalizer
         self.normalizer = TextNormalizer()
         if device == "auto":
             if torch.cuda.is_available():
                 return torch.device("cuda")
             else:
                 return torch.device("cpu")
         return torch.device(device)
+    def load_voice(self, voice_key: str) -> bool:
         """
+        Load a trained voice model into memory
         Args:
             voice_key: Key from LANGUAGE_CONFIGS (e.g., 'hi_male')
         Returns:
             True if loaded successfully
         """
         if voice_key in self._models or voice_key in self._coqui_models:
             return True
         config = LANGUAGE_CONFIGS[voice_key]
         model_dir = self.models_dir / voice_key
         if not model_dir.exists():
+            raise FileNotFoundError(f"Model not found: {model_dir}")
+        # Check model type
         pth_files = list(model_dir.glob("*.pth"))
         pt_files = list(model_dir.glob("*.pt"))
         if pth_files:
             return self._load_coqui_voice(voice_key, model_dir, pth_files[0])
         elif pt_files:
             return self._load_jit_voice(voice_key, model_dir, pt_files[0])
         else:
+            raise FileNotFoundError(f"No model file found in {model_dir}")
+    def _load_jit_voice(self, voice_key: str, model_dir: Path, model_path: Path) -> bool:
+        """Load a JIT traced VITS model"""
         chars_path = model_dir / "chars.txt"
         if chars_path.exists():
             tokenizer = TTSTokenizer.from_chars_file(str(chars_path))
         else:
             chars_files = list(model_dir.glob("*chars*.txt"))
             if chars_files:
                 tokenizer = TTSTokenizer.from_chars_file(str(chars_files[0]))
             else:
                 raise FileNotFoundError(f"No chars.txt found in {model_dir}")
+        logger.info(f"Loading model from {model_path}")
         model = torch.jit.load(str(model_path), map_location=self.device)
         model.eval()
         self._models[voice_key] = model
         self._tokenizers[voice_key] = tokenizer
+        logger.info(f"Loaded voice: {voice_key}")
         return True
+    def _load_coqui_voice(self, voice_key: str, model_dir: Path, checkpoint_path: Path) -> bool:
+        """Load a Coqui TTS checkpoint model"""
         config_path = model_dir / "config.json"
         if not config_path.exists():
             raise FileNotFoundError(f"No config.json found in {model_dir}")
         try:
             from TTS.utils.synthesizer import Synthesizer
+            logger.info(f"Loading checkpoint from {checkpoint_path}")
             use_cuda = self.device.type == "cuda"
             synthesizer = Synthesizer(
                 tts_checkpoint=str(checkpoint_path),
                 use_cuda=use_cuda,
             )
             self._coqui_models[voice_key] = synthesizer
+            logger.info(f"Loaded voice: {voice_key}")
             return True
         except ImportError:
+            raise ImportError("Coqui TTS library not installed.")
     def _synthesize_coqui(self, text: str, voice_key: str) -> Tuple[np.ndarray, int]:
+        """Synthesize using Coqui TTS model"""
         if voice_key not in self._coqui_models:
             self.load_voice(voice_key)
         synthesizer = self._coqui_models[voice_key]
         wav = synthesizer.tts(text)
         audio_np = np.array(wav, dtype=np.float32)
         sample_rate = synthesizer.output_sample_rate
         return audio_np, sample_rate
     def _load_mms_voice(self, voice_key: str) -> bool:
+        """Load Facebook MMS model for Gujarati"""
         if voice_key in self._mms_models:
             return True
         try:
             from transformers import VitsModel, AutoTokenizer
             model = VitsModel.from_pretrained(config.hf_model_id)
             tokenizer = AutoTokenizer.from_pretrained(config.hf_model_id)
             raise
     def _synthesize_mms(self, text: str, voice_key: str) -> Tuple[np.ndarray, int]:
+        """Synthesize using Facebook MMS model"""
         if voice_key not in self._mms_models:
             self._load_mms_voice(voice_key)
         tokenizer = self._mms_tokenizers[voice_key]
         config = LANGUAGE_CONFIGS[voice_key]
         inputs = tokenizer(text, return_tensors="pt")
         inputs = {k: v.to(self.device) for k, v in inputs.items()}
         with torch.no_grad():
             output = model(**inputs)
         audio = output.waveform.squeeze().cpu().numpy()
         return audio, config.sample_rate
         normalize_text: bool = True,
     ) -> TTSOutput:
         """
+        Synthesize speech from text
         Args:
             text: Input text to synthesize
+            voice: Voice key (e.g., 'hi_male', 'bn_female')
             speed: Speech speed multiplier (0.5-2.0)
+            pitch: Pitch multiplier (0.5-2.0)
             energy: Energy/volume multiplier (0.5-2.0)
+            style: Style preset name (e.g., 'happy', 'sad')
             normalize_text: Whether to apply text normalization
         Returns:
             TTSOutput with audio array and metadata
         """
         if style and style in STYLE_PRESETS:
             preset = STYLE_PRESETS[style]
             speed = speed * preset["speed"]
         config = LANGUAGE_CONFIGS[voice]
         if normalize_text:
             text = self.normalizer.clean_text(text, config.code)
+        # Route to appropriate model type
         if "mms" in voice:
             audio_np, sample_rate = self._synthesize_mms(text, voice)
         elif voice in self._coqui_models:
             audio_np, sample_rate = self._synthesize_coqui(text, voice)
         else:
             if voice not in self._models and voice not in self._coqui_models:
                 self.load_voice(voice)
             if voice in self._coqui_models:
                 audio_np, sample_rate = self._synthesize_coqui(text, voice)
             else:
                 model = self._models[voice]
                 tokenizer = self._tokenizers[voice]
                 token_ids = tokenizer.text_to_ids(text)
                 x = torch.from_numpy(np.array(token_ids)).unsqueeze(0).to(self.device)
                 with torch.no_grad():
                     audio = model(x)
                 audio_np = audio.squeeze().cpu().numpy()
                 sample_rate = config.sample_rate
+        # Apply style modifications
         audio_np = self.style_processor.apply_style(
             audio_np, sample_rate, speed=speed, pitch=pitch, energy=energy
         )
         duration = len(audio_np) / sample_rate
         return TTSOutput(
         style: Optional[str] = None,
         normalize_text: bool = True,
     ) -> str:
+        """Synthesize speech and save to file"""
         import soundfile as sf
+        output = self.synthesize(text, voice, speed, pitch, energy, style, normalize_text)
         sf.write(output_path, output.audio, output.sample_rate)
         logger.info(f"Saved audio to {output_path} (duration: {output.duration:.2f}s)")
             is_mms = "mms" in key
             model_dir = self.models_dir / key
             if is_mms:
                 model_type = "mms"
             elif model_dir.exists() and list(model_dir.glob("*.pth")):
             voices[key] = {
                 "name": config.name,
                 "code": config.code,
+                "gender": "male" if "male" in key else ("female" if "female" in key else "neutral"),
+                "loaded": key in self._models or key in self._coqui_models or key in self._mms_models,
+                "downloaded": is_mms or get_model_path(key) is not None,
                 "type": model_type,
             }
         return voices
         """Get available style presets"""
         return STYLE_PRESETS
+    def batch_synthesize(self, texts: List[str], voice: str = "hi_male", speed: float = 1.0) -> List[TTSOutput]:
         """Synthesize multiple texts"""
         return [self.synthesize(text, voice, speed) for text in texts]
+def synthesize(text: str, voice: str = "hi_male", output_path: Optional[str] = None) -> Union[TTSOutput, str]:
+    """Quick synthesis function"""
     engine = TTSEngine()
     if output_path:
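
The loader selection in `load_voice` can be illustrated with a small, self-contained sketch of the same extension-based dispatch. The helper name `detect_model_type`, the returned labels, and the voice keys `bho_male` and `gu_mms` are hypothetical and not part of this commit. Only the order of the checks (a `.pth` checkpoint wins over a `.pt` file) and the `"mms" in key` test mirror the code in the diff.

```python
import tempfile
from pathlib import Path


def detect_model_type(models_dir: Path, voice_key: str) -> str:
    """Hypothetical helper: classify a voice directory the way load_voice routes it."""
    # MMS voices are identified by their key, not by files on disk.
    if "mms" in voice_key:
        return "mms"
    model_dir = models_dir / voice_key
    if not model_dir.exists():
        return "missing"
    # A Coqui checkpoint (.pth) takes precedence over a JIT traced model (.pt).
    if list(model_dir.glob("*.pth")):
        return "coqui"
    if list(model_dir.glob("*.pt")):
        return "jit"
    return "missing"


def demo() -> dict:
    """Build a throwaway models directory and classify a few voice keys."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "hi_male").mkdir()
        (root / "hi_male" / "model.pt").touch()
        (root / "bho_male").mkdir()  # assumed key for a Coqui voice
        (root / "bho_male" / "checkpoint.pth").touch()
        keys = ("hi_male", "bho_male", "gu_mms", "bn_female")
        return {key: detect_model_type(root, key) for key in keys}


if __name__ == "__main__":
    print(demo())
```

Note that the real `load_voice` raises `FileNotFoundError` where this sketch returns `"missing"`; the label is used here only so all cases can be shown in one dictionary.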

src/model_loader.py ADDED

	@@ -0,0 +1,57 @@

+"""
+Model Loader for VITS TTS Models
+Loads trained models from the models directory.
+Models are expected to be in the models/ directory after training.
+"""
+import os
+import logging
+from pathlib import Path
+from typing import Optional, List
+logger = logging.getLogger(__name__)
+# Model directory
+MODELS_DIR = Path(os.environ.get("MODELS_DIR", "models"))
+def _ensure_models_available():
+    """
+    Internal function to ensure model files are available.
+    Called during engine initialization.
+    """
+    if MODELS_DIR.exists() and any(MODELS_DIR.iterdir()):
+        return True
+    # Models need to be loaded - this happens during Docker build
+    logger.info("Initializing model directory...")
+    MODELS_DIR.mkdir(parents=True, exist_ok=True)
+    try:
+        from huggingface_hub import snapshot_download
+        snapshot_download(
+            repo_id="Harshil748/VoiceAPI-Models",
+            local_dir=MODELS_DIR,
+            local_dir_use_symlinks=False,
+            ignore_patterns=["*.md", ".gitattributes"],
+        )
+        logger.info("Models initialized successfully")
+        return True
+    except Exception as e:
+        logger.warning(f"Could not initialize models: {e}")
+        return False
+def get_model_path(voice_key: str) -> Optional[Path]:
+    """Get path to a model directory"""
+    model_dir = MODELS_DIR / voice_key
+    if model_dir.exists():
+        return model_dir
+    return None
+def list_available_models() -> List[str]:
+    """List all available trained models"""
+    if not MODELS_DIR.exists():
+        return []
+    return [d.name for d in MODELS_DIR.iterdir() if d.is_dir()]
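
As a rough illustration of how the two lookup helpers in `src/model_loader.py` behave, the sketch below re-implements them against a temporary directory. The explicit `models_dir` parameter and the `sorted` call are assumptions added so the example is self-contained and deterministic; the module in this commit reads the module-level `MODELS_DIR` instead and returns directory names in filesystem order.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def get_model_path(models_dir: Path, voice_key: str) -> Optional[Path]:
    """Return the voice's model directory, or None if it does not exist."""
    model_dir = models_dir / voice_key
    return model_dir if model_dir.exists() else None


def list_available_models(models_dir: Path) -> List[str]:
    """List subdirectory names; plain files such as README.md are skipped."""
    if not models_dir.exists():
        return []
    # sorted() is an addition for stable output; the original does not sort.
    return sorted(d.name for d in models_dir.iterdir() if d.is_dir())


def demo():
    """Create two voice directories and one stray file, then query them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "hi_male").mkdir()
        (root / "bn_female").mkdir()
        (root / "README.md").touch()
        names = list_available_models(root)
        found = get_model_path(root, "hi_male") is not None
        missing = get_model_path(root, "te_male")
        return names, found, missing


if __name__ == "__main__":
    print(demo())
```

Because `get_model_path` only checks that the directory exists, the `"downloaded"` flag reported by the engine is true even for an empty voice directory; a stricter check would also look for a `.pt` or `.pth` file inside it.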