Gilbert-Whisper-Distil-FR-v0.2 - Distilled Model for Production and Research
Overview
Gilbert-Whisper-Distil-FR-v0.2 is a distilled version of Whisper Large V3, optimized for French speech recognition and designed for production deployment and research acceleration within the Gilbert project ecosystem. This model provides 2-4x faster inference while maintaining performance close to the full Large V3 model, making it ideal for real-time applications and cost-effective production systems.
Important Notice on Intellectual Property:
- This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing research and commercial use.
- All derivative models, fine-tuned variants, and specialized models developed from this distilled model as part of the Gilbert project are the exclusive intellectual property of Lexia France.
- While this model can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.
Research Context
The Gilbert project requires both high-precision baseline models (Gilbert-FR-Source) and production-optimized models for different deployment scenarios. This distilled model serves as the fast inference baseline for:
- Real-time transcription in meeting applications
- Cost-effective batch processing at scale
- Speculative decoding acceleration (2x speedup with identical output)
- Resource-constrained environments (limited GPU memory, edge devices)
This model complements the full-size Gilbert-FR-Source baseline, providing researchers and developers with a speed-optimized alternative for production deployments.
Model Details
Architecture
- Base Model: OpenAI Whisper Large V3 (distilled)
- Distillation Method: Patient teacher distillation with 2 decoder layers
- Encoder: Identical to Whisper Large V3 (unchanged, shared with teacher)
- Decoder: 2 layers (reduced from full model)
- Framework: Compatible with Hugging Face Transformers, OpenAI Whisper, Faster Whisper, Whisper.cpp, CTranslate2, ONNX Runtime, and MLX
- Model Size: ~3.0 GB at full precision; inference is cheaper than the file size suggests because the autoregressive decoder has only 2 layers
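The size figure above can be sanity-checked with simple arithmetic: weight size is roughly the parameter count times the bytes per parameter. The sketch below assumes about 756 million parameters, a figure typical of 2-decoder-layer distillations of Whisper Large V3; this count is an assumption and is not stated in this model card.

```python
def estimate_weight_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# ASSUMPTION: ~756M parameters (not taken from this model card).
ASSUMED_PARAMS = 756_000_000

fp32_gb = estimate_weight_size_gb(ASSUMED_PARAMS, 4)  # float32: 4 bytes per weight
fp16_gb = estimate_weight_size_gb(ASSUMED_PARAMS, 2)  # float16: 2 bytes per weight

print(f"float32 weights: {fp32_gb:.2f} GB")
print(f"float16 weights: {fp16_gb:.2f} GB")
```

Under that assumption the float32 estimate lands close to the ~3.0 GB listed above, and loading in float16 would roughly halve it. Activation memory and framework overhead come on top of the weights.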
Key Characteristics
- Language: French (primary), with multilingual capabilities
- Context Length: Long-form audio support (up to 30 minutes per segment)
- Training: Extended to 30-second audio segments to maintain long-form transcription abilities
- Output: Text transcription with word-level timestamps
- Performance: Optimized for French speech recognition with speed-accuracy trade-off
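Since the model is trained on 30-second segments, long recordings are typically split into windows before transcription. The following is a minimal sketch of how window boundaries could be computed; the function name `chunk_boundaries` and the 5-second overlap are illustrative assumptions, not part of this model's tooling.

```python
from typing import List, Tuple


def chunk_boundaries(
    total_samples: int,
    sample_rate: int = 16000,
    chunk_s: float = 30.0,
    overlap_s: float = 5.0,  # assumed overlap, chosen for illustration only
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    if total_samples <= 0:
        return []

    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, total_samples)
        bounds.append((start, end))
        if end == total_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds


# Example: a 70-second recording sampled at 16 kHz
for start, end in chunk_boundaries(70 * 16000):
    print(f"{start / 16000:.0f}s -> {end / 16000:.0f}s")
```

Each window can then be transcribed independently and the overlapping regions reconciled afterwards. Libraries such as Transformers pipelines and Faster Whisper provide their own chunking strategies, which are usually preferable in production.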
Distillation Details
- Teacher Model: OpenAI Whisper Large V3
- Training Data: 22,000+ hours of French speech data
- Training Schedule: 160 epochs with aggressive data augmentation
- Method: "Patient teacher" distillation approach
- Segment Length: 30-second segments (preserving same speaker)
- Timestamp Training: 50% of segments trained with timestamps
Intended Use
Production Deployment
This model is optimized for:
- Real-time Applications: Fast inference for live transcription
- Batch Processing: Cost-effective processing of large audio volumes
- Resource-Constrained Environments: Lower memory footprint (~2-3 GB GPU)
- Speculative Decoding: Use as draft model for 2x speedup with identical output
Research and Development
This model serves as:
- Speed Baseline: Reference point for fast inference research
- Production Baseline: Starting point for production-optimized fine-tuning
- Comparative Studies: Benchmark against full-size models (speed vs accuracy)
- Speculative Decoding Research: Draft model for advanced decoding strategies
Use Cases
- Meeting Transcription: Fast processing of professional meetings
- Long-form Audio: Efficient transcription of 30-120 minute sessions
- Real-time Systems: Live transcription with low latency
- Cost-Sensitive Applications: Reduce inference costs by 2-4x
- Edge Deployment: Run on devices with limited GPU memory
Performance Benchmarks
Speed Performance
| Metric | Distilled Model | Full Large V3 | Improvement |
|---|---|---|---|
| Inference Speed | 2-4x faster | Baseline (1x) | 2-4x speedup |
| GPU Memory | ~2-3 GB | ~6-8 GB | ~50% reduction |
| Throughput | 2-4x higher | Baseline | 2-4x increase |
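Speedups like those in the table depend on hardware, batch size, and audio length, so it is worth measuring them on your own setup. A minimal sketch, assuming you wrap each model's transcription call in a zero-argument function; the helper names `real_time_factor` and `speedup` are hypothetical.

```python
import time
from typing import Callable


def real_time_factor(transcribe_fn: Callable[[], object],
                     audio_duration_s: float,
                     n_runs: int = 3) -> float:
    """Median wall-clock time of transcribe_fn() divided by the audio duration.

    Values below 1.0 mean the audio is processed faster than real time.
    """
    timings = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        transcribe_fn()
        timings.append(time.perf_counter() - t0)
    timings.sort()
    return timings[len(timings) // 2] / audio_duration_s


def speedup(baseline_rtf: float, candidate_rtf: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_rtf / candidate_rtf


# Placeholder workload; replace the lambda with a real transcription call.
rtf = real_time_factor(lambda: sum(range(10_000)), audio_duration_s=60.0)
print(f"real-time factor: {rtf:.6f}")
```

Running the same helper for the full model and the distilled model on identical audio, then passing both results to `speedup`, gives a figure comparable to the "Improvement" column.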
Accuracy Performance
The distilled model maintains performance close to the full model on most tasks:
- Short-form transcription: Competitive with full model
- Long-form transcription: Maintained through 30-second segment training
- French language: Optimized for French speech recognition
- Post-normalization WER: Comparable to full model on standard benchmarks
Note: Exact WER metrics are evaluated on both in-distribution (ID) and out-of-distribution (OOD) datasets. Performance may vary slightly on complex tasks compared to the full model, but the speed-accuracy trade-off is highly favorable for production use.
Speculative Decoding Performance
When used as a draft model with the full Whisper Large V3 as teacher:
- Speed: 2x faster than teacher alone
- Output: Identical to teacher (guaranteed)
- Memory: Only the draft model's decoder needs to be loaded in addition to the teacher, since the encoder is shared
- Use Case: Best of both worlds - speed of distil + accuracy of full model
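To see why the output matches the teacher, here is a toy sketch of greedy speculative decoding. The `teacher` and `draft` functions are made-up stand-ins operating on integer tokens, not real Whisper models, and the loop is simplified (real implementations verify all proposed tokens in one batched teacher pass).

```python
from typing import Callable, List

EOS = 0
NextToken = Callable[[List[int]], int]  # greedy next-token function


def teacher(seq: List[int]) -> int:
    """Toy 'large' model: deterministic next token."""
    return (3 * seq[-1] + len(seq)) % 10


def draft(seq: List[int]) -> int:
    """Toy 'small' model: agrees with the teacher except at every 4th position."""
    guess = teacher(seq)
    return guess if len(seq) % 4 else (guess + 1) % 10


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(max_new):
        tok = model(seq)
        seq.append(tok)
        if tok == EOS:
            break
    return seq


def speculative_decode(prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    seq = list(prompt)
    limit = len(prompt) + max_new
    while len(seq) < limit:
        # 1) The draft model cheaply proposes up to k tokens.
        proposal, ctx = [], list(seq)
        for _ in range(min(k, limit - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
            if tok == EOS:
                break
        # 2) The teacher checks each proposed position and always has the final
        #    say: its own token is appended, and the first mismatch (or EOS)
        #    ends this round.
        for tok in proposal:
            expected = teacher(seq)
            seq.append(expected)
            if expected != tok or expected == EOS:
                break
        if seq[-1] == EOS:
            break
    return seq


reference = greedy_decode(teacher, [1], max_new=12)
speculative = speculative_decode([1], max_new=12)
print(reference == speculative)
```

Because every emitted token is the teacher's own greedy choice, the draft model only affects how many tokens are accepted per round, and therefore speed, never the result.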
Usage
Installation
pip install transformers torch torchaudio librosa soundfile
Basic Usage with Transformers
import librosa
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load the audio as a 16 kHz mono waveform; the processor expects samples, not a file path
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

# Convert the waveform to input features on the model's device and dtype
# (by default the feature extractor pads or truncates to a single 30-second window)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe",
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(transcription)
Usage with OpenAI Whisper
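import whisper

# Note: whisper.load_model("large-v3") loads the original OpenAI Large V3 weights,
# not this distilled model. load_model also accepts a path to a checkpoint file,
# so point it at a local OpenAI-format checkpoint of the distilled model.
# The path below is a placeholder (assumption); check the repository files
# for the actual checkpoint name.
model = whisper.load_model("path/to/distilled_checkpoint.pt")

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe",
)
print(result["text"])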
Speculative Decoding (Advanced)
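from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

# Load both models
draft_model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
teacher_model_id = "openai/whisper-large-v3"  # or Gilbert-FR-Source
processor = AutoProcessor.from_pretrained(draft_model_id)
draft_model = AutoModelForSpeechSeq2Seq.from_pretrained(draft_model_id)
teacher_model = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_model_id)

# Prepare input features from a 16 kHz waveform
audio, _ = librosa.load("your_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Speculative (assisted) decoding in Transformers: the teacher generates and
# verifies the tokens proposed by the draft model passed as assistant_model
with torch.no_grad():
    generated_ids = teacher_model.generate(
        input_features, assistant_model=draft_model, language="fr", task="transcribe"
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])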
Comparison with Other Gilbert Models
Gilbert-FR-Source (Full Model)
| Aspect | Distilled | Full Source |
|---|---|---|
| Speed | ★★★★★ (2-4x) | ★★★ (1x) |
| Accuracy | ★★★★ (close) | ★★★★★ (max) |
| Memory | ★★★★★ (low) | ★★★ (high) |
| Use Case | Production, Real-time | Research, Max accuracy |
When to Use Each
Use Distilled Model if:
- Speed is critical (real-time, batch processing)
- Resources are limited (GPU memory, costs)
- Production deployment
- Speculative decoding desired
Use Full Source Model if:
- Maximum accuracy required
- Research baseline needed
- Complex audio conditions
- Resources available
Research Methodology
Distillation Approach
This model uses patient teacher distillation:
- Teacher Model: OpenAI Whisper Large V3 (frozen)
- Student Model: Reduced decoder (2 layers) with same encoder
- Training: Extended schedule (160 epochs) with aggressive augmentation
- Data: 22,000+ hours of French speech, 30-second segments
- Objective: Maintain performance while reducing inference cost
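This card does not spell out the training loss. As an illustration only, a common knowledge-distillation objective combines a KL divergence term (student matches the teacher's token distribution) with a cross-entropy term on the target token. The weights and function names below are assumptions, not the values used to train this model.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    target_index: int,
    kl_weight: float = 0.8,  # illustrative weight (assumption)
    ce_weight: float = 1.0,  # illustrative weight (assumption)
    temperature: float = 1.0,
) -> float:
    """Weighted sum of KL(teacher || student) and cross-entropy on the target token."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = kl_divergence(p_teacher, p_student)
    ce = -math.log(softmax(student_logits)[target_index])
    return kl_weight * kl + ce_weight * ce


teacher_logits = [2.0, 0.0, 0.0]
matched = distillation_loss([2.0, 0.0, 0.0], teacher_logits, target_index=0)
mismatched = distillation_loss([0.0, 2.0, 0.0], teacher_logits, target_index=0)
print(f"matched student: {matched:.4f}, mismatched student: {mismatched:.4f}")
```

In practice this per-token loss would be averaged over all decoder positions and computed on tensors with a framework such as PyTorch; with a frozen teacher, only the student's parameters receive gradients.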
Evaluation Standards
- Speed Metrics: Inference time, throughput, memory usage
- Accuracy Metrics: WER, CER, BLEU (post-normalization)
- Long-form: Evaluation on 30+ minute audio segments
- Comparison: Against full model and other baselines
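For readers who want to reproduce post-normalization WER, the sketch below shows the general idea: normalize both strings, then divide the word-level edit distance by the reference length. The `normalize` function is a deliberately simple stand-in (lowercasing and punctuation removal) and is an assumption, not the normalizer used for the evaluations described here.

```python
import re
import unicodedata
from typing import List


def normalize(text: str) -> List[str]:
    """Simplified normalizer: Unicode NFKC, lowercase, strip punctuation, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep letters, digits, and apostrophes
    return text.split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate after normalization."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


print(wer("Bonjour, tout le monde !", "bonjour tout le monde"))
print(wer("le chat dort", "le chien dort"))
```

CER follows the same pattern with characters instead of words. For published comparisons, an established package and a documented normalizer are preferable so that results stay comparable across models.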
Versioning
- Current Version: v0.2 (Extended training for long-form)
- Base Version: v0.1 (Initial distillation)
- Future Versions: Production-optimized variants will reference this baseline
Limitations
This distilled model inherits limitations from the distillation process:
- Accuracy Trade-off: Slightly lower accuracy than full model on complex tasks
- Complex Audio: May struggle more than full model on very challenging audio
- Out-of-Distribution: Performance may degrade more on OOD data
- Long-form Edge Cases: Very long segments (>60 min) may show more degradation
However, for most production use cases, the speed-accuracy trade-off is highly favorable.
Future Research Directions
Planned Gilbert Distilled Variants
Gilbert-Distil-Meetings-v1
- Fine-tuned on meeting data
- Optimized for multi-speaker scenarios
- Production-ready for meeting transcription
Gilbert-Distil-Longform-v1
- Enhanced long-form capabilities
- Better context stability
- Optimized for 30-120 minute sessions
Gilbert-Distil-Accents-v1
- Robustness to regional accents
- Fine-tuned on diverse French accents
- Production deployment for international use
All future Gilbert models are the exclusive intellectual property of Lexia France and will include detailed evaluation reports.
Intellectual Property and Licensing
License for This Model
This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing:
- Commercial use
- Modification
- Distribution
- Private use
- Patent use
See the LICENSE file for full terms.
Intellectual Property Notice
Important: While this model is available under MIT License:
- All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.
- Use of this model for Gilbert project development implies acceptance of these IP terms.
- Commercial use of Gilbert project derivatives requires separate licensing agreements.
For licensing inquiries regarding Gilbert project models, contact: [email protected]
Citation
If you use this distilled model in your research, please cite:
@software{gilbert_distil_2024,
title={Gilbert-Whisper-Distil-FR-v0.2: Distilled Model for Production and Research},
author={MEscriva and Lexia France},
year={2024},
url={https://huggingface.co/MEscriva/gilbert-whisper-distil-fr-v0.2},
version={0.2},
note={Distilled Whisper model for fast French speech recognition}
}
Acknowledgments
This distilled model is based on:
- OpenAI Whisper Large V3 (MIT License)
- bofenghuang/whisper-large-v3-distil-fr-v0.2 (Original distillation work)
We acknowledge the contributions of:
- OpenAI for developing and open-sourcing Whisper
- Bofeng Huang for the French distillation work
- Hugging Face for implementing Whisper in Transformers and creating Distil-Whisper
- The open-source community
Contact
For research collaboration, evaluation access, or technical inquiries:
- Website: https://gilbert-assistant.fr
- Email: [email protected]
- Repository: https://huggingface.co/MEscriva/gilbert-whisper-distil-fr-v0.2
Changelog
Version 0.2 (2024-12-19)
- Initial Gilbert release
- Based on bofenghuang/whisper-large-v3-distil-fr-v0.2
- Extended training for long-form transcription (30-second segments)
- Patient teacher distillation method
- Multiple format support (Transformers, Whisper, Faster Whisper, etc.)
© 2024 Lexia France. All rights reserved for Gilbert project derivatives.