Gilbert-Whisper-Distil-FR-v0.2 — Distilled Model for Production and Research

Overview

Gilbert-Whisper-Distil-FR-v0.2 is a distilled version of Whisper Large V3, optimized for French speech recognition and designed for production deployment and research acceleration within the Gilbert project ecosystem. This model provides 2-4x faster inference while maintaining performance close to the full Large V3 model, making it ideal for real-time applications and cost-effective production systems.

Important Notice on Intellectual Property:

This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing research and commercial use.
All derivative models, fine-tuned variants, and specialized models developed from this distilled model as part of the Gilbert project are the exclusive intellectual property of Lexia France.
While this model can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

Research Context

The Gilbert project requires both high-precision baseline models (Gilbert-FR-Source) and production-optimized models for different deployment scenarios. This distilled model serves as the fast inference baseline for:

Real-time transcription in meeting applications
Cost-effective batch processing at scale
Speculative decoding acceleration (2x speedup with identical output)
Resource-constrained environments (limited GPU memory, edge devices)

This model complements the full-size Gilbert-FR-Source baseline, providing researchers and developers with a speed-optimized alternative for production deployments.

Model Details

Architecture

Base Model: OpenAI Whisper Large V3 (distilled)
Distillation Method: Patient teacher distillation with 2 decoder layers
Encoder: Identical to Whisper Large V3 (unchanged, shared with teacher)
Decoder: 2 layers (reduced from full model)
Framework: Compatible with Hugging Face Transformers, OpenAI Whisper, Faster Whisper, Whisper.cpp, CTranslate2, ONNX Runtime, and MLX
Model Size: ~3.0 GB in full precision (FP32, about 0.8B parameters); inference is cheaper than with the full model because the decoder has only 2 layers
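
As a rough sanity check on the ~3.0 GB figure above, weight storage is approximately the parameter count multiplied by the bytes per parameter. The sketch below assumes roughly 0.8 billion parameters (a rounded value, so the result only approximately matches the stated size); the helper name is illustrative, and the estimate ignores activations and decoding caches.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes); excludes activations and caches."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 0.8e9  # assumption: rounded parameter count of the distilled model

fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # 32-bit floats take 4 bytes each
fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 16-bit floats take 2 bytes each
print(f"FP32 weights: ~{fp32_gb:.1f} GB, FP16 weights: ~{fp16_gb:.1f} GB")
```

Loading in half precision roughly halves the weight footprint, which is one reason the usage examples select float16 on GPU.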

Key Characteristics

Language: French (primary), with multilingual capabilities
Context Length: 30-second input window, as in Whisper; long-form audio is supported by transcribing consecutive windows
Training: Extended to 30-second audio segments to maintain long-form transcription abilities
Output: Text transcription with word-level timestamps
Performance: Optimized for French speech recognition with speed-accuracy trade-off
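
Because Whisper-style models consume audio in 30-second windows, long recordings are usually split into windows before decoding. The sketch below computes window boundaries with a small overlap so that words cut at a boundary appear in two windows. The function name and the 5-second overlap are illustrative assumptions, not the chunking strategy used to train or serve this model.

```python
def chunk_boundaries(total_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds that cover [0, total_s] with overlapping windows."""
    if window_s <= overlap_s:
        raise ValueError("window_s must be larger than overlap_s")
    step = window_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds


# A 70-second recording is covered by three overlapping windows.
print(chunk_boundaries(70.0))
```

Each window can then be transcribed independently and the overlapping text merged; how the merge is done depends on the decoding framework.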

Distillation Details

Teacher Model: OpenAI Whisper Large V3
Training Data: 22,000+ hours of French speech data
Training Schedule: 160 epochs with aggressive data augmentation
Method: "Patient teacher" distillation approach
Segment Length: 30-second segments (preserving same speaker)
Timestamp Training: 50% of segments trained with timestamps

Intended Use

Production Deployment

This model is optimized for:

Real-time Applications: Fast inference for live transcription
Batch Processing: Cost-effective processing of large audio volumes
Resource-Constrained Environments: Lower memory footprint (~2-3 GB GPU)
Speculative Decoding: Use as draft model for 2x speedup with identical output

Research and Development

This model serves as:

Speed Baseline: Reference point for fast inference research
Production Baseline: Starting point for production-optimized fine-tuning
Comparative Studies: Benchmark against full-size models (speed vs accuracy)
Speculative Decoding Research: Draft model for advanced decoding strategies

Use Cases

✅ Meeting Transcription: Fast processing of professional meetings
✅ Long-form Audio: Efficient transcription of 30-120 minute sessions
✅ Real-time Systems: Live transcription with low latency
✅ Cost-Sensitive Applications: Reduce inference costs by 2-4x
✅ Edge Deployment: Run on devices with limited GPU memory

Performance Benchmarks

Speed Performance

Metric	Distilled Model	Full Large V3	Improvement
Inference Speed	2-4x faster	Baseline (1x)	2-4x speedup
GPU Memory	~2-3 GB	~6-8 GB	~50% reduction
Throughput	2-4x higher	Baseline	2-4x increase
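
Speed figures like those in the table depend on hardware, batch size, and precision, so it can be useful to measure them locally. One common measure is the real-time factor (processing time divided by audio duration). The sketch below is a minimal timing harness; `real_time_factor`, `speedup`, and the sleep-based stand-ins are hypothetical and would be replaced by actual calls to the two models.

```python
import time


def real_time_factor(transcribe, audio_seconds: float, runs: int = 3) -> float:
    """Average wall-clock processing time divided by audio duration (lower is faster)."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe()  # any zero-argument callable that runs one transcription
        elapsed.append(time.perf_counter() - start)
    return (sum(elapsed) / runs) / audio_seconds


def speedup(baseline_rtf: float, candidate_rtf: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_rtf / candidate_rtf


# Stand-ins for real model calls (assumption: replace with actual transcription functions).
full_rtf = real_time_factor(lambda: time.sleep(0.02), audio_seconds=30.0)
distil_rtf = real_time_factor(lambda: time.sleep(0.005), audio_seconds=30.0)
print(f"speedup: {speedup(full_rtf, distil_rtf):.1f}x")
```

On a GPU, a warm-up run and device synchronization before reading the clock give more reliable numbers.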

Accuracy Performance

The distilled model maintains performance close to the full model on most tasks:

Short-form transcription: Competitive with full model
Long-form transcription: Maintained through 30-second segment training
French language: Optimized for French speech recognition
Post-normalization WER: Comparable to full model on standard benchmarks

Note: WER is evaluated on both in-distribution (ID) and out-of-distribution (OOD) datasets; exact figures are not listed in this card. Accuracy may be slightly lower than the full model on complex tasks, but the speed-accuracy trade-off is favorable for most production use.
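
For readers who want to reproduce a post-normalization WER comparison on their own data, the sketch below shows the basic computation: normalize both strings, then divide the word-level edit distance by the reference length. The normalizer here is deliberately simple (lowercasing and punctuation removal) and is an assumption; it is not the normalization used for this model's evaluation.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("Bonjour, le chat dort.", "bonjour le chat dors"))
```

Running the same function over the outputs of the distilled and full models on a shared test set gives a like-for-like accuracy comparison.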

Speculative Decoding Performance

When used as a draft model with the full Whisper Large V3 as teacher:

Speed: 2x faster than teacher alone
Output: Identical to teacher (guaranteed)
Memory: Only the draft decoder needs to be loaded in addition to the teacher (the encoder is shared)
Use Case: Combines the speed of the distilled model with the accuracy of the full model
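
The identical-output property follows from how greedy speculative decoding works: the draft proposes a few tokens, and the larger model keeps only those that match its own choice, emitting its own token at the first disagreement. The toy sketch below illustrates this with plain functions standing in for the two models; all names are hypothetical. A real implementation verifies all proposed positions in one batched forward pass, which is where the speedup comes from.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int, eos: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        tok = model(out)
        out.append(tok)
        if tok == eos:
            break
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, eos: int, k: int = 4) -> List[int]:
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit and (len(out) == len(prompt) or out[-1] != eos):
        # 1) The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each position and always emits its own choice,
        #    so a wrong draft token is replaced and the rest of the proposal dropped.
        for tok in proposal:
            expected = target(out)
            out.append(expected)
            if expected != tok or expected == eos or len(out) >= limit:
                break
    return out


# Toy models: the draft agrees with the target except at every third position.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    t = toy_target(ctx)
    return t if len(ctx) % 3 else (t + 1) % 11


plain = greedy_decode(toy_target, [1, 2], max_new=12, eos=0)
spec = speculative_decode(toy_target, toy_draft, [1, 2], max_new=12, eos=0)
print(plain == spec)
```

The more often the draft agrees with the target, the more tokens are accepted per verification step, so the realized speedup depends on how closely the distilled model tracks the full one.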

Usage

Installation

pip install transformers accelerate torch torchaudio librosa soundfile

Basic Usage with Transformers

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True
)
model.to(device)

# Load the audio as a 16 kHz mono waveform; the processor expects raw samples, not a file path
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)

# Extract log-mel features (padded or truncated to 30 seconds) and match the model's device and dtype
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="fr",
        task="transcribe"
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
print(transcription)

Usage with OpenAI Whisper

import whisper

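# Load the distilled model from a local OpenAI-format checkpoint.
# Assumption: the path below is a placeholder for wherever that checkpoint is stored;
# whisper.load_model("large-v3") would load the full OpenAI model instead.
model = whisper.load_model("path/to/distilled_checkpoint.pt")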

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe"
)

print(result["text"])

Speculative Decoding (Advanced)

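from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# Load both models
draft_model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
teacher_model_id = "openai/whisper-large-v3"  # or Gilbert-FR-Source

processor = AutoProcessor.from_pretrained(draft_model_id)
draft_model = AutoModelForSpeechSeq2Seq.from_pretrained(draft_model_id, torch_dtype=torch_dtype).to(device)
teacher_model = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_model_id, torch_dtype=torch_dtype).to(device)

# Load audio as a 16 kHz waveform and extract input features
audio, _ = librosa.load("your_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device, dtype=torch_dtype)

# Assisted generation in Transformers: the teacher verifies tokens proposed by the draft model
with torch.no_grad():
    generated_ids = teacher_model.generate(
        input_features, assistant_model=draft_model, language="fr", task="transcribe"
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])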

Comparison with Other Gilbert Models

Gilbert-FR-Source (Full Model)

Aspect	Distilled	Full Source
Speed	⭐⭐⭐⭐⭐ (2-4x)	⭐⭐⭐ (1x)
Accuracy	⭐⭐⭐⭐ (close)	⭐⭐⭐⭐⭐ (max)
Memory	⭐⭐⭐⭐⭐ (low)	⭐⭐⭐ (high)
Use Case	Production, Real-time	Research, Max accuracy

When to Use Each

Use Distilled Model if:

✅ Speed is critical (real-time, batch processing)
✅ Resources are limited (GPU memory, costs)
✅ Production deployment
✅ Speculative decoding desired

Use Full Source Model if:

✅ Maximum accuracy required
✅ Research baseline needed
✅ Complex audio conditions
✅ Resources available

Research Methodology

Distillation Approach

This model uses patient teacher distillation:

Teacher Model: OpenAI Whisper Large V3 (frozen)
Student Model: Reduced decoder (2 layers) with same encoder
Training: Extended schedule (160 epochs) with aggressive augmentation
Data: 22,000+ hours of French speech, 30-second segments
Objective: Maintain performance while reducing inference cost
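
The exact training objective is not stated here. Distillation setups of this kind often combine a KL-divergence term, which pulls the student's token distribution toward the teacher's, with a cross-entropy term on the teacher's pseudo-labels. The sketch below shows that combination for a single decoding step in plain Python; the function names and the weights 0.8 and 1.0 are illustrative assumptions, not confirmed settings for this model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      pseudo_label: int, kl_weight: float = 0.8,
                      ce_weight: float = 1.0) -> float:
    """Weighted sum of KL(teacher || student) and cross-entropy on the teacher's pseudo-label."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    kl = kl_divergence(teacher, student)
    ce = -math.log(student[pseudo_label])
    return kl_weight * kl + ce_weight * ce


# The teacher prefers token 2; a student that agrees gets a lower loss than one that does not.
teacher = [0.1, 0.2, 3.0]
print(distillation_loss([0.0, 0.1, 2.5], teacher, pseudo_label=2))
print(distillation_loss([2.5, 0.1, 0.0], teacher, pseudo_label=2))
```

In actual training the same quantity would be computed over batched tensors and averaged across all decoder positions, with the teacher kept frozen.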

Evaluation Standards

Speed Metrics: Inference time, throughput, memory usage
Accuracy Metrics: WER, CER, BLEU (post-normalization)
Long-form: Evaluation on recordings of 30 minutes or more
Comparison: Against full model and other baselines

Versioning

Current Version: v0.2 (Extended training for long-form)
Base Version: v0.1 (Initial distillation)
Future Versions: Production-optimized variants will reference this baseline

Limitations

This distilled model inherits limitations from the distillation process:

Accuracy Trade-off: Slightly lower accuracy than full model on complex tasks
Complex Audio: May struggle more than full model on very challenging audio
Out-of-Distribution: Performance may degrade more on OOD data
Long-form Edge Cases: Very long recordings (>60 min) may show more degradation

However, for most production use cases, the speed-accuracy trade-off is highly favorable.

Future Research Directions

Planned Gilbert Distilled Variants

Gilbert-Distil-Meetings-v1
- Fine-tuned on meeting data
- Optimized for multi-speaker scenarios
- Production-ready for meeting transcription
Gilbert-Distil-Longform-v1
- Enhanced long-form capabilities
- Better context stability
- Optimized for 30-120 minute sessions
Gilbert-Distil-Accents-v1
- Robustness to regional accents
- Fine-tuned on diverse French accents
- Production deployment for international use

All future Gilbert models are the exclusive intellectual property of Lexia France and will include detailed evaluation reports.

Intellectual Property and Licensing

License for This Model

This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing:

✅ Commercial use
✅ Modification
✅ Distribution
✅ Private use
✅ Patent use

See the LICENSE file for full terms.

Intellectual Property Notice

Important: While this model is available under MIT License:

All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.
Use of this model for Gilbert project development implies acceptance of these IP terms.
Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: [email protected]

Citation

If you use this distilled model in your research, please cite:

@software{gilbert_distil_2024,
  title={Gilbert-Whisper-Distil-FR-v0.2: Distilled Model for Production and Research},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-whisper-distil-fr-v0.2},
  version={0.2},
  note={Distilled Whisper model for fast French speech recognition}
}

Acknowledgments

This distilled model is based on:

OpenAI Whisper Large V3 (MIT License)
bofenghuang/whisper-large-v3-distil-fr-v0.2 (Original distillation work)

We acknowledge the contributions of:

OpenAI for developing and open-sourcing Whisper
Bofeng Huang for the French distillation work
Hugging Face for implementing Whisper in Transformers and creating Distil-Whisper
The open-source community

Contact

For research collaboration, evaluation access, or technical inquiries:

Website: https://gilbert-assistant.fr
Email: [email protected]
Repository: https://huggingface.co/MEscriva/gilbert-whisper-distil-fr-v0.2

Changelog

Version 0.2 (2024-12-19)

Initial Gilbert release
Based on bofenghuang/whisper-large-v3-distil-fr-v0.2
Extended training for long-form transcription (30-second segments)
Patient teacher distillation method
Multiple format support (Transformers, Whisper, Faster Whisper, etc.)
