Gilbert-Whisper-Distil-FR-v0.2: Distilled Model for Production and Research

Overview

Gilbert-Whisper-Distil-FR-v0.2 is a distilled version of Whisper Large V3, optimized for French speech recognition and designed for production deployment and research acceleration within the Gilbert project ecosystem. This model provides 2-4x faster inference while maintaining performance close to the full Large V3 model, making it ideal for real-time applications and cost-effective production systems.

Important Notice on Intellectual Property:

  • This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing research and commercial use.
  • All derivative models, fine-tuned variants, and specialized models developed from this distilled model as part of the Gilbert project are the exclusive intellectual property of Lexia France.
  • While this model can be used freely under MIT terms, any models built upon it for the Gilbert project are proprietary and subject to separate licensing terms.

Research Context

The Gilbert project requires both high-precision baseline models (Gilbert-FR-Source) and production-optimized models for different deployment scenarios. This distilled model serves as the fast inference baseline for:

  • Real-time transcription in meeting applications
  • Cost-effective batch processing at scale
  • Speculative decoding acceleration (2x speedup with identical output)
  • Resource-constrained environments (limited GPU memory, edge devices)

This model complements the full-size Gilbert-FR-Source baseline, providing researchers and developers with a speed-optimized alternative for production deployments.


Model Details

Architecture

  • Base Model: OpenAI Whisper Large V3 (distilled)
  • Distillation Method: Patient teacher distillation with 2 decoder layers
  • Encoder: Identical to Whisper Large V3 (unchanged, shared with teacher)
  • Decoder: 2 layers (reduced from full model)
  • Framework: Compatible with Hugging Face Transformers, OpenAI Whisper, Faster Whisper, Whisper.cpp, CTranslate2, ONNX Runtime, and MLX
  • Model Size: ~3.0 GB at full precision (float32); inference is cheaper than the teacher's because the decoder has only 2 layers
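
The ~3.0 GB figure is consistent with storing roughly 0.8 billion parameters as 32-bit floats. A back-of-the-envelope check, assuming the rounded parameter count and F32 tensor type reported on the model page; the half-precision number is an estimate, not a measurement:

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 0.8e9  # assumption: rounded parameter count from the model page

fp32_gb = checkpoint_size_gb(NUM_PARAMS, 4)  # float32: 4 bytes per weight
fp16_gb = checkpoint_size_gb(NUM_PARAMS, 2)  # float16: 2 bytes per weight

print(f"float32: ~{fp32_gb:.1f} GB, float16: ~{fp16_gb:.1f} GB")
```

Because the parameter count is rounded, the result lands near, not exactly on, the stated file size.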

Key Characteristics

  • Language: French (primary), with multilingual capabilities
  • Context Length: Long-form audio support (up to 30 minutes per segment)
  • Training: Extended to 30-second audio segments to maintain long-form transcription abilities
  • Output: Text transcription with word-level timestamps
  • Performance: Optimized for French speech recognition with speed-accuracy trade-off
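
Whisper-family models consume audio in windows of at most 30 seconds, so long recordings are typically split before transcription. A minimal sketch of that windowing arithmetic, assuming 16 kHz input and simple non-overlapping windows (`chunk_bounds` is a hypothetical helper name; production pipelines often add overlap or voice-activity detection):

```python
SAMPLE_RATE = 16_000   # Hz, the rate Whisper feature extractors expect
WINDOW_SECONDS = 30    # maximum audio length per forward pass

def chunk_bounds(num_samples: int, window_s: int = WINDOW_SECONDS,
                 sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive windows."""
    if num_samples <= 0:
        return []
    window = window_s * sample_rate
    return [(start, min(start + window, num_samples))
            for start in range(0, num_samples, window)]

# A 95-second recording yields three full windows and one 5-second remainder.
bounds = chunk_bounds(95 * SAMPLE_RATE)
print(len(bounds), bounds[-1])
```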

Distillation Details

  • Teacher Model: OpenAI Whisper Large V3
  • Training Data: 22,000+ hours of French speech data
  • Training Schedule: 160 epochs with aggressive data augmentation
  • Method: "Patient teacher" distillation approach
  • Segment Length: 30-second segments (preserving same speaker)
  • Timestamp Training: 50% of segments trained with timestamps

Intended Use

Production Deployment

This model is optimized for:

  1. Real-time Applications: Fast inference for live transcription
  2. Batch Processing: Cost-effective processing of large audio volumes
  3. Resource-Constrained Environments: Lower memory footprint (~2-3 GB GPU)
  4. Speculative Decoding: Use as draft model for 2x speedup with identical output

Research and Development

This model serves as:

  1. Speed Baseline: Reference point for fast inference research
  2. Production Baseline: Starting point for production-optimized fine-tuning
  3. Comparative Studies: Benchmark against full-size models (speed vs accuracy)
  4. Speculative Decoding Research: Draft model for advanced decoding strategies

Use Cases

  • ✅ Meeting Transcription: Fast processing of professional meetings
  • ✅ Long-form Audio: Efficient transcription of 30-120 minute sessions
  • ✅ Real-time Systems: Live transcription with low latency
  • ✅ Cost-Sensitive Applications: Reduce inference costs by 2-4x
  • ✅ Edge Deployment: Run on devices with limited GPU memory

Performance Benchmarks

Speed Performance

| Metric | Distilled Model | Full Large V3 | Improvement |
|---|---|---|---|
| Inference Speed | 2-4x faster | Baseline (1x) | 2-4x speedup |
| GPU Memory | ~2-3 GB | ~6-8 GB | ~50-75% reduction |
| Throughput | 2-4x higher | Baseline | 2-4x increase |
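
To see what a 2-4x speedup means for batch workloads, the arithmetic can be sketched as follows. The baseline real-time factor and the archive size are hypothetical placeholders, not measurements of this model; only the 2-4x range comes from the figures above.

```python
def gpu_hours(audio_hours: float, baseline_rtf: float, speedup: float) -> float:
    """GPU hours needed to transcribe `audio_hours` of audio.

    baseline_rtf: processing time / audio duration for the full model.
    speedup: how many times faster the distilled model runs.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours * baseline_rtf / speedup

AUDIO_HOURS = 1_000   # hypothetical archive size
BASELINE_RTF = 0.10   # hypothetical: full model needs 6 minutes per audio hour

for speedup in (1.0, 2.0, 4.0):
    hours = gpu_hours(AUDIO_HOURS, BASELINE_RTF, speedup)
    print(f"{speedup:.0f}x -> {hours:.1f} GPU hours")
```

Multiplying the resulting GPU hours by an hourly instance price gives a rough cost comparison.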

Accuracy Performance

The distilled model maintains performance close to the full model on most tasks:

  • Short-form transcription: Competitive with full model
  • Long-form transcription: Maintained through 30-second segment training
  • French language: Optimized for French speech recognition
  • Post-normalization WER: Comparable to full model on standard benchmarks

Note: Exact WER metrics are evaluated on both in-distribution (ID) and out-of-distribution (OOD) datasets. Performance may vary slightly on complex tasks compared to the full model, but the speed-accuracy trade-off is highly favorable for production use.
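
For readers who want to run a post-normalization WER comparison on their own data, the metric can be sketched in a few lines. The normalizer below (lowercasing and punctuation stripping) is a simplified stand-in, not the normalizer used to evaluate this model, and the example sentences are made up:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split into words."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Differences in case and punctuation vanish after normalization.
same = wer("Bonjour, la réunion commence.", "bonjour la réunion commence")
# One dropped word and one substituted word out of six reference words.
errors = wer("la réunion commence à dix heures", "la réunion commence dix heure")
print(same, errors)
```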

Speculative Decoding Performance

When used as a draft model with the full Whisper Large V3 as teacher:

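  • Speed: About 2x faster than the teacher alone
  • Output: Identical to the teacher's output under greedy decoding, because the teacher verifies every drafted token
  • Memory: The draft shares the teacher's encoder, so only its 2-layer decoder needs to be loaded in addition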
  • Use Case: Best of both worlds - speed of distil + accuracy of full model
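
The identical-output property follows from how speculative decoding works under greedy decoding: the draft proposes a few tokens, the teacher checks them, and the first token the teacher disagrees with is replaced by the teacher's own choice. A toy sketch with stand-in models over integer tokens (`teacher_next` and `draft_next` are hypothetical placeholders, not Whisper):

```python
EOS = 0

def teacher_next(prefix: list[int]) -> int:
    """Stand-in for the large model's greedy next-token choice."""
    if len(prefix) >= 10:
        return EOS
    return (sum(prefix) * 7 + 3) % 11 + 1  # token in 1..11

def draft_next(prefix: list[int]) -> int:
    """Stand-in for the distilled model: usually agrees with the teacher."""
    guess = teacher_next(prefix)
    if len(prefix) % 4 == 3:
        return guess % 11 + 1  # deliberately differs from the teacher
    return guess

def greedy_decode(next_token, prompt, max_len=32):
    out = list(prompt)
    while len(out) < max_len:
        out.append(next_token(out))
        if out[-1] == EOS:
            break
    return out

def speculative_decode(teacher, draft, prompt, k=4, max_len=32):
    out, done = list(prompt), False
    while not done and len(out) < max_len:
        # 1. The draft proposes up to k tokens autoregressively.
        proposal = []
        while len(proposal) < k and len(out) + len(proposal) < max_len:
            proposal.append(draft(out + proposal))
            if proposal[-1] == EOS:
                break
        # 2. The teacher checks each position (a single batched forward
        #    pass in a real model) and keeps its own token every time.
        for token in proposal:
            expected = teacher(out)
            out.append(expected)
            done = expected == EOS
            if expected != token or done:
                break  # discard the rest of the proposal
    return out

reference = greedy_decode(teacher_next, [1])
speculative = speculative_decode(teacher_next, draft_next, [1])
print(speculative == reference)
```

The speedup in a real system comes from step 2 costing one teacher forward pass per batch of drafted tokens instead of one pass per token; this toy version only demonstrates the output guarantee.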

Usage

Installation

pip install transformers accelerate torch torchaudio librosa soundfile

Basic Usage with Transformers

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True
)
model.to(device)

# Load the audio as a 16 kHz mono waveform; the processor expects samples, not a file path
audio_path = "your_audio.wav"
audio, _ = librosa.load(audio_path, sr=16000)
inputs = processor(audio, return_tensors="pt", sampling_rate=16000)
# Move the features to the model's device and dtype (float16 on GPU)
inputs = {"input_features": inputs.input_features.to(device, dtype=torch_dtype)}

with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        language="fr",
        task="transcribe"
    )

transcription = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

Usage with OpenAI Whisper

import whisper

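# Load the distilled model from a local checkpoint in OpenAI Whisper format.
# The path below is a placeholder. Note that load_model("large-v3") would load
# the full teacher model, not this distilled one.
model = whisper.load_model("path/to/distilled-checkpoint.pt")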

# Transcribe French audio
result = model.transcribe(
    "audio.wav",
    language="fr",
    task="transcribe"
)

print(result["text"])

Speculative Decoding (Advanced)

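from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# Load both models: the distilled model drafts tokens, the teacher verifies them
draft_model_id = "MEscriva/gilbert-whisper-distil-fr-v0.2"
teacher_model_id = "openai/whisper-large-v3"  # or Gilbert-FR-Source

processor = AutoProcessor.from_pretrained(draft_model_id)
draft_model = AutoModelForSpeechSeq2Seq.from_pretrained(draft_model_id, torch_dtype=torch_dtype).to(device)
teacher_model = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_model_id, torch_dtype=torch_dtype).to(device)

# Load a 16 kHz waveform and extract features (file name is a placeholder)
audio, _ = librosa.load("your_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to(device, dtype=torch_dtype)

# Assisted generation in Transformers: passing assistant_model enables speculative decoding
generated_ids = teacher_model.generate(
    input_features, assistant_model=draft_model, language="fr", task="transcribe"
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])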

Comparison with Other Gilbert Models

Gilbert-FR-Source (Full Model)

| Aspect | Distilled | Full Source |
|---|---|---|
| Speed | ⭐⭐⭐⭐⭐ (2-4x) | ⭐⭐⭐ (1x) |
| Accuracy | ⭐⭐⭐⭐ (close) | ⭐⭐⭐⭐⭐ (max) |
| Memory | ⭐⭐⭐⭐⭐ (low) | ⭐⭐⭐ (high) |
| Use Case | Production, Real-time | Research, Max accuracy |

When to Use Each

Use Distilled Model if:

  • ✅ Speed is critical (real-time, batch processing)
  • ✅ Resources are limited (GPU memory, costs)
  • ✅ Production deployment
  • ✅ Speculative decoding desired

Use Full Source Model if:

  • ✅ Maximum accuracy required
  • ✅ Research baseline needed
  • ✅ Complex audio conditions
  • ✅ Resources available

Research Methodology

Distillation Approach

This model uses patient teacher distillation:

  1. Teacher Model: OpenAI Whisper Large V3 (frozen)
  2. Student Model: Reduced decoder (2 layers) with same encoder
  3. Training: Extended schedule (160 epochs) with aggressive augmentation
  4. Data: 22,000+ hours of French speech, 30-second segments
  5. Objective: Maintain performance while reducing inference cost
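
The training objective is not spelled out here. Distillation setups in the Distil-Whisper family commonly combine a KL-divergence term against the teacher's token distribution with a cross-entropy term on teacher-generated pseudo-labels; the sketch below illustrates that general shape for a single decoding step. The weights and the toy logits are illustrative assumptions, not this model's actual training configuration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, pseudo_label,
                      kl_weight=0.8, ce_weight=1.0):
    """Weighted sum of KL(teacher || student) and CE on the pseudo-label."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    kl = kl_divergence(p_teacher, p_student)
    ce = -math.log(p_student[pseudo_label])
    return kl_weight * kl + ce_weight * ce

teacher_logits = [2.0, 0.5, -1.0]
# The pseudo-label is the teacher's own most likely token (index 0 here).
pseudo_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)

close = distillation_loss(teacher_logits, [1.8, 0.6, -0.9], pseudo_label)
far = distillation_loss(teacher_logits, [-1.0, 0.5, 2.0], pseudo_label)
print(close < far)  # a student that mimics the teacher gets the lower loss
```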

Evaluation Standards

  • Speed Metrics: Inference time, throughput, memory usage
  • Accuracy Metrics: WER, CER, BLEU (post-normalization)
  • Long-form: Evaluation on 30+ minute audio segments
  • Comparison: Against full model and other baselines
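
Speed metrics such as latency and real-time factor can be collected with a small timing harness. The sketch below times any callable; `fake_transcribe` is a placeholder standing in for a real model call, so the numbers it prints say nothing about this model:

```python
import time
from statistics import mean

def benchmark(transcribe, audio_seconds: float, runs: int = 3) -> dict:
    """Time `transcribe()` over several runs and derive speed metrics."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe()
        latencies.append(time.perf_counter() - start)
    latency = mean(latencies)
    return {
        "mean_latency_s": latency,
        # Below 1 means faster than real time.
        "real_time_factor": latency / audio_seconds,
        # Seconds of audio processed per wall-clock second.
        "throughput_x": audio_seconds / latency,
    }

def fake_transcribe():
    time.sleep(0.01)  # placeholder for a model call on a 30-second clip

stats = benchmark(fake_transcribe, audio_seconds=30.0)
print(stats)
```

When timing a GPU model, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously and the timer would otherwise stop too early.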

Versioning

  • Current Version: v0.2 (Extended training for long-form)
  • Base Version: v0.1 (Initial distillation)
  • Future Versions: Production-optimized variants will reference this baseline

Limitations

This distilled model inherits limitations from the distillation process:

  1. Accuracy Trade-off: Slightly lower accuracy than full model on complex tasks
  2. Complex Audio: May struggle more than full model on very challenging audio
  3. Out-of-Distribution: Performance may degrade more on OOD data
  4. Long-form Edge Cases: Very long segments (>60 min) may show more degradation

However, for most production use cases, the speed-accuracy trade-off is highly favorable.


Future Research Directions

Planned Gilbert Distilled Variants

  1. Gilbert-Distil-Meetings-v1

    • Fine-tuned on meeting data
    • Optimized for multi-speaker scenarios
    • Production-ready for meeting transcription
  2. Gilbert-Distil-Longform-v1

    • Enhanced long-form capabilities
    • Better context stability
    • Optimized for 30-120 minute sessions
  3. Gilbert-Distil-Accents-v1

    • Robustness to regional accents
    • Fine-tuned on diverse French accents
    • Production deployment for international use

All future Gilbert models are the exclusive intellectual property of Lexia France and will include detailed evaluation reports.


Intellectual Property and Licensing

License for This Model

This distilled model (MEscriva/gilbert-whisper-distil-fr-v0.2) is distributed under the MIT License, allowing:

  • ✅ Commercial use
  • ✅ Modification
  • ✅ Distribution
  • ✅ Private use
  • ✅ Patent use

See the LICENSE file for full terms.

Intellectual Property Notice

Important: While this model is available under MIT License:

  • All derivative models, fine-tuned variants, and specialized models developed as part of the Gilbert project are the exclusive intellectual property of Lexia France.
  • Use of this model for Gilbert project development implies acceptance of these IP terms.
  • Commercial use of Gilbert project derivatives requires separate licensing agreements.

For licensing inquiries regarding Gilbert project models, contact: [email protected]


Citation

If you use this distilled model in your research, please cite:

@software{gilbert_distil_2024,
  title={Gilbert-Whisper-Distil-FR-v0.2: Distilled Model for Production and Research},
  author={MEscriva and Lexia France},
  year={2024},
  url={https://huggingface.co/MEscriva/gilbert-whisper-distil-fr-v0.2},
  version={0.2},
  note={Distilled Whisper model for fast French speech recognition}
}

Acknowledgments

This distilled model is based on:

  • OpenAI Whisper Large V3 (MIT License)
  • bofenghuang/whisper-large-v3-distil-fr-v0.2 (Original distillation work)

We acknowledge the contributions of:

  • OpenAI for developing and open-sourcing Whisper
  • Bofeng Huang for the French distillation work
  • Hugging Face for implementing Whisper in Transformers and creating Distil-Whisper
  • The open-source community

Contact

For research collaboration, evaluation access, or technical inquiries:


Changelog

Version 0.2 (2024-12-19)

  • Initial Gilbert release
  • Based on bofenghuang/whisper-large-v3-distil-fr-v0.2
  • Extended training for long-form transcription (30-second segments)
  • Patient teacher distillation method
  • Multiple format support (Transformers, Whisper, Faster Whisper, etc.)

© 2024 Lexia France. All rights reserved for Gilbert project derivatives.
