Cosmobillian / turkish_whisper_for_noisy_datas

A Whisper-large-v3 model fine-tuned for noisy Turkish speech recognition (short utterances, real-world environments).


🔎 Model Summary

  • Base model: openai/whisper-large-v3
  • Language: Turkish (tr)
  • Task: Automatic Speech Recognition (ASR) – Transcription
  • Domain: Noisy / real-world audio (street, phone mic, background noise, reverb, etc.)
  • Input audio: mono, 16 kHz, short segments (≈ 3–8 seconds)
  • Fine-tuning type: Decoder-only fine-tuning of the full checkpoint (encoder frozen)

This model is designed to perform robust speech-to-text for noisy Turkish audio, especially:

  • mobile / cheap microphone recordings
  • mild background music or chatter
  • echo / reverb (rooms, corridors etc.)

It is no longer a general multilingual model: decoding is heavily biased towards Turkish.


✅ Intended Use

Primary use-case:

  • Transcribing short Turkish utterances with background noise (e.g. real calls, vlogs, “in the wild” recordings).

Good for:

  • Prototypes of Turkish ASR systems
  • Voice-enabled assistants for Turkish users
  • Noisy datasets (phone, street, public places, YouTube-like content)

Not ideal for:

  • Long-form audio without chunking (podcasts, single recordings longer than about a minute)
  • High-stakes applications (medical/legal dictation) without manual review
  • Clean studio speech where smaller Whisper models already perform very well

⚙️ Training Details

Note: This is a custom fine-tuned model; base capabilities come from openai/whisper-large-v3.

  • Base model: openai/whisper-large-v3
  • Fine-tuned on: Private Turkish dataset of short (~5s) audio clips
    • Noisy, real-world conditions
    • Paired with manually prepared transcriptions
  • Sampling rate: 16 kHz, mono
  • Loss: Cross-entropy with label smoothing
  • Strategy:
    • Encoder frozen (only decoder fine-tuned)
    • Small learning rate to avoid catastrophic forgetting
    • Short training (1 epoch) to adapt to noise style while preserving base knowledge
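
The strategy above (frozen encoder, small learning rate, label-smoothed cross-entropy) can be sketched in plain PyTorch. This is a minimal illustration, not the actual training script: `TinySeq2Seq` is a hypothetical stand-in for the Whisper encoder-decoder, and the learning rate (1e-5) and label smoothing value (0.1) are assumed, since the card does not state them.

```python
import torch
from torch import nn


class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in for an encoder-decoder ASR model."""

    def __init__(self, n_feats=8, d_model=16, vocab_size=32):
        super().__init__()
        self.encoder = nn.Linear(n_feats, d_model)
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, feats):
        return self.decoder(torch.relu(self.encoder(feats)))


def freeze_encoder(model):
    """Disable gradients for every encoder parameter."""
    for param in model.encoder.parameters():
        param.requires_grad = False


torch.manual_seed(0)
model = TinySeq2Seq()
freeze_encoder(model)

# Only parameters that still require gradients (the decoder) are optimized.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)   # assumed value
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)  # assumed value

# One toy training step on random features and labels.
feats = torch.randn(4, 8)
labels = torch.randint(0, 32, (4,))
enc_before = model.encoder.weight.detach().clone()
dec_before = model.decoder.weight.detach().clone()

loss = loss_fn(model(feats), labels)
loss.backward()
optimizer.step()

print("encoder unchanged:", torch.equal(model.encoder.weight, enc_before))
print("decoder updated:", not torch.equal(model.decoder.weight, dec_before))
```

Passing only the trainable parameters to the optimizer keeps the frozen encoder out of the optimizer state, which also reduces memory use during fine-tuning.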

The exact dataset is not public; treat this model as research / experimental.


📊 Evaluation

The model has been manually checked on several noisy Turkish utterances. Qualitatively:

  • Much more robust to background noise than vanilla Whisper on the same custom data
  • Better handling of casual/spontaneous speech (hesitations, filler words, etc.)
  • Occasionally produces grammatically imperfect sentences (as expected from ASR)

There is no official WER benchmark on a public dataset yet (e.g. Common Voice, FLEURS).
If you use this model in a paper or product, please:

  • Benchmark on your own dev/test set
  • Share WER / CER numbers if possible 🙏
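
For benchmarking on your own dev/test set, WER and CER can be computed with a plain edit distance. The sketch below is dependency-free; the helper names (`edit_distance`, `wer`, `cer`) and the example sentences are illustrative and not part of this model's tooling. It assumes the reference and hypothesis are already normalized (casing, punctuation), which needs extra care for Turkish because of the dotted and dotless i.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "ben hemen eve gidiyorum"
hypothesis = "ben eve gidiyorum"  # one word dropped by the recognizer

print(f"WER: {wer(reference, hypothesis):.2f}")  # 1 deleted word / 4 words -> 0.25
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Over a whole test set, sum the edit counts and the reference lengths across utterances before dividing, rather than averaging per-utterance rates.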

🚀 Quickstart (Hugging Face Transformers)

!pip install -q transformers soundfile librosa

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "Cosmobillian/turkish_whisper_for_noisy_datas"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Force the language/task prompt (Turkish + transcribe)
forced_ids = processor.get_decoder_prompt_ids(
    language="turkish",
    task="transcribe",
)
model.config.forced_decoder_ids = forced_ids
if hasattr(model, "generation_config"):
    model.generation_config.forced_decoder_ids = forced_ids


def load_audio(path, target_sr=16000):
    audio, sr = librosa.load(path, sr=None, mono=True)
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
        sr = target_sr
    return audio, sr


def chunked_transcribe(path, chunk_sec=30.0, stride_sec=5.0, max_new_tokens=256):
    speech, sr = load_audio(path, 16000)

    chunk_size = int(chunk_sec * sr)
    stride_size = int(stride_sec * sr)

    texts = []
    start = 0

    while start < len(speech):
        end = start + chunk_size
        chunk = speech[start:end]

        if len(chunk) == 0:
            break

        inputs = processor(
            chunk,
            sampling_rate=sr,
            return_tensors="pt",
        )
        input_features = inputs.input_features.to(device)

        with torch.no_grad():
            generated_ids = model.generate(
                input_features,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                num_beams=1,
                no_repeat_ngram_size=3,
                repetition_penalty=1.2,
            )

        text = processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0]

        texts.append(text)

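        # Stop once this chunk reaches the end of the audio; otherwise the
        # overlap would yield a redundant short tail chunk.
        if end >= len(speech):
            break

        # Move to the next chunk, overlapping the previous one by stride_size samples.
        # Overlapping regions are transcribed twice, so words may repeat at the seams.
        start = end - stride_size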

    return " ".join(texts)


# EXAMPLE USAGE
AUDIO_PATH = "/content/uzun_kayit.wav"
full_text = chunked_transcribe(AUDIO_PATH, chunk_sec=30, stride_sec=5, max_new_tokens=256)

print("Full transcription:\n")
print(full_text)
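
The windowing used by `chunked_transcribe` can be checked without loading the model. The sketch below assumes the loop stops as soon as a window reaches the end of the audio; `chunk_windows` is a hypothetical helper that is not part of the code above.

```python
def chunk_windows(n_samples, chunk_size, stride_size):
    """Return (start, end) sample windows that overlap by stride_size samples."""
    if chunk_size <= stride_size:
        raise ValueError("chunk_size must be larger than stride_size")
    windows = []
    start = 0
    while start < n_samples:
        end = min(start + chunk_size, n_samples)
        windows.append((start, end))
        if end >= n_samples:
            break  # the last window already covers the tail
        start = end - stride_size
    return windows


sr = 16000
# 70 s of audio, 30 s chunks, 5 s overlap between neighbouring chunks
windows = chunk_windows(70 * sr, 30 * sr, 5 * sr)
for start, end in windows:
    print(f"{start // sr} s to {end // sr} s")
```

For the 70-second example this produces three windows (0 to 30 s, 25 to 55 s, 50 to 70 s). Each step advances by `chunk_size - stride_size`, so the stride must stay smaller than the chunk for the loop to make progress.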