A newer model is available — please use syvai/hviske-v5.3 instead. v5.3 is the current recommended Danish ASR model from this family and reaches 13.91% strict WER on the CoRal v3 full test set (beam=5). This v5 checkpoint is kept for reproducibility only.
A 2B-parameter Conformer encoder-decoder ASR model trained for Danish speech recognition.
Trained on 3.5M samples (16,000 hours) of Danish speech across 7 datasets.
| Version | Training data | Avg WER |
|---|---|---|
| Base architecture | — | >100% (no Danish pretraining) |
| hviske-v1 | CoRal-v3 | ~20% |
| hviske-v4 | + nota + ftspeech | 6.7% (parliamentary) |
| hviske-v5 | + VoxPopuli + nst-da + Common Voice | 14.1% (multi-domain avg) |
Evaluated on 200 samples per dataset (WER / CER):
| Dataset | Domain | WER | CER |
|---|---|---|---|
| VoxPopuli | European Parliament | 11.8% | 6.3% |
| nota | Broadcast media | 5.7% | 1.8% |
| ftspeech | Danish Parliament | 6.4% | 3.1% |
| CoRal-v3 read_aloud | Read-aloud speech | 19.2% | 7.0% |
| CoRal-v3 conversation | Conversational | 17.1% | 9.3% |
| nst-da | General Danish | 14.0% | 8.4% |
| Common Voice 17 | Crowd-sourced | 24.3% | 7.4% |
| Average | | 14.1% | 6.2% |
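For reference, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to align the hypothesis with the reference, divided by the reference length in words (WER) or characters (CER). The sketch below is a minimal pure-Python version, assuming whitespace tokenization and no text normalization. It is not the evaluation script behind the numbers above, and the function names are illustrative. It also shows why WER can exceed 100%, as in the base-architecture row, and that the Average row is consistent with an unweighted mean over the seven datasets.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference item missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra item in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes whitespace tokenization, no normalization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the raw strings (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One deleted word out of four reference words
print(wer("jeg bor i danmark", "jeg bor danmark"))  # 0.25

# Two inserted words against a one-word reference: WER above 100%
print(wer("ja", "ja ja ja"))  # 2.0

# Unweighted mean of the per-dataset values from the table above
table_wer = [11.8, 5.7, 6.4, 19.2, 17.1, 14.0, 24.3]
table_cer = [6.3, 1.8, 3.1, 7.0, 9.3, 8.4, 7.4]
avg_wer = round(sum(table_wer) / len(table_wer), 1)
avg_cer = round(sum(table_cer) / len(table_cer), 1)
print(avg_wer, avg_cer)  # 14.1 6.2
```

Because the mean is unweighted, a small test set such as Common Voice 17 counts as much as VoxPopuli in the reported average.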
Training data by dataset:
| Dataset | Samples | Hours | Description |
|---|---|---|---|
| VoxPopuli Danish | 1,775,578 | ~13,600 | European Parliament recordings |
| ftspeech | 995,677 | ~1,400 | Danish Parliament (Folketinget) |
| CoRal-v3 read_aloud | 299,255 | ~400 | Read-aloud Danish speech |
| nst-da | 182,605 | ~250 | NST Danish speech corpus |
| CoRal-v3 conversation | 147,249 | ~200 | Conversational Danish |
| nota | 98,600 | ~270 | Danish broadcast media |
| Common Voice 17 | 3,484 | ~5 | Crowd-sourced Danish |
| Total | ~3.5M | ~16,000 | |
A unified version of all training data is available at syvai/danish-asr-unified.
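The totals row can be checked directly from the per-dataset figures. The snippet below copies the sample counts and approximate hours from the table and sums them. It also derives a mean clip length per dataset, which is only as precise as the rounded hour figures, so treat it as a rough sanity check.

```python
# (samples, approximate hours) per dataset, copied from the table above
training_sets = {
    "VoxPopuli Danish": (1_775_578, 13_600),
    "ftspeech": (995_677, 1_400),
    "CoRal-v3 read_aloud": (299_255, 400),
    "nst-da": (182_605, 250),
    "CoRal-v3 conversation": (147_249, 200),
    "nota": (98_600, 270),
    "Common Voice 17": (3_484, 5),
}

total_samples = sum(samples for samples, _ in training_sets.values())
total_hours = sum(hours for _, hours in training_sets.values())
print(f"{total_samples:,} samples, ~{total_hours:,} hours")
# 3,502,448 samples, ~16,125 hours

# Rough mean clip length in seconds, derived from the rounded hour figures
mean_clip_seconds = {
    name: hours * 3600 / samples
    for name, (samples, hours) in training_sets.items()
}
for name, seconds in mean_clip_seconds.items():
    print(f"{name}: ~{seconds:.1f} s per sample")
```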
```bash
pip install transformers torch soundfile librosa
```
```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf

# trust_remote_code=True lets transformers load the custom code shipped with the checkpoint
processor = AutoProcessor.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = model.to("cuda")  # optional, for GPU inference

# Read the waveform and its sample rate from disk
audio, sr = sf.read("audio.wav")

transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=[audio],
    sample_rates=[sr],
    language="da",
    punctuation=True,
)
print(transcriptions[0])
```
```python
import soundfile as sf

# Batch transcription: reuses the `processor` and `model` loaded above
files = ["file1.wav", "file2.wav", "file3.wav"]

# Collect one waveform and one sample rate per file
arrays, rates = [], []
for f in files:
    audio, sr = sf.read(f)
    arrays.append(audio)
    rates.append(sr)

transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=arrays,
    sample_rates=rates,
    language="da",
    punctuation=True,
)

for f, t in zip(files, transcriptions):
    print(f"{f}: {t}")
```
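For longer file lists, loading every array into memory and passing them all in one call may not be practical. One possible pattern, sketched below, is to process the list in fixed-size chunks. The helper names (`chunked`, `transcribe_files`) and the default `batch_size=8` are illustrative and not part of the model's API; only the `model.transcribe(...)` call with the arguments shown above is taken from this card.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def transcribe_files(model, processor, files: Sequence[str],
                     batch_size: int = 8) -> Dict[str, str]:
    """Transcribe `files` chunk by chunk and return {path: transcript}."""
    # Imported here so that chunked() stays usable without audio dependencies
    import soundfile as sf

    results: Dict[str, str] = {}
    for batch in chunked(files, batch_size):
        arrays, rates = [], []
        for path in batch:
            audio, sr = sf.read(path)
            arrays.append(audio)
            rates.append(sr)
        texts = model.transcribe(
            processor=processor,
            audio_arrays=arrays,
            sample_rates=rates,
            language="da",
            punctuation=True,
        )
        results.update(zip(batch, texts))
    return results


# Example: five paths split into chunks of two
print(list(chunked(["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"], 2)))
# [['a.wav', 'b.wav'], ['c.wav', 'd.wav'], ['e.wav']]
```

A suitable batch size depends on clip length and available GPU memory, so the default here is only a starting point.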
For production workloads, serve the model with vLLM for significantly higher throughput:
```bash
pip install vllm
vllm serve syvai/hviske-v5 --trust-remote-code
```
Then send requests:
```python
import base64
import io

import requests
import soundfile as sf

# Re-encode the audio as WAV in memory, then base64-encode it for the JSON body
audio, sr = sf.read("audio.wav")
buf = io.BytesIO()
sf.write(buf, audio, sr, format="WAV")
audio_b64 = base64.b64encode(buf.getvalue()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "syvai/hviske-v5",
    "messages": [{"role": "user", "content": [
        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}}
    ]}],
})
response.raise_for_status()  # fail early on HTTP errors instead of on a missing key
print(response.json()["choices"][0]["message"]["content"])
```
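If the input is already a WAV file, decoding and re-encoding it with soundfile may be unnecessary: the raw file bytes can be base64-encoded as they are, assuming the server accepts the file's encoding. The helpers below are a sketch with hypothetical names (`build_audio_request`, `extract_text`) that build the same request body as above and pull the transcript out of the response. The payload schema is copied from the example above rather than independently verified against the server.

```python
import base64


def build_audio_request(wav_bytes: bytes, model: str = "syvai/hviske-v5") -> dict:
    """Build a chat-completions request body carrying one base64-encoded WAV clip."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ]}],
    }


def extract_text(response_json: dict) -> str:
    """Return the transcript from a chat-completions style response."""
    return response_json["choices"][0]["message"]["content"]


# Usage (requires a running server and the `requests` package):
#   from pathlib import Path
#   import requests
#   payload = build_audio_request(Path("audio.wav").read_bytes())
#   r = requests.post("http://localhost:8000/v1/chat/completions",
#                     json=payload, timeout=60)
#   r.raise_for_status()
#   print(extract_text(r.json()))
```

Keeping payload construction separate from the HTTP call makes it possible to test the request shape without a running server.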
This model is released under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).