# Parakeet RNNT 110M Danish – ONNX for transformers.js
ONNX export of nvidia/parakeet-rnnt-110m-da-dk for use with transformers.js (v4 branch with Nemo Conformer TDT/RNNT support).
## Model Details

| Property | Value |
|---|---|
| Base model | nvidia/parakeet-rnnt-110m-da-dk |
| Architecture | Conformer encoder + RNN-T decoder (110M params) |
| Language | Danish (da) |
| License | CC-BY-4.0 |
## Files
| File | Precision | Encoder | Decoder | Total |
|---|---|---|---|---|
| FP32 (original) | fp32 | 456 MB | 16 MB | 472 MB |
| FP16 | fp16 | 228 MB | 8 MB | 236 MB |
| INT8+FP16 | q8 | 145 MB | 7 MB | 152 MB |
### ONNX file names

| Precision | Encoder | Decoder |
|---|---|---|
| fp32 | `onnx/encoder_model.onnx` | `onnx/decoder_model_merged.onnx` |
| fp16 | `onnx/encoder_model_fp16.onnx` | `onnx/decoder_model_merged_fp16.onnx` |
| int8+fp16 | `onnx/encoder_model_quantized.onnx` | `onnx/decoder_model_merged_quantized.onnx` |
### INT8 quantization details
The INT8 models use selective dynamic quantization: only feed-forward and projection MatMul/Gemm layers are quantized to int8. Attention score computations (QK, attnV) and normalization layers (LayerNorm, biases) are kept in fp16.
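The selection rule can be sketched as a predicate over graph nodes. This is an illustrative sketch only: the node names and patterns below are hypothetical (real ONNX graphs use the exporter's own naming), and the actual quantization was performed offline with ONNX Runtime's Python tooling, not in JavaScript.

```js
// Illustrative sketch of the selective-quantization rule described above.
// Node names here are made up for the example.
function shouldQuantizeToInt8(node) {
  const isMatMulLike = node.opType === 'MatMul' || node.opType === 'Gemm';
  // Keep attention-score paths (QK, attn*V) and normalization in fp16.
  const isAttentionScore = /attn|qk|softmax/i.test(node.name);
  const isNorm = /layer_?norm|bias/i.test(node.name);
  return isMatMulLike && !isAttentionScore && !isNorm;
}

const exampleNodes = [
  { name: 'layers.0.feed_forward1.linear1/MatMul', opType: 'MatMul' },
  { name: 'layers.0.self_attn.qk_scores/MatMul', opType: 'MatMul' },
  { name: 'layers.0.norm_ff/LayerNormalization', opType: 'LayerNormalization' },
];
const int8Targets = exampleNodes.filter(shouldQuantizeToInt8).map((n) => n.name);
console.log(int8Targets); // → [ 'layers.0.feed_forward1.linear1/MatMul' ]
```

Only the feed-forward MatMul passes the filter; the attention-score MatMul and the normalization node stay in fp16.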
## How It Works

This model uses the nemo-conformer-tdt model type in transformers.js. The RNNT architecture is a special case of TDT: the only difference is that RNNT has no duration prediction head. The transformers.js decoder loop automatically falls back to standard RNNT frame advancement when no duration logits are present.
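That fallback can be sketched as a simplified greedy decode loop. This is an illustrative sketch, not the transformers.js internals; `joint(t, lastToken)` is a stand-in for one pass through the exported decoder+joint network on encoder frame `t`, and the mock below is invented for the example.

```js
const BLANK = 44; // blank id: 44 SentencePiece tokens + 1 blank = 45

// Simplified greedy RNN-T decode (illustrative sketch).
function greedyDecodeRNNT(joint, numFrames, maxSymbolsPerFrame = 10) {
  const argmax = (a) => a.reduce((best, v, i, arr) => (v > arr[best] ? i : best), 0);
  const tokens = [];
  let last = BLANK;
  let t = 0;
  let emitted = 0;
  while (t < numFrames) {
    const k = argmax(joint(t, last));
    if (k === BLANK || emitted >= maxSymbolsPerFrame) {
      // RNNT has no duration head, so a blank always advances exactly one
      // frame; a TDT model would instead jump by its predicted duration here.
      t += 1;
      emitted = 0;
    } else {
      tokens.push(k);
      last = k;
      emitted += 1;
    }
  }
  return tokens;
}

// Mock joint network: emits token 5 at frame 0, token 7 at frame 2, else blank.
const mockJoint = (t, last) => {
  const logits = new Array(45).fill(0);
  if (t === 0 && last === BLANK) logits[5] = 1;
  else if (t === 2 && last === 5) logits[7] = 1;
  else logits[BLANK] = 1;
  return logits;
};

console.log(greedyDecodeRNNT(mockJoint, 3)); // → [ 5, 7 ]
```

The `maxSymbolsPerFrame` cap is a standard safeguard against the loop emitting endlessly on a single frame.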
## Usage with transformers.js

> **Note:** Requires the v4-nemo-conformer-tdt-main branch of transformers.js, which adds Nemo Conformer TDT/RNNT support.

```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'hlevring/parakeet-rnnt-110m-da-dk-onnx',
);

const result = await transcriber(audioData);
console.log(result.text);
```
Or with explicit model loading for more control:

```js
import {
  AutoProcessor,
  AutoTokenizer,
  NemoConformerForTDT,
  AutomaticSpeechRecognitionPipeline,
} from '@huggingface/transformers';

const modelId = 'hlevring/parakeet-rnnt-110m-da-dk-onnx';

const [processor, tokenizer, model] = await Promise.all([
  AutoProcessor.from_pretrained(modelId),
  AutoTokenizer.from_pretrained(modelId),
  NemoConformerForTDT.from_pretrained(modelId, {
    dtype: { encoder_model: 'fp16', decoder_model_merged: 'fp16' },
  }),
]);

const pipe = new AutomaticSpeechRecognitionPipeline({
  task: 'automatic-speech-recognition',
  model,
  processor,
  tokenizer,
});

const result = await pipe(audioData, { return_timestamps: true });
console.log(result.text);
console.log(result.words);
```
### Available dtype options

- `'fp32'` – Full precision (472 MB total)
- `'fp16'` – Half precision (236 MB total)
- `'q8'` – INT8 quantized with fp16 remainder (152 MB total)

You can also mix dtypes per component:

```js
dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }
```
## Model Architecture
- Encoder: Conformer with 80-dim log-mel features, subsampling factor 8
- Decoder: Single-layer LSTM (hidden size 640) with RNN-T joint network
- Vocab: 44 SentencePiece tokens (Danish characters) + 1 blank token = 45 total
- Sample rate: 16 kHz
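Audio must be 16 kHz mono before it reaches the model. When the source rate differs, it has to be resampled first; a minimal linear-interpolation sketch (illustrative only, and not what transformers.js uses internally; production pipelines typically prefer a windowed-sinc or library resampler for quality):

```js
// Linear-interpolation resampling sketch (illustrative).
function resampleLinear(samples, fromRate, toRate = 16000) {
  if (fromRate === toRate) return Float32Array.from(samples);
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate; // fractional position in the source
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    out[i] = samples[i0] + (samples[i1] - samples[i0]) * (pos - i0);
  }
  return out;
}

// Downsampling 32 kHz audio to 16 kHz halves the sample count.
const at16k = resampleLinear([0, 1, 2, 3], 32000, 16000);
```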
## Audio Preprocessing
| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Mel bins | 80 |
| FFT size | 512 |
| Window length | 400 samples (25 ms) |
| Hop length | 160 samples (10 ms) |
| Preemphasis | 0.97 |
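The parameters above correspond to a standard preemphasis + framing front end. The bundled processor computes the actual features; the sketch below just shows the two pieces of bookkeeping implied by the table (the frame-count formula ignores any edge padding the real feature extractor may apply):

```js
// Preemphasis filter from the table: y[n] = x[n] - 0.97 * x[n-1]
function preemphasis(x, coeff = 0.97) {
  const y = new Float32Array(x.length);
  y[0] = x[0];
  for (let n = 1; n < x.length; n++) y[n] = x[n] - coeff * x[n - 1];
  return y;
}

// Number of analysis frames for `numSamples` samples with a 400-sample
// (25 ms) window and 160-sample (10 ms) hop, ignoring edge padding.
function numFrames(numSamples, winLength = 400, hopLength = 160) {
  if (numSamples < winLength) return 0;
  return 1 + Math.floor((numSamples - winLength) / hopLength);
}

// One second of 16 kHz audio: 1 + floor((16000 - 400) / 160) = 98 frames.
console.log(numFrames(16000)); // → 98
```

The Conformer encoder then subsamples these mel frames by the factor of 8 noted under Model Architecture.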
## ONNX Export Details

Exported from NeMo 2.7.0 using `model.export()`. The encoder and the decoder+joint network are exported as separate ONNX files.
### ONNX I/O Names

Encoder:

- Inputs: `audio_signal` [B, 80, T], `length` [B]
- Outputs: `outputs` [B, 512, T'], `encoded_lengths` [B]

Decoder:

- Inputs: `encoder_outputs` [B, 512, 1], `targets` [B, 1], `target_length` [B], `input_states_1` [1, B, 640], `input_states_2` [1, B, 640]
- Outputs: `outputs` [B, 1, 1, 45], `prednet_lengths` [B], `output_states_1` [1, B, 640], `output_states_2` [1, B, 640]
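For anyone wiring the decoder step manually (e.g. with onnxruntime-web), the per-step feeds for batch size 1 follow directly from the shapes above. A small shape-bookkeeping sketch; the `{ dims, data }` objects and the choice of Int32Array for the token inputs are assumptions for illustration, not a documented contract:

```js
// Shapes for one decoder step at batch size B = 1, from the I/O table above.
const DECODER_INPUT_SHAPES = {
  encoder_outputs: [1, 512, 1], // one encoder frame
  targets: [1, 1],              // previous token id
  target_length: [1],
  input_states_1: [1, 1, 640],  // LSTM hidden state
  input_states_2: [1, 1, 640],  // LSTM cell state
};

const numel = (shape) => shape.reduce((a, d) => a * d, 1);

// Zero-filled buffers for the initial step (blank history, zero LSTM states).
// Integer dtype for token inputs is assumed here, not confirmed by the export.
function initialDecoderFeeds() {
  const feeds = {};
  for (const [name, shape] of Object.entries(DECODER_INPUT_SHAPES)) {
    const isInt = name === 'targets' || name === 'target_length';
    feeds[name] = {
      dims: shape,
      data: isInt ? new Int32Array(numel(shape)) : new Float32Array(numel(shape)),
    };
  }
  return feeds;
}
```

Between steps, `output_states_1`/`output_states_2` from one run would be fed back as `input_states_1`/`input_states_2` for the next.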
## Acknowledgments
- Original model by NVIDIA NeMo
- transformers.js Nemo Conformer TDT/RNNT support by @ysdede