Parakeet RNNT 110M Danish - ONNX for transformers.js

ONNX export of nvidia/parakeet-rnnt-110m-da-dk for use with transformers.js (v4 branch with Nemo Conformer TDT/RNNT support).

Model Details

  • Base model: nvidia/parakeet-rnnt-110m-da-dk
  • Architecture: Conformer encoder + RNN-T decoder (110M params)
  • Language: Danish (da)
  • License: CC-BY-4.0

Files

| Variant | Precision | Encoder | Decoder | Total |
|---|---|---|---|---|
| FP32 (original) | fp32 | 456 MB | 16 MB | 472 MB |
| FP16 | fp16 | 228 MB | 8 MB | 236 MB |
| INT8+FP16 | q8 | 145 MB | 7 MB | 152 MB |

ONNX file names

| Precision | Encoder | Decoder |
|---|---|---|
| fp32 | onnx/encoder_model.onnx | onnx/decoder_model_merged.onnx |
| fp16 | onnx/encoder_model_fp16.onnx | onnx/decoder_model_merged_fp16.onnx |
| int8+fp16 | onnx/encoder_model_quantized.onnx | onnx/decoder_model_merged_quantized.onnx |
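transformers.js resolves one of these files from the requested dtype. As a rough sketch, the suffix mapping below is inferred from the file names listed above, not taken from the transformers.js source:

```javascript
// Hypothetical helper mirroring the file-name pattern in the table above.
// The dtype-to-suffix mapping ('', '_fp16', '_quantized') is an assumption
// based on the listed files.
function onnxFileName(component, dtype) {
  const suffix = { fp32: '', fp16: '_fp16', q8: '_quantized' }[dtype];
  if (suffix === undefined) throw new Error(`unknown dtype: ${dtype}`);
  return `onnx/${component}${suffix}.onnx`;
}
```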

INT8 quantization details

The INT8 models use selective dynamic quantization: only feed-forward and projection MatMul/Gemm layers are quantized to int8. Attention score computations (QK, attnV) and normalization layers (LayerNorm, biases) are kept in fp16.
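The quantize/keep split can be pictured as a per-node filter over the ONNX graph. A minimal sketch, with hypothetical node names; the real exported graph uses different identifiers:

```javascript
// Illustrative sketch of the selective-quantization rule described above:
// quantize feed-forward/projection MatMul and Gemm nodes, keep attention
// scores and everything else in fp16. Node-name patterns are hypothetical.
function shouldQuantize(node) {
  const isMatMulLike = node.opType === 'MatMul' || node.opType === 'Gemm';
  // Attention-score MatMuls (Q*K and attn*V) stay in fp16.
  const isAttentionScore = /attn.*(qk|score|attn_v)/i.test(node.name);
  return isMatMulLike && !isAttentionScore;
}
```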

How It Works

This model uses the nemo-conformer-tdt model type in transformers.js. RNNT is a special case of TDT: the architectures are identical except that RNNT has no duration prediction head. The transformers.js decoder loop automatically falls back to standard RNNT frame advancement when no duration logits are present.
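The fallback can be sketched as a greedy decode loop in which the duration head is optional. Everything below is a simplified stand-in for the real implementation: `joint` mocks the joint network, and duration logit indices are assumed to map directly to frame skips:

```javascript
// Minimal sketch of greedy TDT/RNNT decoding. With duration logits (TDT),
// the predicted duration drives frame advancement; without them (RNNT),
// advance one frame on blank and stay on the frame after a non-blank token.
function greedyDecode(numFrames, joint, blankId) {
  const tokens = [];
  let t = 0;
  while (t < numFrames) {
    const { tokenLogits, durationLogits } = joint(t, tokens);
    const tokenId = argmax(tokenLogits);
    if (tokenId !== blankId) tokens.push(tokenId);
    if (durationLogits) {
      // TDT: skip ahead by the predicted duration (index = frames, assumed).
      t += argmax(durationLogits);
    } else if (tokenId === blankId) {
      // RNNT fallback: advance exactly one frame on blank.
      t += 1;
    }
    // RNNT: after a non-blank token, re-query the same frame.
  }
  return tokens;
}

function argmax(a) {
  return a.indexOf(Math.max(...a));
}
```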

Usage with transformers.js

Note: Requires the v4-nemo-conformer-tdt-main branch of transformers.js, which adds Nemo Conformer TDT/RNNT support.

import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'hlevring/parakeet-rnnt-110m-da-dk-onnx',
);

const result = await transcriber(audioData);
console.log(result.text);

Or with explicit model loading for more control:

import {
  AutoProcessor,
  AutoTokenizer,
  NemoConformerForTDT,
  AutomaticSpeechRecognitionPipeline,
} from '@huggingface/transformers';

const modelId = 'hlevring/parakeet-rnnt-110m-da-dk-onnx';

const [processor, tokenizer, model] = await Promise.all([
  AutoProcessor.from_pretrained(modelId),
  AutoTokenizer.from_pretrained(modelId),
  NemoConformerForTDT.from_pretrained(modelId, {
    dtype: { encoder_model: 'fp16', decoder_model_merged: 'fp16' },
  }),
]);

const pipe = new AutomaticSpeechRecognitionPipeline({
  task: 'automatic-speech-recognition',
  model,
  processor,
  tokenizer,
});

const result = await pipe(audioData, { return_timestamps: true });
console.log(result.text);
console.log(result.words);

Available dtype options

  • 'fp32' - Full precision (472 MB total)
  • 'fp16' - Half precision (236 MB total)
  • 'q8' - INT8 quantized with fp16 remainder (152 MB total)

You can also mix dtypes per component:

dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }

Model Architecture

  • Encoder: Conformer with 80-dim log-mel features, subsampling factor 8
  • Decoder: Single-layer LSTM (hidden size 640) with RNN-T joint network
  • Vocab: 44 SentencePiece tokens (Danish characters) + 1 blank token = 45 total
  • Sample rate: 16 kHz

Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Mel bins | 80 |
| FFT size | 512 |
| Window length | 400 samples (25 ms) |
| Hop length | 160 samples (10 ms) |
| Preemphasis | 0.97 |
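From these values, a 10 ms hop yields roughly 100 mel frames per second, and the encoder's subsampling factor of 8 reduces that to about 12.5 output frames per second. A back-of-the-envelope calculation; the exact edge-padding behavior is an assumption, not NeMo's formula:

```javascript
// Rough frame arithmetic from the preprocessing table above.
const HOP = 160;       // 10 ms at 16 kHz
const SUBSAMPLING = 8; // Conformer subsampling factor

function approxEncoderFrames(numSamples) {
  const melFrames = Math.floor(numSamples / HOP) + 1; // ~100 frames/s
  return Math.ceil(melFrames / SUBSAMPLING);          // ~12.5 frames/s
}
```

Under this approximation, one second of audio (16000 samples) gives 101 mel frames and 13 encoder frames.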

ONNX Export Details

Exported from NeMo 2.7.0 using model.export(). The encoder and decoder+joint network are exported as separate ONNX files.

ONNX I/O Names

Encoder:

  • Inputs: audio_signal [B, 80, T], length [B]
  • Outputs: outputs [B, 512, T'], encoded_lengths [B]

Decoder:

  • Inputs: encoder_outputs [B, 512, 1], targets [B, 1], target_length [B], input_states_1 [1, B, 640], input_states_2 [1, B, 640]
  • Outputs: outputs [B, 1, 1, 45], prednet_lengths [B], output_states_1 [1, B, 640], output_states_2 [1, B, 640]
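For manual inference outside transformers.js, the decoder inputs above can be assembled into a feeds object per decode step. In this sketch, `tensor` is a plain-object stand-in for an onnxruntime Tensor, and the zero-initialized LSTM states for the first step and int32 targets are assumptions:

```javascript
// Sketch of one decoder step's input feeds, using the I/O names listed above.
const HIDDEN = 640;

function tensor(dims, data) {
  return { dims, data }; // stand-in for an onnxruntime Tensor
}

function decoderFeeds(encFrame, prevToken, batch = 1) {
  return {
    encoder_outputs: tensor([batch, 512, 1], encFrame),
    targets: tensor([batch, 1], Int32Array.from([prevToken])),
    target_length: tensor([batch], Int32Array.from([1])),
    // Zero states for the first step (assumed); thereafter, feed back
    // output_states_1 / output_states_2 from the previous step.
    input_states_1: tensor([1, batch, HIDDEN], new Float32Array(batch * HIDDEN)),
    input_states_2: tensor([1, batch, HIDDEN], new Float32Array(batch * HIDDEN)),
  };
}
```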

Acknowledgments

  • Original model by NVIDIA NeMo
  • transformers.js Nemo Conformer TDT/RNNT support by @ysdede