Parakeet RNNT 110M Danish - ONNX for transformers.js

ONNX export of nvidia/parakeet-rnnt-110m-da-dk for use with transformers.js (v4 branch with Nemo Conformer TDT/RNNT support).

Model Details

  • Base model: nvidia/parakeet-rnnt-110m-da-dk
  • Architecture: Conformer encoder + RNN-T decoder (110M params)
  • Language: Danish (da)
  • License: CC-BY-4.0

Files

| Variant | Precision | Encoder | Decoder | Total |
|---|---|---|---|---|
| FP32 (original) | fp32 | 456 MB | 16 MB | 472 MB |
| FP16 | fp16 | 228 MB | 8 MB | 236 MB |
| INT8+FP16 | q8 | 145 MB | 7 MB | 152 MB |

ONNX file names

| Precision | Encoder | Decoder |
|---|---|---|
| fp32 | onnx/encoder_model.onnx | onnx/decoder_model_merged.onnx |
| fp16 | onnx/encoder_model_fp16.onnx | onnx/decoder_model_merged_fp16.onnx |
| int8+fp16 | onnx/encoder_model_quantized.onnx | onnx/decoder_model_merged_quantized.onnx |
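transformers.js resolves one of these files from the requested dtype. As a rough sketch, the suffix mapping below is inferred from the file names listed above, not taken from the transformers.js source:

```javascript
// Hypothetical helper mirroring the file-name pattern in the table above.
// The dtype-to-suffix mapping ('', '_fp16', '_quantized') is an assumption
// based on the listed files.
function onnxFileName(component, dtype) {
  const suffix = { fp32: '', fp16: '_fp16', q8: '_quantized' }[dtype];
  if (suffix === undefined) throw new Error(`unknown dtype: ${dtype}`);
  return `onnx/${component}${suffix}.onnx`;
}
```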

INT8 quantization details

The INT8 models use selective dynamic quantization: only feed-forward and projection MatMul/Gemm layers are quantized to int8. Attention score computations (QK, attnV) and normalization layers (LayerNorm, biases) are kept in fp16.
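The quantize/keep split can be pictured as a per-node filter over the ONNX graph. A minimal sketch, with hypothetical node names; the real exported graph uses different identifiers:

```javascript
// Illustrative sketch of the selective-quantization rule described above:
// quantize feed-forward/projection MatMul and Gemm nodes, keep attention
// scores and everything else in fp16. Node-name patterns are hypothetical.
function shouldQuantize(node) {
  const isMatMulLike = node.opType === 'MatMul' || node.opType === 'Gemm';
  // Attention-score MatMuls (Q*K and attn*V) stay in fp16.
  const isAttentionScore = /attn.*(qk|score|attn_v)/i.test(node.name);
  return isMatMulLike && !isAttentionScore;
}
```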

How It Works

This model uses the nemo-conformer-tdt model type in transformers.js. RNNT is a special case of TDT: the architectures are identical except that RNNT has no duration prediction head. The transformers.js decoder loop automatically falls back to standard RNNT frame advancement when no duration logits are present.
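The fallback can be sketched as a greedy decode loop in which the duration head is optional. Everything below is a simplified stand-in for the real implementation: `joint` mocks the joint network, and duration logit indices are assumed to map directly to frame skips:

```javascript
// Minimal sketch of greedy TDT/RNNT decoding. With duration logits (TDT),
// the predicted duration drives frame advancement; without them (RNNT),
// advance one frame on blank and stay on the frame after a non-blank token.
function greedyDecode(numFrames, joint, blankId) {
  const tokens = [];
  let t = 0;
  while (t < numFrames) {
    const { tokenLogits, durationLogits } = joint(t, tokens);
    const tokenId = argmax(tokenLogits);
    if (tokenId !== blankId) tokens.push(tokenId);
    if (durationLogits) {
      // TDT: skip ahead by the predicted duration (index = frames, assumed).
      t += argmax(durationLogits);
    } else if (tokenId === blankId) {
      // RNNT fallback: advance exactly one frame on blank.
      t += 1;
    }
    // RNNT: after a non-blank token, re-query the same frame.
  }
  return tokens;
}

function argmax(a) {
  return a.indexOf(Math.max(...a));
}
```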

Usage with transformers.js

Note: Requires the v4-nemo-conformer-tdt-main branch of transformers.js, which adds Nemo Conformer TDT/RNNT support.

import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'hlevring/parakeet-rnnt-110m-da-dk-onnx',
);

const result = await transcriber(audioData);
console.log(result.text);

Or with explicit model loading for more control:

import {
  AutoProcessor,
  AutoTokenizer,
  NemoConformerForTDT,
  AutomaticSpeechRecognitionPipeline,
} from '@huggingface/transformers';

const modelId = 'hlevring/parakeet-rnnt-110m-da-dk-onnx';

const [processor, tokenizer, model] = await Promise.all([
  AutoProcessor.from_pretrained(modelId),
  AutoTokenizer.from_pretrained(modelId),
  NemoConformerForTDT.from_pretrained(modelId, {
    dtype: { encoder_model: 'fp16', decoder_model_merged: 'fp16' },
  }),
]);

const pipe = new AutomaticSpeechRecognitionPipeline({
  task: 'automatic-speech-recognition',
  model,
  processor,
  tokenizer,
});

const result = await pipe(audioData, { return_timestamps: true });
console.log(result.text);
console.log(result.words);

Available dtype options

  • 'fp32' - Full precision (472 MB total)
  • 'fp16' - Half precision (236 MB total)
  • 'q8' - INT8 quantized with fp16 remainder (152 MB total)

You can also mix dtypes per component:

dtype: { encoder_model: 'fp16', decoder_model_merged: 'q8' }

Model Architecture

  • Encoder: Conformer with 80-dim log-mel features, subsampling factor 8
  • Decoder: Single-layer LSTM (hidden size 640) with RNN-T joint network
  • Vocab: 44 SentencePiece tokens (Danish characters) + 1 blank token = 45 total
  • Sample rate: 16 kHz

Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Mel bins | 80 |
| FFT size | 512 |
| Window length | 400 samples (25 ms) |
| Hop length | 160 samples (10 ms) |
| Preemphasis | 0.97 |
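From these values, a 10 ms hop yields roughly 100 mel frames per second, and the encoder's subsampling factor of 8 reduces that to about 12.5 output frames per second. A back-of-the-envelope calculation; the exact edge-padding behavior is an assumption, not NeMo's formula:

```javascript
// Rough frame arithmetic from the preprocessing table above.
const HOP = 160;       // 10 ms at 16 kHz
const SUBSAMPLING = 8; // Conformer subsampling factor

function approxEncoderFrames(numSamples) {
  const melFrames = Math.floor(numSamples / HOP) + 1; // ~100 frames/s
  return Math.ceil(melFrames / SUBSAMPLING);          // ~12.5 frames/s
}
```

Under this approximation, one second of audio (16000 samples) gives 101 mel frames and 13 encoder frames.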

ONNX Export Details

Exported from NeMo 2.7.0 using model.export(). The encoder and decoder+joint network are exported as separate ONNX files.

ONNX I/O Names

Encoder:

  • Inputs: audio_signal [B, 80, T], length [B]
  • Outputs: outputs [B, 512, T'], encoded_lengths [B]

Decoder:

  • Inputs: encoder_outputs [B, 512, 1], targets [B, 1], target_length [B], input_states_1 [1, B, 640], input_states_2 [1, B, 640]
  • Outputs: outputs [B, 1, 1, 45], prednet_lengths [B], output_states_1 [1, B, 640], output_states_2 [1, B, 640]
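For manual inference outside transformers.js, the decoder inputs above can be assembled into a feeds object per decode step. In this sketch, `tensor` is a plain-object stand-in for an onnxruntime Tensor, and the zero-initialized LSTM states for the first step and int32 targets are assumptions:

```javascript
// Sketch of one decoder step's input feeds, using the I/O names listed above.
const HIDDEN = 640;

function tensor(dims, data) {
  return { dims, data }; // stand-in for an onnxruntime Tensor
}

function decoderFeeds(encFrame, prevToken, batch = 1) {
  return {
    encoder_outputs: tensor([batch, 512, 1], encFrame),
    targets: tensor([batch, 1], Int32Array.from([prevToken])),
    target_length: tensor([batch], Int32Array.from([1])),
    // Zero states for the first step (assumed); thereafter, feed back
    // output_states_1 / output_states_2 from the previous step.
    input_states_1: tensor([1, batch, HIDDEN], new Float32Array(batch * HIDDEN)),
    input_states_2: tensor([1, batch, HIDDEN], new Float32Array(batch * HIDDEN)),
  };
}
```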

Acknowledgments

  • Original model by NVIDIA NeMo
  • transformers.js Nemo Conformer TDT/RNNT support by @ysdede