Whisper-WebUI

Upload File here

Input Folder Path (Optional)

Optional: Specify the folder path where the input files are located, if you prefer to use local files instead of uploading them. Leave this field empty if you do not wish to use a local path.

Include Subdirectory Files

When using the Input Folder Path above, also include files found in its subdirectories.

Save outputs at same directory

When using the Input Folder Path above, also save outputs in the same directory as the inputs, in addition to the default output directory.
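The folder options above amount to a file-collection step. The sketch below is a hypothetical illustration, not the WebUI's own code: `collect_inputs` and the extension list are assumed names, and it only shows how a recursive flag changes which files get picked up.

```python
from pathlib import Path

# Hypothetical set of extensions; the real application may accept others.
MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".mp4", ".mkv"}


def collect_inputs(folder, include_subdirectories=False):
    """Return media files under `folder`, optionally descending into subfolders."""
    root = Path(folder)
    # "*" matches only direct children; "**/*" also walks every subdirectory.
    pattern = "**/*" if include_subdirectories else "*"
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )
```

With the flag off, only files directly inside the folder are returned; with it on, files in nested folders are included as well.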

Model

Language

File Format

Translate to English?

Add a timestamp to the end of the filename

Beam Size

Beam size for decoding

Log Probability Threshold

Threshold for average log probability of sampled tokens

No Speech Threshold

Threshold for detecting silence
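In the reference Whisper decoding loop, the no-speech and log-probability thresholds work together: a window is treated as silent only if the no-speech probability is high and the decoded text is not confident enough to override it. The sketch below assumes this backend behaves the same way. `should_skip_segment` is a hypothetical helper, and the default values shown mirror the reference implementation, not necessarily this UI.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, log_prob_threshold=-1.0):
    """Decide whether a decoded window should be dropped as silence."""
    # Start from the silence detector's verdict.
    skip = no_speech_prob > no_speech_threshold
    # A sufficiently confident transcription overrides the silence verdict.
    if log_prob_threshold is not None and avg_logprob > log_prob_threshold:
        skip = False
    return skip
```

Raising the No Speech Threshold makes the model less willing to discard a window as silence; raising the Log Probability Threshold makes it harder for low-confidence text to rescue such a window.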

Compute Type

Computation type for transcription

Best Of

Number of candidates when sampling

Patience

Beam search patience factor

Condition On Previous Text

Use previous output as prompt for next window

Prompt Reset On Temperature

Temperature threshold for resetting prompt

Range: 0 to 1

Initial Prompt

Initial prompt for first window

Temperature

Temperature for sampling

Range: 0 to 1
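As a rough illustration of what the sampling temperature does, the sketch below rescales token scores before turning them into probabilities. It is a generic softmax example, not code from the WebUI; `softmax_with_temperature` is a hypothetical name, and treating temperature 0 as greedy selection is the usual convention assumed here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across alternatives and make sampled output more varied.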

Compression Ratio Threshold

Threshold for gzip compression ratio
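The compression ratio is a cheap check for repetitive output: text that loops on the same phrase compresses far better than normal speech. The reference Whisper implementation computes it with zlib as shown below; this sketch assumes the backend used here does something equivalent. The 2.4 used in the comparison is the reference default and is given only as an example value.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; higher means more repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# Example value only; the threshold configured in the UI may differ.
EXAMPLE_THRESHOLD = 2.4
looping = "thank you " * 100
print(compression_ratio(looping) > EXAMPLE_THRESHOLD)
```

A segment whose ratio exceeds the threshold is typically treated as a failed decode and retried at a higher temperature.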

Length Penalty

Exponential length penalty

Repetition Penalty

Penalty for repeated tokens

No Repeat N-gram Size

Size of n-grams to prevent repetition
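To make the n-gram setting concrete, the sketch below lists which next tokens would be forbidden because they would repeat an n-gram already present in the output. It is a generic illustration of the technique with a hypothetical helper name, `banned_next_tokens`, and is not taken from the WebUI or its backend.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return tokens that would complete an n-gram already seen in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size - 1:
        return set()  # a size of 0 disables the check
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned
```

For example, with the token history 1, 2, 3, 1, 2 and a size of 3, token 3 is banned because emitting it would repeat the trigram 1, 2, 3.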

Prefix

Prefix text for first window

Suppress Blank

Suppress blank outputs at start of sampling

Suppress Tokens

Token IDs to suppress

Max Initial Timestamp

Maximum initial timestamp

Word Timestamps

Extract word-level timestamps

Prepend Punctuations

Punctuations to merge with next word

Append Punctuations

Punctuations to merge with previous word

Max New Tokens

Maximum number of new tokens per chunk

Chunk Length (s)

Length of audio segments in seconds

Hallucination Silence Threshold (sec)

Threshold for skipping silent periods in hallucination detection

Hotwords

Hotwords/hint phrases for the model

Language Detection Threshold

Threshold for language detection probability

Language Detection Segments

Number of segments for language detection

Batch Size

Batch size for processing

Enable Background Music Remover Filter

Enabling this will remove background music

Model

Device

Segment Size

Segment size for UVR model

Save separated files to output

Offload sub model after removing background music

Enable Silero VAD Filter

Enable this to transcribe only detected voice

Speech Threshold

Lower this value to make detection more sensitive to quiet sounds.

Range: 0 to 1

Minimum Speech Duration (ms)

Final speech chunks shorter than this are discarded

Maximum Speech Duration (s)

Maximum duration of a speech chunk, in seconds

Minimum Silence Duration (ms)

At the end of each speech chunk, wait this long before splitting it off

Speech Padding (ms)

Final speech chunks are padded by this amount on each side
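The VAD settings above interact, so a small worked sketch may help. The function below is a simplified stand-in, not Silero VAD's actual algorithm: it takes per-frame speech probabilities, and `speech_chunks`, `frame_ms`, and the default values are assumptions made for illustration. It ignores the maximum speech duration and does not merge chunks that overlap after padding.

```python
def speech_chunks(probs, frame_ms=32, threshold=0.5,
                  min_speech_duration_ms=250, min_silence_duration_ms=2000,
                  speech_pad_ms=400):
    """Turn per-frame speech probabilities into padded (start_ms, end_ms) chunks."""
    chunks = []
    start = None          # start time (ms) of the chunk being built
    silence_start = None  # time (ms) where the current silent run began
    for i, p in enumerate(probs):
        t = i * frame_ms
        if p >= threshold:
            silence_start = None
            if start is None:
                start = t
        elif start is not None:
            if silence_start is None:
                silence_start = t
            # Close the chunk only after enough continuous silence.
            if t + frame_ms - silence_start >= min_silence_duration_ms:
                chunks.append((start, silence_start))
                start, silence_start = None, None
    total = len(probs) * frame_ms
    if start is not None:
        chunks.append((start, silence_start if silence_start is not None else total))
    # Drop chunks that are too short, then pad the survivors on each side.
    kept = [(s, e) for s, e in chunks if e - s >= min_speech_duration_ms]
    return [(max(0, s - speech_pad_ms), min(total, e + speech_pad_ms)) for s, e in kept]
```

In this sketch, a one-second burst of speech followed by a long pause becomes one padded chunk, while an isolated 100 ms blip is discarded by the minimum speech duration.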

Enable Diarization

Device

HuggingFace Token

This is only needed the first time you download the model

Output

Downloadable output file

YouTube Link

YouTube Thumbnail

YouTube Title

YouTube Description

(Same model, decoding, background music remover, Silero VAD, and diarization settings as listed above for file upload.)

Output

Downloadable output file

Record with Mic

(Same model, decoding, background music remover, Silero VAD, and diarization settings as listed above for file upload.)

Output

Downloadable output file

Upload Subtitle Files to translate here

Your Auth Key (API KEY)

Source Language

Target Language

Pro User?

Add a timestamp to the end of the filename

Output

Downloadable output file

Model

Source Language

Target Language

Max Length Per Line

Add a timestamp to the end of the filename

Output

Downloadable output file

VRAM usage for each model

Model name	Required VRAM
nllb-200-3.3B	~16GB
nllb-200-1.3B	~8GB
nllb-200-distilled-600M	~4GB

Note: Be mindful of your VRAM! The table above gives the approximate VRAM usage of each model.
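One way to use the table is to pick the largest model that fits the available memory. The sketch below copies the approximate figures from the table; `largest_model_that_fits` is a hypothetical helper, and real usage also depends on batch size and whatever else is loaded on the GPU.

```python
# Approximate figures copied from the table above, in GB.
NLLB_VRAM_GB = {
    "nllb-200-3.3B": 16,
    "nllb-200-1.3B": 8,
    "nllb-200-distilled-600M": 4,
}


def largest_model_that_fits(available_gb):
    """Return the most demanding model whose approximate VRAM need fits, or None."""
    candidates = [(gb, name) for name, gb in NLLB_VRAM_GB.items() if gb <= available_gb]
    return max(candidates)[1] if candidates else None
```

For instance, a card with about 10 GB free would select nllb-200-1.3B under these figures, and a card with less than 4 GB would get no match.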

Upload Audio Files to separate background music

Device

Model

Segment Size

Save separated files to output

Instrumental

Vocals
