How to understand the special tokens?

I’m very confused about special tokens. For example, I loaded a tokenizer and used tokenizer.all_special_tokens to check, and I got:

['<|im_end|>',
 '<|vision_pad|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|image_pad|>',
 '<|video_pad|>']

But when I check tokenizer.json and tokenizer_config.json, I find that they both contain the same added_tokens list:

151643 <|endoftext|> true
151644 <|im_start|> true
151645 <|im_end|> true
151646 <|object_ref_start|> true
151647 <|object_ref_end|> true
151648 <|box_start|> true
151649 <|box_end|> true
151650 <|quad_start|> true
151651 <|quad_end|> true
151652 <|vision_start|> true
151653 <|vision_end|> true
151654 <|vision_pad|> true
151655 <|image_pad|> true
151656 <|video_pad|> true
151657 <tool_call> false
151658 </tool_call> false
151659 <|fim_prefix|> false
151660 <|fim_middle|> false
151661 <|fim_suffix|> false
151662 <|fim_pad|> false
151663 <|repo_name|> false
151664 <|file_sep|> false
151665 <tool_response> false
151666 </tool_response> false
151667 <think> false
151668 </think> false

The result of all_special_tokens is different from the entire added_tokens list, and also different from the true subset of it.
So what does the added_tokens list mean?
And what is the difference between how tokenizer.json and tokenizer_config.json are used?
Below is the rest of tokenizer_config.json; tokenizer.json is too long to paste because of the vocab:

{
  "add_bos_token": false,
  "add_prefix_space": false,
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 40960,
  "pad_token": "<|vision_pad|>",
  "padding_side": "left",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}

That’s certainly confusing



You can understand what you’re seeing if you separate three different layers:

  1. The model’s vocabulary (all token IDs, including your added_tokens list).
  2. The low-level fast tokenizer (Rust tokenizers library; this is what tokenizer.json stores).
  3. The high-level Transformers wrapper (Python PreTrainedTokenizerFast / SpecialTokensMixin; this is driven by tokenizer_config.json and special_tokens_map.json and is what all_special_tokens comes from). (Hugging Face)

Your confusion is exactly because (2) and (3) use the word “special” differently.


1. First: what is a “special token” conceptually?

At the model/training level there are two broad kinds of tokens:

  • Normal (regular) tokens – subword pieces of natural language (“Hello”, “ing”, etc.).

  • Control / format tokens – tokens with special meaning in the training data, such as:

    • <|im_start|>, <|im_end|> – chat message boundaries.
    • <|vision_start|>, <|vision_end|>, <|vision_pad|> – multimodal boundaries/padding.
    • <tool_call>, </tool_call>, <tool_response>, </tool_response> – function-calling tags.
    • <think>, </think> – reasoning spans.
    • <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> – fill-in-the-middle tokens.
    • <|endoftext|> – end-of-document token used in pretraining.

Qwen’s docs call these “control tokens”: tokens that represent special functionality rather than natural language itself. (Qwen)

From the model’s point of view, all of these are just token IDs. “Specialness” is about how the tokenizer and high-level library treat them.


2. What your added_tokens list in tokenizer.json actually is

The tokenizer.json file is the serialized fast tokenizer from the tokenizers library. It contains: vocabulary, merges, pre/post-processing, plus a list called added_tokens. (Hugging Face)

Your added_tokens snippet:

151643 <|endoftext|> true
151644 <|im_start|> true
151645 <|im_end|> true
151646 <|object_ref_start|> true
151647 <|object_ref_end|> true
151648 <|box_start|> true
151649 <|box_end|> true
151650 <|quad_start|> true
151651 <|quad_end|> true
151652 <|vision_start|> true
151653 <|vision_end|> true
151654 <|vision_pad|> true
151655 <|image_pad|> true
151656 <|video_pad|> true
151657 <tool_call> false
151658 </tool_call> false
151659 <|fim_prefix|> false
151660 <|fim_middle|> false
151661 <|fim_suffix|> false
151662 <|fim_pad|> false
151663 <|repo_name|> false
151664 <|file_sep|> false
151665 <tool_response> false
151666 </tool_response> false
151667 <think> false
151668 </think> false

Here:

  • The first column is the token ID.
  • The middle column is the string form of the token.
  • The last true/false is the Rust-tokenizer-level special flag. (paddlenlp.readthedocs.io)

What that flag does in the fast tokenizer:

  • special = true

    • The token is treated as an indivisible “added token”.
    • The pre-tokenizer will not split it into smaller pieces.
    • When you decode with skip_special_tokens=True, these tokens will be removed. (Hugging Face)
  • special = false

    • The token is just an extra vocab token. It may still be one piece, but it does not get special handling in the tokenizer’s decode / skip logic.

So:

What does the added_tokens list mean?
It is “all vocabulary items that were added on top of the base vocab”, along with a low-level special flag that controls how the fast tokenizer tokenizes/decodes them.

It is not “the list of all special tokens from Transformers’ point of view”.

You can see this design in the Transformers code: higher-level add_special_tokens() calls down into the fast tokenizer and creates AddedToken objects with special=True, but there can also be added tokens that are not special. (gemfury.com)
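
You can inspect that low-level flag directly from Python. A minimal sketch, assuming a recent transformers version (for added_tokens_decoder) and using "your/model-or-path" as a placeholder:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your/model-or-path")

# added_tokens_decoder mirrors the added_tokens list in tokenizer.json,
# including each AddedToken's low-level `special` flag
for token_id, added in sorted(tok.added_tokens_decoder.items()):
    print(token_id, added.content, added.special)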


3. What tokenizer_config.json is doing

tokenizer_config.json is a wrapper configuration used by the Python transformers library. It does not contain the full vocab; it tells AutoTokenizer:

  • Which tokenizer class to instantiate ("tokenizer_class": "Qwen2Tokenizer").

  • Which tokens are:

    • bos_token, eos_token, pad_token, unk_token, etc.
    • additional_special_tokens (custom special tokens).
  • Behavior flags like model_max_length, padding_side, add_prefix_space, etc. (Hugging Face)

Your tokenizer_config.json says:

"eos_token": "<|im_end|>",
"pad_token": "<|vision_pad|>",
"additional_special_tokens": [
  "<|im_start|>",
  "<|im_end|>",
  "<|object_ref_start|>",
  "<|object_ref_end|>",
  "<|box_start|>",
  "<|box_end|>",
  "<|quad_start|>",
  "<|quad_end|>",
  "<|vision_start|>",
  "<|vision_end|>",
  "<|vision_pad|>",
  "<|image_pad|>",
  "<|video_pad|>"
]

So from Transformers’ perspective:

  • EOS = <|im_end|>
  • PAD = <|vision_pad|>
  • And these 13 tokens are “additional special tokens”.

This information is also mirrored in special_tokens_map.json for many models, and both files are loaded by AutoTokenizer. (Hugging Face)
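
These config entries surface directly as attributes on the loaded tokenizer. A quick check, reusing the tok loaded above (the IDs in the comments are the ones from your added_tokens list):

print(tok.eos_token, tok.eos_token_id)    # '<|im_end|>', 151645
print(tok.pad_token, tok.pad_token_id)    # '<|vision_pad|>', 151654
print(tok.additional_special_tokens)      # the 13 chat/vision tokens
print(tok.special_tokens_map)             # name -> token string mapping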


4. How tokenizer.all_special_tokens is computed

In the Transformers Python code, the SpecialTokensMixin class holds all the special-token attributes and exposes properties like all_special_tokens and all_special_ids. (Hugging Face)

Conceptually it does something like:

specials = []
for v in tokenizer.special_tokens_map_extended.values():
    if isinstance(v, list):
        specials.extend(v)
    else:
        specials.append(v)

# deduplicate while preserving order
all_special_tokens = list(dict.fromkeys(specials))

Where special_tokens_map_extended is built from:

  • bos_token, eos_token, pad_token, unk_token, etc.
  • additional_special_tokens (and sometimes their legacy variants). (Hugging Face)

Crucially:

all_special_tokens never looks at the raw added_tokens list in tokenizer.json.
It only looks at named special tokens (bos_token, eos_token, pad_token, etc.) and additional_special_tokens stored in the config.

That is exactly why your all_special_tokens output is:

[
 '<|im_end|>',
 '<|vision_pad|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|image_pad|>',
 '<|video_pad|>',
]

This is just:

  • eos_token (<|im_end|>)
  • pad_token (<|vision_pad|>)
  • plus everything in additional_special_tokens (deduplicated).

Notice:

  • <|endoftext|> is not in additional_special_tokens and is not declared as EOS in tokenizer_config.json.
  • Tool / FIM / <think> tokens are also not in additional_special_tokens and have special=false at the tokenizer level.

Therefore they do not appear in all_special_tokens. This is normal and also shows up in other models (e.g. LLaVA’s <image> token sometimes appears in added_tokens but not in all_special_tokens unless it was wired into additional_special_tokens). (Hugging Face Forums)

So:

Why is all_special_tokens different from the added_tokens list and from the true subset of it?
Because all_special_tokens is a higher-level view built from tokenizer_config.json (special-tokens map and additional_special_tokens), while added_tokens is the raw vocabulary list (with a low-level special flag). They are related but intentionally not the same set.
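
You can verify the two views side by side. A small sketch, again assuming tok is your loaded tokenizer:

added = tok.get_added_vocab()           # low-level: every added token (string -> id)
high = set(tok.all_special_tokens)      # high-level: built from the config

print("<|endoftext|>" in added, "<|endoftext|>" in high)   # True, False
print("<tool_call>" in added, "<tool_call>" in high)       # True, False
print("<|im_end|>" in added, "<|im_end|>" in high)         # True, True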


5. Relationship between the three things you see

Let’s put your exact objects side-by-side.

5.1. added_tokens (fast tokenizer, low-level)

  • Contains all tokens that were added after the base vocab, including:

    • Qwen control tokens: <|endoftext|>, <|im_start|>, <|im_end|>, <|vision_*|>, etc.
    • Tool tokens: <tool_call>, <tool_response>, <think>, etc.
    • FIM / repo tokens: <|fim_*|>, <|repo_name|>, <|file_sep|>.
  • The trailing true/false is the Rust-layer “special” flag for tokenization behavior.

5.2. tokenizer_config.json (Transformers wrapper, high-level)

Defines:

  • eos_token = "<|im_end|>"
  • pad_token = "<|vision_pad|>"
  • additional_special_tokens = the 13 multimodal/chat tokens.

These become:

  • tokenizer.eos_token, tokenizer.pad_token
  • tokenizer.additional_special_tokens

and then feed into:

  • tokenizer.all_special_tokens
  • tokenizer.all_special_ids

via SpecialTokensMixin. (Hugging Face)

5.3. tokenizer.all_special_tokens (Python view)

  • Computed from special_tokens_map / special_tokens_map_extended (EOS, PAD, additional specials, etc.), not from the raw added_tokens list.

Hence you only see:

  • <|im_end|>
  • <|vision_pad|>
  • and the 11 other additional special tokens.

<|endoftext|> and <tool_call> are not in that config, so they don’t appear even though they exist in added_tokens.


6. Difference in roles: tokenizer.json vs tokenizer_config.json

You can think of it like this:

6.1 tokenizer.json = “how to actually tokenize text”

  • Full definition of the fast tokenizer:

    • Vocabulary and merges (BPE/Unigram/etc.).
    • Normalizer, pre-tokenizer, post-processor.
    • added_tokens and their low-level special flag. (Hugging Face)
  • Used by anything that needs the exact same tokenization behavior:

    • PreTrainedTokenizerFast in Python.
    • transformers.js in JavaScript. (Hugging Face)
    • Inference frameworks that load HF tokenizers directly (vLLM, etc.).

If you change this file, you are changing how raw text is split into IDs.
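
For example, you can load tokenizer.json on its own with the tokenizers library, completely bypassing the Transformers wrapper (a sketch; the path is a placeholder):

from tokenizers import Tokenizer

raw = Tokenizer.from_file("path/to/tokenizer.json")   # only the Rust tokenizer, no config metadata
enc = raw.encode("<|im_start|>user\nhi<|im_end|>")
print(enc.ids)
print(enc.tokens)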

6.2 tokenizer_config.json = “how Transformers should treat this tokenizer”

  • A small JSON that tells Transformers:

    • Which tokenizer class to use (Qwen2Tokenizer).
    • Which tokens are EOS, PAD, BOS, etc.
    • Which tokens are additional_special_tokens.
    • Max length, padding side, whether to add BOS by default, etc. (Hugging Face)
  • Also now often stores the chat template (chat_template), which apply_chat_template uses to turn messages into a prompt string.

If you change this file, you are changing metadata and behavior inside Transformers, not the raw tokenization algorithm.

6.3 Other ancillary files

Many HF model repos also contain:

  • special_tokens_map.json – basically the same info as the special_tokens_map attribute: mapping from names (eos_token, pad_token, additional_special_tokens) to actual strings. (Hugging Face)
  • added_tokens.json – a separate, simpler listing of added tokens (often derived from tokenizer.json).
  • config.json / generation_config.json – model config and default generation parameters, including eos_token_id, pad_token_id which must be consistent with the tokenizer side. (Hugging Face)

When these files get out of sync (e.g. EOS ID in config.json vs EOS string in tokenizer_config.json vs tokenizer.json contents), you get classic bugs: generation not stopping, NaNs during training, etc. There are real Qwen bugs like this discussed in the wild.


7. How to mentally understand special tokens in practice

A practical mental model that matches what you see:

  1. Vocabulary-level view (tokenizer.json / added_tokens)

    • “Which strings exist as single tokens?”
    • “Does the fast tokenizer treat them as special (never split, removable on decode)?”
  2. Transformers-level view (tokenizer_config.json / special_tokens_map.json)

    • “Which tokens does Transformers treat as EOS/PAD/BOS/CLS/SEP?”
    • “Which tokens are additional special tokens (additional_special_tokens)?”
    • This drives all_special_tokens, all_special_ids, skip_special_tokens=True, etc. (Hugging Face)
  3. Model/training-level view (chat template, data format)

    • “Which control tokens actually appear in the training data, and what do they mean?”

    • Qwen-style control tokens:

      • <|im_start|>, <|im_end|> – chat roles.
      • <|vision_*|>, <|image_pad|>, <|video_pad|> – multimodal.
      • <tool_call>, <tool_response>, <think> – tool + reasoning. (Qwen)

These three layers do not have to use the same subset of tokens, but they must be coherent for your use case.

For your specific tokenizer:

  • tokenizer.json lists all of those control tokens in added_tokens.
  • tokenizer_config.json chooses a subset as EOS / PAD / additional special tokens (mostly chat + vision).
  • tokenizer.all_special_tokens is the union of EOS/PAD plus additional_special_tokens, hence the 13-token list you’re seeing.

8. Summary

  • added_tokens list

    • All tokens added on top of the base vocab, with a low-level special flag used by the fast tokenizer (true = never split, removable on decode).
    • Contains many control tags (<tool_call>, <think>, FIM tokens, etc.) even if Transformers doesn’t treat them as “special”.
  • tokenizer.json

    • Full fast tokenizer definition (vocab, merges, normalizer, added_tokens).
    • Controls how text is split into tokens.
  • tokenizer_config.json

    • High-level wrapper config for Transformers.
    • Declares which tokens are EOS/PAD/etc and which are additional_special_tokens.
    • Controls what becomes tokenizer.eos_token, tokenizer.pad_token, tokenizer.additional_special_tokens, and ultimately tokenizer.all_special_tokens.
  • tokenizer.all_special_tokens

    • Computed from the special tokens map (EOS/PAD/BOS/UNK/etc.) plus additional_special_tokens.

    • Does not read the full added_tokens list, so it is normal and expected that:

      • all_special_tokens ⊂ added_tokens, and
      • It may differ from the subset of added_tokens with special=true.

9. Good reference links (clickable via citations)

A few high-quality references you can read in full:

  • Hugging Face docs – Tokenizer (high-level) and Fast tokenizers (low-level internals). (Hugging Face)
  • Transformers internals – PreTrainedTokenizerBase / SpecialTokensMixin (how special tokens and all_special_tokens are implemented). (Hugging Face)
  • Qwen docs – Key Concepts (explains regular vs control tokens in the Qwen family). (Qwen)
  • HF forum thread – “additional_special_tokens are not added” (LLaVA <image> token missing from all_special_tokens, same pattern as your issue). (Hugging Face Forums)
  • Example tokenizer configs – Qwen2-VL tokenizer_config.json (shows how Qwen actually declares EOS/PAD and additional special tokens). (Hugging Face)

So all_special_tokens = EOS + PAD + BOS + UNK + additional_special_tokens?
But why do we need all_special_tokens, since there is already an entire added_tokens list that lists all the special tokens, and the true part of it is used by skip_special_tokens=True?


Yeah. It’s incredibly confusing, but to put it simply, it’s complicated due to implementation reasons



Your summary

all_special_tokens ≈ {EOS, PAD, BOS, UNK, …} ∪ additional_special_tokens

is basically correct (plus other named ones like cls_token, sep_token, mask_token etc.). (Hugging Face)

The key point is:

  • added_tokens + special: true is a low-level, tokenizer-internal concept (from the Rust tokenizers library).
  • all_special_tokens is a high-level, library-level concept (from Transformers’ SpecialTokensMixin) used to give you a model-agnostic list of “semantically special” tokens.

They overlap but solve different problems.

Below I’ll go step by step and then answer “why do we need all_special_tokens if skip_special_tokens=True already uses low-level special flags?”


1. Two different layers of “special”

1.1. Low-level: fast tokenizer (tokenizer.json, added_tokens)

In the Rust tokenizers library, you have:

  • A vocabulary + merges.
  • A list of AddedTokens (this is what shows up in tokenizer.json under added_tokens).
  • Each AddedToken has a special flag and options like single_word, lstrip, rstrip. (Hugging Face)

The docs for add_special_tokens in tokenizers say: (Hugging Face)

  • “Add the given special tokens to the Tokenizer.
    If they exist, it just lets the tokenizer know about them.
    These special tokens will never be split and can be removed when decoding.”

So, at this layer:

  • special = true → “never split this, and allow the decode logic to drop it when skip_special_tokens=True”.
  • special = false → “just a normal vocab token (even if you think it’s a conceptual marker like <tool_call>)”.

That’s what the true/false in your added_tokens dump is.
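
You can see the two flavors side by side with the tokenizers library itself (a hedged sketch; "<my_marker>" and "<my_eot>" are made-up tokens and the path is a placeholder):

from tokenizers import Tokenizer

raw = Tokenizer.from_file("path/to/tokenizer.json")
raw.add_tokens(["<my_marker>"])          # plain added token: special = false
raw.add_special_tokens(["<my_eot>"])     # added token with special = true
ids = raw.encode("<my_marker> x <my_eot>", add_special_tokens=False).ids
print(raw.decode(ids, skip_special_tokens=True))   # only <my_eot> is dropped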

1.2. High-level: Transformers wrapper (SpecialTokensMixin, all_special_tokens)

The Python transformers library wraps the tokenizer with PreTrainedTokenizer(Fast) and SpecialTokensMixin. That mixin holds attributes like: (Hugging Face)

  • bos_token, eos_token, pad_token, unk_token,
  • sep_token, cls_token, mask_token,
  • additional_special_tokens.

And exposes:

  • tokenizer.special_tokens_map – mapping from attribute names to token strings.
  • tokenizer.all_special_tokens – “all the special tokens mapped to class attributes.” (Hugging Face)

Conceptually:

all_special_tokens = (
    [bos_token, eos_token, pad_token, unk_token, cls_token, sep_token, mask_token, ...]
    + list(additional_special_tokens)
    # deduplicated
)

So:

  • Yes, all_special_tokens is essentially “all special-token attributes + additional_special_tokens”.
  • But this is defined at the Transformers level, not by looking at tokenizer.json.

2. What does skip_special_tokens=True actually use?

When you call:

tokenizer.decode(ids, skip_special_tokens=True)

with a fast tokenizer, that call:

  1. Delegates to the underlying Rust tokenizer’s decode with skip_special_tokens=True.
  2. That underlying implementation checks its internal notion of “special tokens” (the AddedToken objects with special = true). (Hugging Face)

So the mechanics are:

  • skip_special_tokens=True → “drop tokens that the fast tokenizer knows are special (its own special flag).”

This is where your true/false in added_tokens matters.
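
A minimal illustration with your tokenizer (assuming tok is the loaded tokenizer):

ids = tok("<|im_end|>Hello", add_special_tokens=False).input_ids
print(tok.decode(ids, skip_special_tokens=False))   # keeps <|im_end|>
print(tok.decode(ids, skip_special_tokens=True))    # drops it: its AddedToken has special = true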

But that is only part of the story.


3. Why all_special_tokens exists anyway

Now to your actual question:

If skip_special_tokens=True uses the special flag from added_tokens, why do we need all_special_tokens at all?

There are several reasons, and they’re all “library design” reasons:

3.1. Works for all tokenizers, not just fast ones

Historically and still today, Transformers supports:

  • Pure Python “slow” tokenizers (no tokenizer.json, no Rust AddedToken).
  • SentencePiece, WordPiece, BPE implementations with very different internals. (Hugging Face)

To unify them, PreTrainedTokenizerBase + SpecialTokensMixin were introduced:

  • They provide a single API (bos_token, eos_token, all_special_tokens, etc.) that works the same way for BERT, GPT-2, mT5, SentencePiece-based models, etc. (Hugging Face)

For slow tokenizers there is no added_tokens list with a special flag:

  • But all_special_tokens still works, because it is computed from the Python attributes (bos_token, eos_token, pad_token, additional_special_tokens), not from any Rust internals. (Hugging Face)

So we need all_special_tokens as a model-agnostic abstraction.

3.2. Some special tokens are not “added tokens”

Many classic models have their special tokens in the base vocab, not just in added_tokens. For example:

  • In BERT, [CLS], [SEP], [PAD], [UNK], [MASK] are part of the original vocab.
  • In LLaMA-like models, <s> and </s> are core vocab entries. (GitHub)

They may or may not appear in added_tokens, but they are:

  • tokenizer.cls_token
  • tokenizer.sep_token
  • tokenizer.pad_token
  • etc.

all_special_tokens gives you the set of tokens that have been declared as special by the model config, regardless of whether they came from:

  • The base vocab.
  • added_tokens.
  • additional_special_tokens.

If you tried to derive “all special tokens” from added_tokens alone, you would miss many of these.

3.3. Separation of mechanics vs semantics

There is a design separation:

  • Mechanics (fast tokenizer):

    • “Don’t split these tokens; optionally allow decode to drop them.”
    • Controlled by AddedToken.special.
  • Semantics (Transformers):

    • “These tokens are BOS/EOS/PAD/UNK/CLS/SEP/etc; these extra ones are ‘logical’ special tokens for this model.”
    • Controlled by bos_token, eos_token, pad_token, additional_special_tokens.

Why this is useful:

  1. You might have tokens that are “mechanically special” but not semantically special in your training/eval code.
  2. You might have tokens that behave mechanically like regular tokens but you still want to treat them as special in your loss masks, chat templates, etc.

all_special_tokens shows “semantically special” tokens at the Transformers level; AddedToken.special controls only the tokenizer’s mechanical behavior.

A good example is in that HackMD note on adding tool tokens: (HackMD)

  • They add <|func_end|>, <|func_start|> as non-special added tokens (so they behave like normal text).
  • They add <|tool_calls|>, <|eot_id|> as special tokens, so they appear in all_special_tokens and get stripped when skip_special_tokens=True.

This allows them to distinguish:

  • “Markers that I want to see in decoded text” vs
  • “Markers I want to hide in decoded text”.

3.4. High-level utilities use all_special_tokens, not added_tokens

A lot of high-level code in Transformers and downstream libraries uses all_special_tokens or special_tokens_map, for example: (Hugging Face)

  • get_special_tokens_mask – build a mask indicating where special tokens are in a sequence (used in loss masking, training, etc.).
  • prepare_for_model / build_inputs_with_special_tokens – insert [CLS], [SEP], EOS, etc. in the right positions.
  • Training libraries (TRL, PEFT, etc.) – to exclude special tokens from losses or to construct prompts around them.
  • Custom scripts – to filter out special tokens from text, or to identify where the user/assistant turns start and end.

Those functions are meant to be tokenizer-agnostic and model-agnostic, so they operate on all_special_tokens and special_tokens_map, not on the internal added_tokens structure which can differ across implementations and may even not exist (slow tokenizers).
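
For example, get_special_tokens_mask works off all_special_ids, not the raw added_tokens list. A small sketch with a Qwen-style chat snippet (tok being your loaded tokenizer):

ids = tok("<|im_start|>user\nhi<|im_end|>", add_special_tokens=False).input_ids
mask = tok.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(list(zip(tok.convert_ids_to_tokens(ids), mask)))   # 1 marks a token in all_special_ids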

3.5. You sometimes don’t want to skip all tokenizer-level specials

Consider your Qwen-like tokenizer:

151643 <|endoftext|> true
151644 <|im_start|> true
151645 <|im_end|> true
...
151657 <tool_call> false
151667 <think> false
...

You might want:

  • EOS/PAD (<|im_end|>, <|vision_pad|>, maybe <|endoftext|>) to be dropped when decoding.

  • But "<tool_call>", "<tool_response>", "<think>" to remain in decoded text, because:

    • You want to parse them later.
    • You want to log them.
    • You want a tool-calling framework to read them.

If you simply said “all tokens with special=true are special everywhere”, you’d lose that fine control.

Instead:

  • At the fast-tokenizer level, you can set special=false for <tool_call> etc., so skip_special_tokens=True won’t drop them.
  • At the Transformers level, you only put the tokens you really want to be treated as “logical special tokens” (EOS, PAD, BOS, etc.) into bos_token, eos_token, and additional_special_tokens.

Then:

  • tokenizer.all_special_tokens gives you just that curated set.
  • tokenizer.decode(..., skip_special_tokens=True) drops only those.

This is exactly what that HackMD article demonstrates when they compare tokenizer.json’s added_tokens vs all_special_tokens. (HackMD)


4. Your concrete case in this framework

In your tokenizer:

  • added_tokens includes many tokens (EOS, chat markers, vision markers, tool tokens, FIM tokens, <think>).

  • tokenizer_config.json declares only:

    • eos_token = "<|im_end|>",
    • pad_token = "<|vision_pad|>",
    • and those 13 tokens in additional_special_tokens.

So:

  • Transformers semantics (all_special_tokens) = EOS + PAD + those 13 additional special tokens (after dedup).
  • Tokenizer mechanics (added_tokens.special) = a larger set; this is what the fast tokenizer can use when skip_special_tokens=True internally.

You could, in principle, mark <tool_call> as special=true in tokenizer.json but not list it in additional_special_tokens and not assign it to any special-token attribute in tokenizer_config.json. Then:

  • It would still be protected from splitting at the tokenizer level.
  • It would not appear in all_special_tokens, so HF’s high-level code would not treat it as “semantically special” unless you choose to.

This is the flexibility you gain by having both mechanisms.


5. Short answer to your “why” question

Why do we need all_special_tokens if we already have an added_tokens list and the true ones are skipped by skip_special_tokens=True?

Because they solve different problems:

  • added_tokens + special=true (fast tokenizer)

    • Low-level: control splitting and raw decoding.
    • Used by the Rust tokenizer itself.
    • May exist or behave differently depending on whether the tokenizer is fast, slow, SentencePiece, etc. (Hugging Face)
  • all_special_tokens (Transformers SpecialTokensMixin)

    • High-level, model-agnostic list of all tokens that the library considers “special”: EOS, PAD, BOS, UNK, CLS, SEP, MASK, and additional_special_tokens. (Hugging Face)
    • Works uniformly for fast and slow tokenizers.
    • Drives higher-level utilities (masking, building inputs with special tokens, loss filtering, chat templating).
    • Lets you choose a semantic subset of tokens to treat as special, independent of how the underlying tokenizer stores them.

So:

  • Think of AddedToken.special as “special for tokenization mechanics”.
  • Think of all_special_tokens as “special for the Transformers ecosystem and your training/generation code”.

They intersect, but they are intentionally not the same thing.


Useful references

  • Hugging Face Transformers docs – Tokenizer (see “Add special tokens” section, and where add_special_tokens is described). (Hugging Face)
  • Hugging Face Tokenizers docs – add_special_tokens and AddedToken behavior (never split, removable on decode). (Hugging Face)
  • Transformers internals – SpecialTokensMixin / PreTrainedTokenizerBase (definition of all_special_tokens, all_special_ids, special_tokens_map). (Hugging Face)
  • HackMD note on adding tool tokens – shows how tokenizer.json’s added_tokens and all_special_tokens differ and why. (HackMD)

I made some tests. It seems that when using decode(..., skip_special_tokens=True), it skips all tokens in the union of the true part of added_tokens and tokenizer.all_special_tokens. Is that true?
Also, all_special_tokens comes directly from tokenizer_config.json: even if I remove that file (and also remove special_tokens_map.json), skip_special_tokens still works, but I can no longer get a well-defined all_special_tokens, and that affects other code which depends on all_special_tokens.
I want to add some custom special tokens that won’t be removed when decoding, like <think> and <tool_response>. Do I only need to modify tokenizer.json?
After adding the custom special tokens, I’ll use Unsloth to finetune, which depends on Transformers.
I use vLLM for inference after finetuning, and I don’t know which one it uses, either.


This is the kind of thing that’s bound to cause confusion, but changing the library spec later would be even more trouble, so I guess it can’t be helped
:sweat_smile:


Short version of the answers first, then details:

  1. No: decode(..., skip_special_tokens=True) does not literally use “union of (added_tokens.special == true) and all_special_tokens”.
    It uses the fast tokenizer’s internal “special” set, which is stored in tokenizer.json. all_special_tokens is just a Python-level view built from config. (Hugging Face)

  2. When you delete tokenizer_config.json / special_tokens_map.json, skip_special_tokens still works because the fast tokenizer already knows which IDs are special (from tokenizer.json). But all_special_tokens becomes incomplete or empty, because Transformers no longer has high-level metadata (EOS/PAD/extra specials). (NVIDIA Docs)

  3. To add <think>, <tool_response> that are not removed by skip_special_tokens=True, you should add them as normal tokens, not special tokens:

    • Use tokenizer.add_tokens([...]), not add_special_tokens.
    • Do not put them into additional_special_tokens, and do not assign them as eos_token, pad_token, etc. (Hugging Face)
  4. For Unsloth (training) and vLLM (inference), the key point is: both ultimately rely on the same Hugging Face tokenizer artifacts, so as long as <think> is not marked special in the tokenizer, both will treat it as a normal token and won’t drop it when skipping specials. (Hugging Face)

Now in detail.


1. What actually drives skip_special_tokens=True?

For a fast tokenizer (PreTrainedTokenizerFast), decode is implemented essentially as:

# simplified sketch of PreTrainedTokenizerFast._decode (tokenization_utils_fast.py)
def decode(self, token_ids, skip_special_tokens=False, **kwargs):
    # self._tokenizer is the underlying Rust tokenizer loaded from tokenizer.json
    return self._tokenizer.decode(
        token_ids,
        skip_special_tokens=skip_special_tokens,
    )

(GitHub)

Here self._tokenizer is the Rust tokenizers tokenizer, as serialized in tokenizer.json.

In the Rust / tokenizers docs, add_special_tokens is described like this: (Hugging Face)

These special tokens will never be processed by the model (i.e. won’t be split), and they can be removed from the output when decoding.

So the fast tokenizer keeps an internal list of “special IDs”. When you call:

tokenizer.decode(ids, skip_special_tokens=True)

it simply removes any IDs that are in that internal list.

That list is filled in when:

  • Special tokens are registered via Tokenizer.add_special_tokens(...) on the Rust side, or
  • Loaded from tokenizer.json where the tokens were already saved as special.

Transformers’ add_special_tokens() is just a Python wrapper around this, plus some extra bookkeeping. From the Transformers docs: (Hugging Face)

Using add_special_tokens will ensure your special tokens can be used in several ways:

  • Special tokens can be skipped when decoding using skip_special_tokens=True.
  • Special tokens are carefully handled by the tokenizer (they are never split).
  • You can easily refer to them via attributes like tokenizer.cls_token.

So:

  • Ground truth for “skipped by decode” = “is special in the fast tokenizer”, not all_special_tokens.
  • all_special_tokens is derived from the special-token attributes, not the other way round.

2. So why did your tests look like “union(true-flags, all_special_tokens)”?

Because in a healthy HF setup:

  • Every token that the Python side declares special (EOS, PAD, additional_special_tokens, etc.) is pushed into the fast tokenizer’s special set via add_special_tokens.
  • When you then save the tokenizer, that special info is stored into tokenizer.json ("special": true in added_tokens). (Hugging Face)

That means, in the typical case:

  • all_special_tokens (Python)
  • ≈ tokens with special=true in tokenizer.json (Rust layer)

So if you inspect both lists, you’ll see that decode is skipping everything that appears in either, and it’s natural to guess “union of both”. But that’s really:

  • One underlying list in the Rust tokenizer,

  • With two different views of it:

    • tokenizer.json → AddedToken.special,
    • tokenizer.all_special_tokens / special_tokens_map → Python’s semantic mapping (EOS, PAD, etc.).

Your experiment where you removed tokenizer_config.json and special_tokens_map.json proves this:

  • skip_special_tokens=True still skips the same IDs (because the Rust tokenizer still has them flagged as special in tokenizer.json).
  • But tokenizer.all_special_tokens becomes empty or incomplete, because Transformers no longer has the config that tells it which tokens are EOS/PAD/extra specials. (NVIDIA Docs)

So:

The real skip set is the fast tokenizer’s special IDs.
all_special_tokens is a high-level reflection, not part of the decoding logic.
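
You can reproduce this without touching your files by constructing a tokenizer from tokenizer.json alone (a sketch; the path is a placeholder):

from transformers import PreTrainedTokenizerFast

bare = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")
print(bare.eos_token, bare.all_special_tokens)       # None and [] (no high-level metadata at all)
ids = bare("<|im_end|>Hello", add_special_tokens=False).input_ids
print(bare.decode(ids, skip_special_tokens=True))    # <|im_end|> is still dropped: the Rust flag survives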


3. What happens if tokenizer_config.json is missing?

Files:

  • tokenizer.json – full fast tokenizer (vocab, merges, added tokens and their special flags).
  • tokenizer_config.json / special_tokens_map.json – describe which tokens are EOS/PAD/BOS/UNK/etc., plus additional_special_tokens. (NVIDIA Docs)

When you delete tokenizer_config.json and special_tokens_map.json:

  • AutoTokenizer.from_pretrained() can still load the fast tokenizer from tokenizer.json.

  • The Rust tokenizer still knows which IDs are special → skip_special_tokens=True still removes them.

  • But Transformers doesn’t know which strings are EOS, PAD, etc., so:

    • tokenizer.eos_token / pad_token may be None or wrong.
    • tokenizer.additional_special_tokens may be empty.
    • tokenizer.all_special_tokens becomes incomplete or broken.

This is exactly what you saw:

  • decode(..., skip_special_tokens=True) continues working.
  • all_special_tokens is no longer “well-defined” and code that depends on it breaks.

So you are correct: all_special_tokens depends on tokenizer_config.json + special_tokens_map.json; skip_special_tokens does not. (Hugging Face)


4. How to add <think> / <tool_response> so they are not removed?

Your goal:

  • Add control tokens like <think>, <tool_response> for training and inference.
  • They should be tokens of their own (not split into pieces).
  • They should not be removed by decode(..., skip_special_tokens=True).

That means:

  1. They must be in the vocabulary.
  2. They must not be marked as special for the fast tokenizer.
  3. They must not be declared as special in tokenizer_config.json / special_tokens_map.json.

4.1 Add them as normal tokens (not “special tokens”)

Correct pattern in code:

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("your/model-or-path", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("your/model-or-path")

new_tokens = ["<think>", "</think>", "<tool_response>", "</tool_response>"]

# Add as normal tokens, NOT special tokens
num_added = tok.add_tokens(new_tokens)
print("Added", num_added, "tokens")

# Resize model embeddings if any tokens were added
if num_added > 0:
    model.resize_token_embeddings(len(tok))

tok.save_pretrained("your-new-tokenizer")
model.save_pretrained("your-new-model")

Key points from HF docs and issues:

  • add_tokens() adds tokens to the vocab and ensures they are treated as atomic tokens by the tokenizer. Even if they aren’t marked “special”, they won’t get split into smaller BPE pieces. (GitHub)
  • add_special_tokens() is what you don’t want here, because that makes them special and then skip_special_tokens=True will remove them. (Hugging Face)

4.2 Don’t put them in additional_special_tokens

After adding them as normal tokens:

  • Do not call tok.add_special_tokens({"additional_special_tokens": ["<think>", "</think>"]}).
  • Do not add them as bos_token, eos_token, pad_token, etc.

If they’re not in additional_special_tokens and not assigned to special slots, they will not appear in all_special_tokens, and HF will not tell the fast tokenizer to treat them as special. (Hugging Face)

Result:

  • decode(..., skip_special_tokens=True) will not drop them.
  • They remain visible in decoded text.
  • You can still detect them yourself by string matching or by IDs.

4.3 How to verify

Quick test:

s = "<think> hello </think>"

enc = tok(s, add_special_tokens=False)
print(enc.input_ids, tok.convert_ids_to_tokens(enc.input_ids))

print("skip_special_tokens=False:", tok.decode(enc.input_ids, skip_special_tokens=False))
print("skip_special_tokens=True :", tok.decode(enc.input_ids, skip_special_tokens=True))

If everything is wired correctly, both decodes will still contain <think> and </think>.


5. What about Unsloth (training)?

Unsloth sits on top of Transformers and TRL’s SFTTrainer. Documentation and examples note that SFTTrainer can automatically add chat special tokens like <|im_start|>, <|im_end|> for Qwen-style models. (Hugging Face)

Important points for your case:

  • Unsloth mostly interacts with the tokenizer through the Transformers API:

    • AutoTokenizer.from_pretrained(...)
    • tokenizer.apply_chat_template(...)
    • tokenizer.eos_token, pad_token, etc.
  • If you add <think> / <tool_response> as normal tokens before passing the tokenizer to Unsloth, Unsloth will see them as part of the vocab and can use them in your training data.

  • As long as you don’t add them via add_special_tokens or put them in additional_special_tokens, Unsloth will not treat them as special and they will not be skipped on decode.

So for Unsloth:

  1. Prepare a model + tokenizer where <think> is in vocab but not “special”.
  2. Use that as the base for fine-tuning.
  3. Make sure you push or save both updated model and tokenizer so Unsloth loads the same artifacts.

6. What about vLLM (inference)?

vLLM has its own tokenizer wrapper, but it is still driven by HF artifacts:

  • It uses HF tokenizers (or model-specific tokenizers) underneath. The API exposes decode(ids, skip_special_tokens=...), with skip_special_tokens=None meaning “use backend’s default”. (docs.vllm.ai)
  • Older vLLM versions always skipped special tokens; newer ones make this configurable, but some tokenizers still insist on skip_special_tokens=True (e.g. Mistral in certain versions). (GitHub)
  • vLLM expects tokenizer_config.json / chat templates to be present for chat-style models; missing entries can break compatibility (example: InternVL requiring extra image tokens). (Hugging Face)

For your goal (“do not drop <think>, <tool_response>”):

  • Again, the crucial thing is whether those IDs are registered as special in the underlying tokenizer.

  • If you keep them as normal tokens (not special) in tokenizer.json and never add them as special via Transformers, then:

    • The fast tokenizer will not consider them special.
    • vLLM’s decode(..., skip_special_tokens=True) will not drop them either.

If you want extra safety, you can:

  • Call vLLM’s generation with an explicit parameter to not skip special tokens, as sketched below (some APIs and versions support skip_special_tokens=False, though there have been bugs around this for particular models). (GitHub)
  • Or post-process on the server side by doing your own AutoTokenizer.decode(..., skip_special_tokens=False) on the raw token IDs.
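
A sketch of the first option with vLLM’s offline LLM API (parameter support can differ across vLLM versions, and the model path is a placeholder):

from vllm import LLM, SamplingParams

llm = LLM(model="path/to/your-finetuned-model")
params = SamplingParams(max_tokens=128, skip_special_tokens=False)
outputs = llm.generate(["<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n"], params)
print(outputs[0].outputs[0].text)   # <think> / <tool_response> stay visible in the text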

7. Practical checklist for your setup

  1. Restore and keep tokenizer_config.json and special_tokens_map.json.

    • You need them so that all_special_tokens, eos_token, pad_token, etc. remain defined and Unsloth / TRL / vLLM work smoothly. (NVIDIA Docs)
  2. Ensure only the tokens you truly want hidden are “special”:

    • EOS, PAD, chat boundary tokens, vision pad tokens, etc. live in additional_special_tokens and the named special slots (eos_token, pad_token, …).
    • Those will be skipped by skip_special_tokens=True.
  3. Add <think>, <tool_response> as non-special:

    • Use tokenizer.add_tokens([...]).
    • Do not include them in additional_special_tokens.
    • Do not assign them to any named special token attribute.
  4. Resize model embeddings after adding tokens.

  5. Save the updated tokenizer & model, and use the same directory for:

    • Unsloth training (from_pretrained).
    • vLLM serving (vllm serve path-or-id).
  6. Verify behavior with a small test:

    • In Transformers: decode with skip_special_tokens=True and False.
    • In vLLM: generate a short response containing <think> and ensure it appears in the streamed output.

Key references

  • HF Transformers tokenizer docs – add_tokens, add_special_tokens, and special-token behavior. (Hugging Face)
  • HF Tokenizers docs – special tokens, skip_special_tokens description. (Hugging Face)
  • StackOverflow answer – “Don’t add as special token if you don’t want decode to strip it; use AddedToken / add_tokens instead.” (Stack Overflow)
  • TRL / SFTTrainer docs – how they update tokenizer and special tokens for Qwen-like chat models. (Hugging Face)
  • vLLM tokenizer API – decode with skip_special_tokens, relying on backend tokenizer semantics. (docs.vllm.ai)

Is there an interface that can remove tokens?
I ask because I added the new tokens and resized the embedding layer, but the finetuning did not go very well; maybe I did not finetune the new tokens’ embeddings.


Whether you’re increasing or decreasing the number of tokens, resize_token_embeddings is the interface I usually see.


You are basically right: if you just “add tokens + resize embeddings” and don’t explicitly make those new rows trainable and well-used in the data, their embeddings can stay bad and hurt performance.

Let me break down both parts of your question:

  1. “Is there an interface which can reduce token?” (shrink back / remove tokens)
  2. “Maybe I didn’t fine-tune the new token’s embedding” (how to fix that, especially with Unsloth + vLLM)

1. Can you “reduce tokens” after adding them?

1.1 What the official API supports

On the model side, Transformers exposes exactly one official hook for changing vocab size:

model.resize_token_embeddings(new_num_tokens)

This works for:

  • Increasing vocab (add rows to embedding + lm_head).
  • Decreasing vocab (drop rows from the end). (Hugging Face)

On the tokenizer side, there is no nice high-level “remove these tokens and reindex everything” API. You can:

  • Add tokens (tokenizer.add_tokens, tokenizer.add_special_tokens). (Hugging Face)
  • Manually edit tokenizer.vocab, tokenizer.encoder, added_tokens_encoder (for some tokenizers) if you really want to hack things, but this is not supported and easy to break. A long-standing GitHub issue on “How to remove token?” is explicitly marked “wontfix”; the suggested workaround is manual dictionary surgery. (GitHub)

So:

  • There is a way to change the size of the embedding matrix (resize_token_embeddings).
  • There is no safe, general interface to “remove arbitrary tokens from the middle of the vocab and renumber everything”.

1.2 What happens if you shrink vocab size?

If you call:

# Suppose tokenizer has grown to 151670 tokens:
model.resize_token_embeddings(new_num_tokens=151640)

then:

  • The model’s input embeddings and output lm_head are re-created with the first 151640 rows.
  • Whatever was in the last rows is simply dropped.

This assumes:

  • The tokens you want to remove are exactly the last ones (which is usually true if you only ever appended new tokens).
  • The tokenizer is updated consistently so that its vocab size is also 151640 and it never emits IDs ≥ 151640.

Even then, shrinking has some risk:

  • There are bug reports where shrinking embeddings caused shape mismatches or CUDA crashes for some architectures (T5, etc.). (GitHub)
  • If you saved checkpoints at the larger size and then load them with a smaller-size model, you must handle the mismatch manually.

There is also a community recipe that shows how to manually remove unwanted tokens and corresponding rows from the embeddings to reduce model size, but it’s advanced, easy to get wrong, and specific to a given model/tokenizer. (Medium)

1.3 Practical answer for your use case

Given that you:

  • Added a small number of tokens (<think>, <tool_response>, etc.).
  • Are not happy with the fine-tuned result.

The simplest safe options are:

  1. Revert to the original base model + original tokenizer and re-finetune without those tokens (if you decide you don’t actually need them as separate IDs).

  2. Or keep the extra tokens, but fix the embedding training, instead of trying to shrink the vocab:

    • That avoids all the brittle “remove tokens” surgery.
    • It’s what most people do when customizing control tokens.

If you really want to “undo” the addition:

  • Restore the original tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) from the base model.
  • Restore the original base model weights (or re-download them).
  • Then start a new fine-tune run from that clean base.

Trying to surgically remove tokens from your current checkpoint is possible but not worth it unless you’re very comfortable doing custom model surgery.


2. Why your new tokens may not be trained (and how to fix it)

Your suspicion is very plausible:

“maybe I did not finetune the new token’s embedding.”

Especially in a LoRA / PEFT / Unsloth setup, this happens a lot.

2.1 How new token embeddings are initialized

Standard Hugging Face behavior:

  1. You add tokens to the tokenizer:

    num_new = tokenizer.add_tokens(["<think>", "</think>", "<tool_response>", "</tool_response>"])
    
  2. You resize embeddings:

    model.resize_token_embeddings(len(tokenizer))
    
  3. Transformers creates a bigger embedding matrix:

    • Old rows are copied over.
    • New rows are initialized randomly (often from a normal distribution). (Hugging Face)

HF forum discussions and docs explicitly say: to make these useful, you must fine-tune so that the new rows get gradients; otherwise they stay near random. (Hugging Face Forums)

Unsloth adds a more advanced initializer: in their resizing code, new embeddings are initialized from a multivariate normal with the mean and covariance of the old embeddings (vocabulary expansion trick from Hewitt’s paper). (GitHub)
That’s better than pure random, but it still needs training to become meaningful.

2.2 Why fine-tuning may not be updating the new embeddings

There are two main failure modes:

(A) The data doesn’t contain the new tokens enough

  • Gradients for a token embedding are only generated when that token ID appears in the input / labels.
  • If <think> appears just a handful of times, its embedding hardly moves.
  • HF forum discussions about “training embeddings of tokens” emphasize that you need enough occurrences, otherwise the embeddings stay poor. (Hugging Face Forums)

You can fix this by:

  • Making sure your training prompts actually contain <think> / <tool_response> many times.
  • Possibly oversampling examples that use them (a quick counting check is sketched after this list).
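
As a quick sanity check, you can count how often the new IDs actually occur in your tokenized training data (a sketch; dataset and its "text" field are placeholders for however your data is stored):

from collections import Counter

think_id = tokenizer.convert_tokens_to_ids("<think>")
counts = Counter()
for example in dataset:
    counts.update(tokenizer(example["text"], add_special_tokens=False).input_ids)
print("occurrences of <think> in the training set:", counts[think_id])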

(B) You are using LoRA / PEFT and embeddings are frozen

With PEFT / LoRA (which Unsloth uses under the hood):

  • By default, the base model weights (including input embeddings) are frozen.
  • Only adapter parameters (low-rank matrices) are trainable.

If you add new tokens:

  • The new embedding rows live in the base embedding matrix (not in the LoRA adapters).
  • If the embedding module is frozen, those rows never get updated — they stay in their initial distribution.

There is a direct StackOverflow + PEFT answer about this exact point: if you resize token embeddings and only do LoRA, you need to mark embeddings as trainable via modules_to_save=["embed_tokens"] in LoraConfig, otherwise the base embeddings (including new tokens) remain untrained. (Stack Overflow)

With Unsloth specifically:

  • You typically load the model via FastLanguageModel.from_pretrained(...). (Medium)
  • If you add tokens and call model.resize_token_embeddings(len(tokenizer)), the new rows are created in the base embeddings.
  • If your LoRA config doesn’t include embed_tokens (or similar) as a module to save/train, those new rows will not be updated.

So yes: your suspicion is likely correct — in a typical Unsloth LoRA setup, new tokens do not get useful embeddings unless you explicitly configure them to be trainable.

2.3 How to train the new embeddings properly

The general recipe (HF + PEFT):

  1. Add tokens and resize embeddings before creating / patching the PEFT model.

  2. In your LoRA / PEFT config, include embeddings as modules to save/train, for example:

    from peft import LoraConfig
    
    lora_config = LoraConfig(
        r=...,
        lora_alpha=...,
        lora_dropout=...,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        modules_to_save=["embed_tokens"],   # <-- important
    )
    

    This makes the embedding layer trainable and ensures updated embeddings are saved in your PEFT checkpoint. (Stack Overflow)

  3. Make sure your training data actually uses <think> / <tool_response> plenty of times.

  4. Train for enough steps so those embeddings converge.

Unsloth has helpers like add_new_tokens(model, tokenizer, new_tokens=...) that integrate resizing and initialization; you still need to ensure LoRA config allows embeddings to update. (GitHub)
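
A quick way to confirm the setup worked is to check which parameters are actually trainable after wrapping with PEFT (a sketch; it assumes the model and lora_config from the snippet above):

from peft import get_peft_model

peft_model = get_peft_model(model, lora_config)
# the embedding rows (and lm_head, if you also listed it) should show requires_grad = True
for name, p in peft_model.named_parameters():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, "requires_grad =", p.requires_grad)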


3. How vLLM fits into this

You mentioned:

“I use vLLM for inference after finetuning, and I don’t know which one it uses, either.”

vLLM’s documentation is clear:

  • Tokenizer: vLLM simply uses the Hugging Face tokenizer loaded with AutoTokenizer. (docs.vllm.ai)
  • Model weights: it loads standard HF model checkpoints (PyTorch / safetensors).

So:

  • Whatever tokenizer + vocab + embeddings you used during Unsloth fine-tuning must be the same ones you ship to vLLM.
  • There is no separate “vLLM-only interface” to reduce tokens; vLLM just respects whatever vocab and token IDs exist in the Hugging Face artifacts you give it.

If the new tokens have bad embeddings, vLLM will produce bad generations involving them, exactly like Transformers would.


4. What I would do in your situation

Given your notes, here is a concrete plan:

Option 1 – Keep the new tokens, fix training (recommended if you need them)

  1. Start from the original base model again (or from your current, but the cleanest is the base).
  2. Add your custom tokens (<think>, <tool_response>, etc.) using tokenizer.add_tokens (not add_special_tokens) and model.resize_token_embeddings(len(tokenizer)).
  3. In Unsloth / PEFT, configure LoRA with modules_to_save=["embed_tokens"] (and optionally include embed_tokens in target_modules if needed). (Stack Overflow)
  4. Make sure your dataset uses the new tokens very frequently.
  5. Fine-tune again.
  6. Export model + tokenizer together, and point vLLM at that directory.

Option 2 – Decide you don’t need the new tokens and revert

  1. Restore the original tokenizer (tokenizer.json, tokenizer_config.json, special_tokens_map.json) from the base model. (stephendiehl.com)
  2. Restore the original base checkpoint.
  3. Treat <think> and <tool_response> as multi-token strings instead of new IDs (just literal text in the prompt), accepting slightly longer sequences.
  4. Fine-tune without changing vocab.

This avoids all embedding and resizing complexity.

Option 3 – Advanced: actually shrink embeddings

Only if you are comfortable doing low-level surgery:

  • Confirm that the tokens you added are exactly at the end of the vocab.
  • Adjust tokenizer so those tokens are no longer present (or rebuild a tokenizer without them).
  • Call model.resize_token_embeddings(new_smaller_vocab_size) to drop the trailing rows.
  • Ensure lm_head and any tied embeddings are shrunk consistently. (Medium)

This is doable but fragile, and usually not necessary unless you are chasing model size reductions for deployment.


Summary

  • There is no clean, general “remove token” API. You only have model.resize_token_embeddings(new_num_tokens) and manual tokenizer surgery; shrinking is possible but brittle and only removes rows at the end. (Hugging Face)

  • Your poor fine-tuning is very likely because the new token embeddings were never really trained:

    • New rows are random/mean-initialized.
    • LoRA / PEFT often keeps embeddings frozen unless you explicitly mark them as trainable (modules_to_save=["embed_tokens"]).
    • If the new tokens don’t appear frequently in the training data, their embeddings get almost no gradient. (Hugging Face Forums)
  • vLLM simply uses the same Hugging Face tokenizer and model you trained, so any bad embeddings will show up in its outputs too. (docs.vllm.ai)
