This model frequently generates nonsense sequences

#1
by CharlesChen2023 - opened

I am encountering an issue with a quantized version of the [Model Name] model. The model frequently generates nonsense sequences (e.g., 人事, 出生); in this case, the expected output for both was '(.*?).

Intel org

Thank you for the information. Would you mind sharing the serving command and the evaluation prompts, so we can use them to evaluate model quality when producing a new quantized version?
The issue is being tracked here: https://github.com/intel/auto-round/issues/1480

```shell
/root/miniconda3/envs/vllm-glm-int4/bin/python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_ID \
    --served-model-name claude-opus-4-6 \
    --port 80 \
    --trust-remote-code \
    --max-model-len 202752 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-num-seqs 16
```
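Regarding the evaluation prompts requested above, a minimal sketch of a client request against this OpenAI-compatible vLLM endpoint could look like the following. The host URL and prompt text are illustrative assumptions; the model name comes from the `--served-model-name` flag in the command above:

```python
import json

def build_chat_request(prompt, model="claude-opus-4-6", max_tokens=256):
    """Build an OpenAI-compatible chat-completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding keeps regressions reproducible
    }

payload = build_chat_request("Extract the text between quotes with a lazy regex.")
print(json.dumps(payload, indent=2))

# To actually send it (the server above listens on port 80; host is an assumption):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:80/v1/chat/completions",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```

Running the same fixed prompts at temperature 0 against each new quantized build makes outputs directly comparable across versions.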

Nonsense characters appear here; the output should have been "in/output".

Intel org

Could you share some text inputs that reproduce this issue?

It is difficult to reproduce; I use the model through Claude Code, but cases like this show up often.

Intel org

Please expect a delayed fix, since our server is currently very busy and I typically do not have enough resources to verify such a large model.

ព is one of the letters of the Khmer alphabet.

We’re working on a fix for this issue. An updated model will be uploaded within about one day. Since the model is too large for thorough testing, we’ve adjusted two factors to mitigate the problem:

1. Changed the model dtype from FP16 to BF16. FP16 can overflow, but it was previously the only option during quantization.

2. Reduced the group size from 128 to 64.
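Both mitigations can be illustrated with a short sketch. The overflow part is a plain FP16 arithmetic fact; the group-size part uses a toy per-group symmetric 4-bit quantizer (my own illustration, not AutoRound's actual kernel) to show why smaller groups reduce error when small and large weights share a block:

```python
import numpy as np

# FP16 overflow: float16 tops out at ~65504, so sums that BF16 handles
# fine (BF16 keeps FP32's 8-bit exponent range) become inf in FP16.
acc = np.float16(60000) + np.float16(60000)
print(acc)  # inf

# Group-size effect: per-group symmetric 4-bit quantize/dequantize.
# A smaller group isolates outliers, so small weights keep a finer scale.
def quant_dequant(x, group_size):
    out = np.empty_like(x)
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size]
        scale = np.abs(g).max() / 7.0  # 4-bit symmetric levels: -7..7
        out[i:i + group_size] = np.round(g / scale) * scale
    return out

# A block of small-magnitude weights next to a block of large ones:
x = np.concatenate([np.linspace(0.01, 0.1, 64),
                    np.linspace(10.0, 100.0, 64)]).astype(np.float32)
err128 = np.abs(quant_dequant(x, 128) - x).mean()
err64 = np.abs(quant_dequant(x, 64) - x).mean()
print(err128, err64)  # the 64-wide groups give the lower mean error here
```

With group size 128, the single large scale flattens the small weights to zero; with group size 64, each block gets a scale matched to its own range.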
