This model keeps producing occasional nonsense sequences
Thank you for the information. Would you mind sharing the serving command and the evaluation prompts, so we can use them to check model quality when producing a new quantized version?
The issue is tracked here: https://github.com/intel/auto-round/issues/1480
/root/miniconda3/envs/vllm-glm-int4/bin/python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_ID \
    --served-model-name claude-opus-4-6 \
    --port 80 \
    --trust-remote-code \
    --max-model-len 202752 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --max-num-seqs 16
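For the evaluation side, a minimal sketch of a client payload for this OpenAI-compatible endpoint might look like the following. The helper name and the prompt text are illustrative assumptions, not the actual evaluation prompts:

```python
import json

def build_chat_request(prompt: str, model: str = "claude-opus-4-6",
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload for the
    vLLM server launched above. The prompt text is only an example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Deterministic decoding makes quality regressions easier to compare
        # across quantized versions.
        "temperature": 0.0,
    }

payload = build_chat_request("Summarize the benefits of INT4 quantization.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:80/v1/chat/completions with any HTTP client.
```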
Could you share some text inputs to reproduce this issue?
It is difficult to reproduce; I use the model through Claude Code, but cases like this show up often.
Please expect a delayed fix, since our server is currently very busy and I typically do not have enough resources to verify such a large model.
ព is one of the letters of the Khmer alphabet.
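One cheap screen for this failure mode is to flag generated characters outside the expected scripts. A minimal sketch (the helper name is an assumption) that checks for code points in the Khmer Unicode block, U+1780..U+17FF:

```python
def contains_khmer(text: str) -> bool:
    """Return True if any character falls in the Khmer Unicode block
    (U+1780..U+17FF), which should not appear in English output."""
    return any(0x1780 <= ord(ch) <= 0x17FF for ch in text)

print(contains_khmer("hello world"))  # False
print(contains_khmer("ព"))            # True
```

A script-based check like this can be run over sampled completions to count how often the quantized model drifts into unexpected alphabets.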
We’re working on a fix for this issue. An updated model will be uploaded within about one day. Since the model is too large for thorough testing, we’ve adjusted two factors to mitigate the problem:
1. Changed the model dtype from FP16 to BF16. FP16 can overflow, but it was previously the only option during quantization.
2. Reduced the group size from 128 to 64.
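To illustrate why a smaller group size can help, here is a minimal sketch of group-wise symmetric round-to-nearest INT4 quantization (illustrative only, not auto-round's actual algorithm): each group of weights shares one scale, so halving the group size doubles the number of scales and usually lowers the reconstruction error.

```python
def quantize_group(values, num_bits=4):
    """Quantize one group of weights with a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 7 for signed int4
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(weights, group_size=64, num_bits=4):
    """Quantize then dequantize a flat weight list group by group,
    so the round-trip error can be measured against the original."""
    out = []
    for i in range(0, len(weights), group_size):
        q, scale = quantize_group(weights[i:i + group_size], num_bits)
        out.extend(v * scale for v in q)
    return out

weights = [0.01 * i for i in range(128)]
err = lambda gs: max(abs(a - b) for a, b in zip(weights, dequantize(weights, gs)))
print(err(64), err(128))  # smaller group size gives error no worse than larger
```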

