Original model is just 600gb
Is there not a way to keep Q8_0 on par with that value?
Is there not a way to keep Q8_0 on par with that value?
No, it's just how llama.cpp works. If you want to run near full precision, use the Q4K_XL quant. It's ~98% there.
The model was 600GB because, since Kimi K2 Thinking, Moonshot has been shipping their models with INT4 quantization.
Do you guys think it's possible for the llama.cpp guys to provide more "apples to apples" size parity?
no...
models store weight data in bits. full precision is 32 bits, and the most common is 16 bits. since a byte is 8 bits, a 16-bit model takes 2 bytes per parameter, so its size in bytes is 2x its parameter count.
by that logic, an 8-bit model's size in bytes equals its parameter count, and a 4-bit model's is half its parameter count.
thus, there isn't really a way to shrink the file other than reducing the number of bits per weight. you could prune the model with REAP, but that wouldn't be the same model. you could do tricks at inference time to make it faster, but it will ultimately be this size for Q8_0
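The arithmetic above can be sketched quickly. The 8.5 and 4.5 bits-per-weight figures below reflect llama.cpp's block formats (each 32-weight block also stores an fp16 scale), so the estimates land slightly above the naive bits/8 value; the 1T parameter count is just an illustrative round number:

```python
# Rough weights-only size estimator from parameter count and bits per weight.
# Q8_0: 32 int8 weights + 1 fp16 scale = 34 bytes / 32 weights = 8.5 bpw.
# Q4_0: 16 packed bytes of 4-bit weights + 1 fp16 scale = 18 bytes / 32 = 4.5 bpw.
BPW = {"F32": 32.0, "BF16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

def model_size_gb(n_params: float, fmt: str) -> float:
    """Approximate on-disk size in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BPW[fmt] / 8 / 1e9

# A hypothetical 1-trillion-parameter model:
for fmt in ("BF16", "Q8_0", "Q4_0"):
    print(fmt, model_size_gb(1e12, fmt), "GB")
```

This ignores metadata and the small tensors that quantizers usually keep at higher precision, so real GGUF files come out a bit larger.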
full precision is 32 bits
No one trains models in 32-bit, so full precision is BF16, except for models whose native quant is 8-bit (DeepSeek, MiniMax) or 4-bit (gpt-oss, Kimi K2).
I don't think it makes sense to provide weights above Q4 for a native INT4 model; it leads to confusion, with people expecting higher quality from, say, Q5 that simply isn't there to recover. It's like encoding 1.10 instead of 1.1
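The "1.10 instead of 1.1" point can be made concrete with a toy round-to-nearest quantizer (the `scale` and weight values below are made up for illustration): weights that already sit on a 4-bit grid come back unchanged from any finer grid.

```python
def quantize(x: float, bits: int, scale: float) -> float:
    """Toy symmetric round-to-nearest quantization onto a `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale

scale = 0.1
# Snap some arbitrary weights onto a 4-bit grid first (the "native INT4" model):
int4_native = [quantize(w, 4, scale) for w in (0.234, -0.512, 0.701)]

# Re-quantizing them at 5 or 8 bits reproduces exactly the same values:
for w in int4_native:
    assert quantize(w, 5, scale) == w
    assert quantize(w, 8, scale) == w
```

The extra bits of a wider grid only add representable values the model never uses, so a higher quant of an INT4-native model spends storage without recovering any quality.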
No one trains models in 32-bit, so full precision is BF16
full precision is indeed 32 bits, but yes, it's much more common to use 16-bit (half or mixed precision) instead of full precision
It would be appreciated (and less confusing) to stop listing quant variants once they exceed the quality of the original. Pretty sure doing so contradicts the very idea of quantization?
Edit: My bad. I see you have in the model card "To run the model in full precision, you can use the 4-bit or 5-bit quants. You can use any higher just to be safe."
Is it like a Q3 or even Q2 gets you the theoretical quality of a Q4?
+1. The original model weighs in at 600GB. Did you get the uncompressed 2TB BF16 model from somewhere? If you're using the same model, then there's no point in downloading any quantization option higher than Q4K_XL.
llama.cpp's convert_hf_to_gguf will decompress the INT4 compressed-tensors into a 2.1TB BF16. At that point, though, you don't want to quantize the conditional experts any higher than Q4_0, since those are wasted bits.
I've produced a Q4_X quant (following @ubergarm 's naming convention) for Kimi-K2.5 that is "full fidelity" - Q8_0 as the default quantization type, and Q4_0 for the experts, and it's 583GB. That's as close as you're going to get to a true minimally-quantized model.
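A quick back-of-envelope for why a mixed quant like that lands near this size: everything at Q8_0 (8.5 bpw, including block scales) except the routed experts at Q4_0 (4.5 bpw). The parameter split below is a made-up illustration, not Kimi-K2.5's actual architecture:

```python
def mixed_size_gb(expert_params: float, other_params: float) -> float:
    """Weights-only size of a Q8_0-everywhere / Q4_0-experts mixed quant, in GB."""
    Q8_0_BPW, Q4_0_BPW = 8.5, 4.5  # llama.cpp block formats, scales included
    total_bits = other_params * Q8_0_BPW + expert_params * Q4_0_BPW
    return total_bits / 8 / 1e9

# Hypothetical 1T-parameter MoE with 960B routed-expert weights:
print(mixed_size_gb(expert_params=960e9, other_params=40e9), "GB")
```

Because the routed experts dominate the parameter count in a large MoE, the mixed quant sits much closer to a pure Q4_0 than to a pure Q8_0.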
Does the model being trained in INT4 make it extremely sensitive to quantization? 583GB is still a lot for me.
I would think the fewer bits per weight there are to work with, the more degradation quantization causes (not an expert on this)
I would think the fewer bits per weight there are to work with, the more degradation quantization causes (not an expert on this)
This is typically the trend; however, if QAT was used on the original model, it is important to match that target quantization scheme as closely as possible, which is what AesSedai's Q4_X does for this model.
I have some more discussion on the exact llm-compressor config here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/discussions/9#6984ae704490533d25a3aa6e
I have a more general talk on quantization which you might like here: https://blog.aifoundry.org/p/adventures-in-model-quantization
Cheers!