Original model is just 600gb
Is there not a way to keep Q8_0 on par with that value?
Is there not a way to keep Q8_0 on par with that value?
No, it's just how llama.cpp works. If you want to run near full precision, use the Q4K_XL quant. It's ~98% there.
The model was 600GB because, since Kimi K2 Thinking, Moonshot has been shipping their models with INT4 quantization.
Do you guys think it's possible for the llama.cpp guys to provide more "apples to apples" size parity?
no...
models store weight data in bits. full precision is 32 bits, and the most common is 16 bits. since a byte is 8 bits, a 16-bit model takes 2 bytes per parameter, so its size in bytes is 2x its parameter count.
by that logic, an 8-bit model's size in bytes equals its parameter count, and a 4-bit model's is half its parameter count.
thus, there isn't really a way to shrink the file other than reducing the number of bits per weight. you could prune the model with REAP, but that wouldn't be the same model. you could do tricks at inference time to make it faster, but it will ultimately be this size for Q8_0
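The arithmetic above can be sketched quickly. The 8.5 and 4.5 bits-per-weight figures below reflect llama.cpp's block formats (each 32-weight block also stores an fp16 scale), so the estimates land slightly above the naive bits/8 value; the 1T parameter count is just an illustrative round number:

```python
# Rough weights-only size estimator from parameter count and bits per weight.
# Q8_0: 32 int8 weights + 1 fp16 scale = 34 bytes / 32 weights = 8.5 bpw.
# Q4_0: 16 packed bytes of 4-bit weights + 1 fp16 scale = 18 bytes / 32 = 4.5 bpw.
BPW = {"F32": 32.0, "BF16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

def model_size_gb(n_params: float, fmt: str) -> float:
    """Approximate on-disk size in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BPW[fmt] / 8 / 1e9

# A hypothetical 1-trillion-parameter model:
for fmt in ("BF16", "Q8_0", "Q4_0"):
    print(fmt, model_size_gb(1e12, fmt), "GB")
```

This ignores metadata and the small tensors that quantizers usually keep at higher precision, so real GGUF files come out a bit larger.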
full precision is 32 bits
No one trains models in 32-bit, so full precision is BF16, except for models whose native quant is 8-bit (DeepSeek, MiniMax) or 4-bit (gpt-oss, Kimi K2).
I don't think it makes sense to provide weights above Q4 for a native INT4 model; it leads to confusion, with people expecting higher quality from, say, Q5 that simply isn't there to recover. It's like encoding 1.10 instead of 1.1
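The "1.10 instead of 1.1" point can be made concrete with a toy round-to-nearest quantizer (the `scale` and weight values below are made up for illustration): weights that already sit on a 4-bit grid come back unchanged from any finer grid.

```python
def quantize(x: float, bits: int, scale: float) -> float:
    """Toy symmetric round-to-nearest quantization onto a `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale

scale = 0.1
# Snap some arbitrary weights onto a 4-bit grid first (the "native INT4" model):
int4_native = [quantize(w, 4, scale) for w in (0.234, -0.512, 0.701)]

# Re-quantizing them at 5 or 8 bits reproduces exactly the same values:
for w in int4_native:
    assert quantize(w, 5, scale) == w
    assert quantize(w, 8, scale) == w
```

The extra bits of a wider grid only add representable values the model never uses, so a higher quant of an INT4-native model spends storage without recovering any quality.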
No one trains models in 32-bit, so full precision is BF16
full precision is indeed 32 bits, but yes, it's much more common to use 16-bit (half or mixed precision) instead of full precision
It would be appreciated (and less confusing) to stop listing quant variants once they exceed the quality of the original. Pretty sure doing so contradicts the very idea of quantization?
Edit: My bad. I see you have in the model card "To run the model in full precision, you can use the 4-bit or 5-bit quants. You can use any higher just to be safe."
Is it like a Q3 or even Q2 gets you the theoretical quality of a Q4?
+1. The original model weighs in at 600GB. Did you get the uncompressed 2TB BF16 model from somewhere? If you're using the same model, then there's no point in downloading any quantization option higher than Q4K_XL.
llama.cpp's convert_hf_to_gguf will decompress the INT4 compressed-tensors into a 2.1TB BF16. At that point, though, you don't want to quantize the conditional experts any higher than Q4_0, since those are wasted bits.
I've produced a Q4_X quant (following @ubergarm 's naming convention) for Kimi-K2.5 that is "full fidelity" - Q8_0 as the default quantization type, and Q4_0 for the experts, and it's 583GB. That's as close as you're going to get to a true minimally-quantized model.
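A quick back-of-envelope for why a mixed quant like that lands near this size: everything at Q8_0 (8.5 bpw, including block scales) except the routed experts at Q4_0 (4.5 bpw). The parameter split below is a made-up illustration, not Kimi-K2.5's actual architecture:

```python
def mixed_size_gb(expert_params: float, other_params: float) -> float:
    """Weights-only size of a Q8_0-everywhere / Q4_0-experts mixed quant, in GB."""
    Q8_0_BPW, Q4_0_BPW = 8.5, 4.5  # llama.cpp block formats, scales included
    total_bits = other_params * Q8_0_BPW + expert_params * Q4_0_BPW
    return total_bits / 8 / 1e9

# Hypothetical 1T-parameter MoE with 960B routed-expert weights:
print(mixed_size_gb(expert_params=960e9, other_params=40e9), "GB")
```

Because the routed experts dominate the parameter count in a large MoE, the mixed quant sits much closer to a pure Q4_0 than to a pure Q8_0.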
Does the model being trained in INT4 make it extremely sensitive to quantization? 583GB is still a lot for me.
I would think the fewer bits per weight there are to work with, the more degradation quantization causes (not an expert on this)
I would think the fewer bits per weight there are to work with, the more degradation quantization causes (not an expert on this)
This is typically the trend; however, if QAT was used on the original model, it is important to match that target quantization scheme as closely as possible, which is what AesSedai's Q4_X does for this model.
I have some more discussion on the exact llm-compressor config here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/discussions/9#6984ae704490533d25a3aa6e
I have a more general talk on quantization which you might like here: https://blog.aifoundry.org/p/adventures-in-model-quantization
Cheers!