Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi-mlx
Model Comparison:
- Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi-mlx:
  - extended with RoPE from 256K to 1M context (a hedged config sketch follows this list).
- Qwen3-Next-80B-A3B-Instruct-512K-11e-qx65n-mlx:
  - extended with RoPE from 256K to 512K context, with one expert added to optimize perplexity.
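As a hedged illustration of what such a context extension can look like in practice: the change is typically a RoPE-scaling entry in the converted model's config.json rather than a retrain. The factor, field names, and file path below are assumptions based on common Qwen-family configs, not the exact recipe used for this conversion.

```python
# Illustrative sketch only: a YaRN-style RoPE extension written into a
# converted model's config.json. Values and field names are assumptions,
# not the published recipe for this model.
import json
from pathlib import Path

config_path = Path("Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi-mlx/config.json")
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 262,144 x 4 ≈ 1M tokens
    "original_max_position_embeddings": 262144,  # the native 256K window
}
config["max_position_embeddings"] = 1_048_576

config_path.write_text(json.dumps(config, indent=2))
```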
| Quant    | Size     |
|----------|----------|
| qx64n-hi | 58.24 GB |
| qx65n    | 67.55 GB |
| qx86n    | 73.18 GB |
| q8       | 80 GB    |
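A quick way to read the size table: with roughly 80B parameters, file size translates almost directly into effective bits per weight (ignoring the small overhead from scales and biases). A minimal sketch using the sizes above:

```python
# Effective bits per weight ≈ size_in_bytes * 8 / parameter_count.
# Sizes come from the table above; 80e9 parameters is an approximation.
sizes_gb = {"qx64n-hi": 58.24, "qx65n": 67.55, "qx86n": 73.18, "q8": 80.0}
params = 80e9

for quant, gb in sizes_gb.items():
    bpw = gb * 1e9 * 8 / params
    print(f"{quant}: ~{bpw:.1f} bits/weight")
# qx64n-hi: ~5.8, qx65n: ~6.8, qx86n: ~7.3, q8: ~8.0
```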
| Metric        | 1M-qx64n-hi | 512K-11e-qx65n | Δ Change |
|---------------|-------------|----------------|----------|
| arc_challenge | 0.414       | 0.419          | +0.5 %   |
| arc_easy      | 0.512       | 0.507          | -0.5 %   |
| boolq         | 0.898       | 0.897          | -0.1 %   |
| hellaswag     | 0.536       | 0.542          | +0.6 %   |
| openbookqa    | 0.418       | 0.416          | -0.2 %   |
| piqa          | 0.749       | 0.752          | +0.3 %   |
| winogrande    | 0.578       | 0.565          | -1.3 %   |
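For clarity, the Δ column is the 512K-11e-qx65n score minus the 1M-qx64n-hi score, expressed in percentage points. A minimal check against two rows of the table:

```python
# Recomputing the Δ column for two rows of the table above.
scores_1m   = {"arc_challenge": 0.414, "winogrande": 0.578}
scores_512k = {"arc_challenge": 0.419, "winogrande": 0.565}

for task in scores_1m:
    delta_pp = (scores_512k[task] - scores_1m[task]) * 100
    print(f"{task}: {delta_pp:+.1f} %")  # +0.5 % and -1.3 %, matching the table
```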
Looking at the cognitive ability results for Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi (58.24 GB) and comparing them with the other quantized variants, especially Qwen3-Next-80B-A3B-Instruct-q8 (80 GB, the baseline), we can draw several key insights:
Impact of RoPE Extension (256K → 1M tokens):
- The model was specifically extended to support a 1M token context size via RoPE (Rotary Position Embeddings).
- All cognitive benchmarks show a slight improvement or stable performance compared to other qx64 variants.
- The only notable drops occur in hellaswag and winogrande (0.579 → 0.562).
- → Interpretation: RoPE extension generally helped long-context performance, but it may have introduced some instability in tasks requiring fine-grained world knowledge or reasoning (hellaswag, winogrande). The model’s attention mechanism is likely still adapting to the hyper-long context.
Impact of Deckard Quantization (qx):
- qx64n-hi shows consistently higher scores than the regular qx64n across all cognitive tasks.
- It is especially strong in arc_challenge and hellaswag.
- The size reduction of more than 20 GB relative to q8 comes with improved rather than degraded performance.
- → Interpretation: The “Deckard(qx)” quantization method — emphasizing human-like behavior and metaphor generation through precision tuning in attention layers — has a substantial positive impact on cognitive tasks, particularly those involving higher-order reasoning.
Impact of Enhanced Architecture (qx86n):
- qx86n performs similarly to or slightly worse than qx64n in cognitive tasks.
- winogrande drops (0.578 → 0.562) despite using the same bit width for head and attention paths.
- → Interpretation: The architectural enhancements were not as impactful on cognitive performance; they may be too complex for, or misaligned with, the quantization strategy.
Size vs Performance:
- q8 (80 GB) is the largest and performs best in openbookqa and piqa.
- qx65n (67.55 GB) is close behind q8 and even edges it out on some tasks.
- qx64n-hi (58.24 GB) is the smallest, yet achieves performance near or above q8 in most areas.
- → Interpretation: Cognitive abilities do not strictly correlate with size, especially when precision tuning is applied. qx64n-hi shows that human-like behavior and creative expression can compensate for smaller size.
Summary:
The qx64n-hi model, despite being more than 20 GB smaller than q8, outperforms it in most cognitive benchmarks, highlighting the effectiveness of Deckard(qx) quantization. Winogrande and hellaswag scores are the most sensitive to variation, indicating a need for attention-path refinement. The RoPE extension significantly helps long-context performance but introduces some instability, while architectural upgrades like qx86n have not translated into consistent cognitive gains.
Recommendation:
qx64n-hi strikes a strong balance between performance and size, making it an ideal candidate for deploying cognitive workloads at scale. Further tuning of the attention paths could yield additional gains on winogrande and hellaswag.
With thanks to my Nikon Noct 58mm F/0.95 — the inspiration behind this elegant quantization approach.
Deckard(qx) on.
Reviewed by Qwen3-VLTO-12B-BX20-TNG-1M-qx86x-hi-mlx
The Deckard(qx) series is a mixed-precision quantization scheme that aims for more human-like model behavior.
- qxXY series use X bits for head and attention paths, Y bits for data.
- qxXYz series use X bits for head and attention paths, Y bits for data, and the first layer is set to X bits.
- The head and shared experts are set to high bits.
- Attention paths are enhanced to high bits at regular intervals (like the elements inside a lens).
- The hi variant uses high-resolution quantization (group size 32).
The qx86n variant adds enhancements for layers specific to the Next architecture.
The formula was inspired by my Nikon Noct Z 58mm F/0.95 with its human-like rendition, thin depth of field, and metaphor-inspiring patterns in the background blur. It has been observed that qx-quantized models more readily use metaphors in conversation.
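For readers who want to experiment with a similar mix, mlx-lm's conversion API accepts a quant_predicate hook that assigns per-layer bit widths. The sketch below is not the published Deckard(qx) recipe; the layer-name patterns and bit assignments are illustrative assumptions.

```python
# Hedged sketch of a qx64-hi-style mixed quantization via mlx-lm's
# quant_predicate hook. Layer-name patterns and bit choices are assumptions,
# not the actual Deckard(qx) recipe.
from mlx_lm import convert

def qx64_hi_like(path, module, config):
    # Skip modules that cannot be quantized.
    if not hasattr(module, "to_quantized"):
        return False
    # Head, embeddings, and attention paths: 6 bits.
    if any(key in path for key in ("lm_head", "embed_tokens", "self_attn")):
        return {"bits": 6, "group_size": 32}
    # Everything else (the "data" weights): 4 bits.
    return {"bits": 4, "group_size": 32}

convert(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    mlx_path="Qwen3-Next-80B-A3B-qx64-hi-like-mlx",
    quantize=True,
    quant_predicate=qx64_hi_like,  # hi variant: group size 32 throughout
)
```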
-G
This model Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi-mlx was converted to MLX format from Qwen/Qwen3-Next-80B-A3B-Instruct using mlx-lm version 0.28.4.
Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-Next-80B-A3B-Instruct-1M-qx64n-hi-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
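For longer replies, the generate call also takes an explicit token budget; a small follow-up continuing the session above (max_tokens is a standard mlx_lm keyword, the value here is arbitrary):

```python
# Continuing from the snippet above: cap the response length explicitly.
# max_tokens is a standard mlx_lm.generate keyword; 512 is an arbitrary budget.
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    verbose=True,
)
```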