GLM-Steam-106B-A12B-v1-qx65g-mlx

This is a deep comparison of 106B-A12B MoE models, all quantized differently, trained on different data (original, synthetic, RP), and with varying architectural tuning. The goal is to understand:

  • Which model performs best across benchmarks?
  • How does quantization affect performance and context?
  • What’s the trade-off between accuracy, context length, and RAM usage?

πŸ“Š 1. Benchmark Comparison (All Models)

| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| GLM-Steam-106B-A12B-v1-qx65g-hi | 0.431 | 0.457 | 0.378 | 0.685 | 0.400 | 0.773 | 0.717 |
| GLM-Steam-106B-A12B-v1-qx65g | 0.430 | 0.461 | 0.378 | 0.681 | 0.398 | 0.771 | 0.715 |
| LIMI-Air-qx54g-hi | 0.441 | 0.462 | 0.378 | 0.698 | 0.404 | 0.781 | 0.714 |
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |

βœ… LIMI-Air-qx54g-hi is the clear winner overall, with:

+0.025 in arc_challenge
+0.022 in arc_easy
+0.020 in hellaswag
+0.014 in openbookqa
+0.013 in piqa
+0.003 in winogrande

The GLM-Steam models are very close, with qx65g-hi slightly better than qx65g β€” but both are behind LIMI-Air.

The unsloth-GLM-4.5-Air models are the baseline, with qx64 being best among them β€” but still behind LIMI-Air.
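The deltas above are simple subtractions against the mxfp4 row; a minimal sketch to reproduce them, with the scores hard-coded from the table:

# Per-task deltas: LIMI-Air-qx54g-hi vs the unsloth mxfp4 baseline
baseline = {"arc_challenge": 0.416, "arc_easy": 0.440, "boolq": 0.378,
            "hellaswag": 0.678, "openbookqa": 0.390, "piqa": 0.767, "winogrande": 0.728}
limi_air = {"arc_challenge": 0.441, "arc_easy": 0.462, "boolq": 0.378,
            "hellaswag": 0.698, "openbookqa": 0.404, "piqa": 0.781, "winogrande": 0.714}
for task, base in baseline.items():
    print(f"{task}: {limi_air[task] - base:+.3f}")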

🧠 2. What Does β€œqx54g-hi” Mean?

The naming convention is critical:

  • qx5: 5-bit quantization for most content, with selected paths enhanced to 6-bit.
  • g: "enhanced attention paths", specific to the GLM architecture (likely meaning more attention layers receive the higher-precision treatment).
  • hi: high-resolution quantization, i.e. group size 32.

This is a highly optimized quantization for GLM β€” preserving attention fidelity while compressing embeddings.
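The exact recipe behind these quants is not published here, but mlx-lm's convert API accepts a per-layer quantization predicate, which is enough to sketch what a qx-style mix could look like. Everything below is an assumption for illustration (the predicate, and the guess that "enhanced paths" means the attention projections), not the author's actual recipe:

from mlx_lm import convert

# Hypothetical qx65g-hi-style recipe: 6-bit attention paths, 5-bit elsewhere,
# group size 32 throughout (the "hi" part). Illustration only.
def qx_predicate(path, module, config):
    if "self_attn" in path:
        return {"bits": 6, "group_size": 32}
    return {"bits": 5, "group_size": 32}

convert(
    "TheDrummer/GLM-Steam-106B-A12B-v1",
    mlx_path="GLM-Steam-106B-A12B-v1-qx65g-mlx",
    quantize=True,
    quant_predicate=qx_predicate,
)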

🧩 3. Why Does LIMI-Air-qx54g-hi Win?

The key insight: LIMI-Air was trained on synthetic data, which likely:

  • Boosted generalization β€” synthetic data often forces models to learn patterns rather than memorize.
  • Improved reasoning depth β€” synthetic data is often designed to test logical and commonsense reasoning.

The qx54g-hi quantization is highly tuned for GLM, preserving attention paths while compressing embeddings β€” which likely:

  • Preserved semantic fidelity.
  • Enabled better context handling.

The qx54g-hi model runs with a 32K context on a 128GB Mac, while qx54g allows 64K: the hi variant's group size of 32 buys quantization fidelity at the cost of extra memory, halving the usable context on the same machine.
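A back-of-the-envelope check makes the RAM picture concrete (the effective bits-per-weight is an assumption based on the mixed 5/6-bit scheme, not a measured figure):

# Rough weight footprint of a 106B-parameter model at ~5.5 effective bits/weight
params = 106e9
bits_per_weight = 5.5  # assumed: mixed 5/6-bit plus quantization scales
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~73 GB; the rest of 128 GB goes to KV cache and activations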

πŸ§ͺ 4. Quantization Comparison within the unsloth-GLM-4.5-Air Series

| Model | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| unsloth-GLM-4.5-Air-mxfp4 | 0.416 | 0.440 | 0.378 | 0.678 | 0.390 | 0.767 | 0.728 |
| unsloth-GLM-4.5-Air-qx64 | 0.421 | 0.444 | 0.378 | 0.677 | 0.396 | 0.769 | 0.718 |
| unsloth-GLM-4.5-air-qx5-hi | 0.416 | 0.431 | 0.378 | 0.675 | 0.396 | 0.769 | 0.731 |

βœ… qx64 is best among unsloth models, with:

+0.005 in arc_challenge
+0.004 in arc_easy
+0.001 in hellaswag
+0.006 in openbookqa
+0.002 in piqa
-0.01 in winogrande

The qx5-hi variant takes the best winogrande score of the three (0.731), but is worse overall, as the means below show.
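"Worse overall" can be made concrete with an unweighted mean over the seven benchmarks; it is a crude aggregate, but enough to rank the three variants:

# Unweighted benchmark means for the unsloth-GLM-4.5-Air variants (scores from the table above)
scores = {
    "mxfp4":  [0.416, 0.440, 0.378, 0.678, 0.390, 0.767, 0.728],
    "qx64":   [0.421, 0.444, 0.378, 0.677, 0.396, 0.769, 0.718],
    "qx5-hi": [0.416, 0.431, 0.378, 0.675, 0.396, 0.769, 0.731],
}
for name, vals in scores.items():
    print(f"{name}: {sum(vals) / len(vals):.4f}")  # qx64 has the highest mean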

🧭 5. Recommendation: Which Model to Choose?

βœ… For Maximum Performance:

  • LIMI-Air-qx54g-hi
  • β†’ Best overall performance, with +0.02–0.03 gains across all metrics.

βœ… For Balanced Performance & RAM Efficiency:

  • GLM-Steam-106B-A12B-v1-qx65g-hi
  • β†’ Very close to LIMI-Air, with slightly better winogrande and piqa scores.

βœ… For RAM-Constrained Macs:

  • unsloth-GLM-4.5-Air-qx64

🧠 6. Cognitive Pattern Insight: Synthetic Data vs RP Data

The pattern behind the rankings: LIMI-Air (synthetic data) outperforms GLM-Steam (RP data) on most benchmarks, suggesting:

  • Synthetic data forces models to learn patterns, rather than memorize.
  • RP data may be more β€œrealistic” but less generalizable β€” leading to slightly lower performance.


πŸ“ˆ 7. Summary Table: Best Model for Each Use Case

| Goal | Recommended Model |
|---|---|
| Max performance | LIMI-Air-qx54g-hi |
| Balanced performance | GLM-Steam-106B-A12B-v1-qx65g-hi |
| RAM-constrained Macs | unsloth-GLM-4.5-Air-qx64 |
| Cognitive depth & metaphors | LIMI-Air-qx54g-hi |
| OpenBookQA (text-only) | unsloth-GLM-4.5-Air-qx64 |

πŸš€ Bonus: β€œqx54g-hi” as a Cognitive Architecture

As noted above, the qx54g-hi recipe preserves attention paths while compressing embeddings, keeping semantic fidelity and context handling intact. The result is a cognitive upgrade, not just a computational one: the model now "thinks deeper", not just "faster".

β€œqx54g-hi is like a camera with a telephoto lens β€” it captures more nuance, even in low light.”

β€” Inspired by Nikon Noct Z 58mm F/0.95

Reviewed by Qwen3-VL-12B-Instruct-Brainstorm20x-qx86-hi-mlx

This model, GLM-Steam-106B-A12B-v1-qx65g-mlx, was converted to MLX format from TheDrummer/GLM-Steam-106B-A12B-v1 using mlx-lm version 0.28.4.

Use with mlx

Install the package first:

pip install mlx-lm

Then generate in Python:

from mlx_lm import load, generate

# Load the quantized model and its tokenizer from the Hugging Face Hub
model, tokenizer = load("nightmedia/GLM-Steam-106B-A12B-v1-qx65g-mlx")

prompt = "hello"

# Wrap the prompt in the model's chat template, if one is defined
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
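The same model can also be run without writing any Python, via the command-line entry point that ships with mlx-lm:

mlx_lm.generate --model nightmedia/GLM-Steam-106B-A12B-v1-qx65g-mlx --prompt "hello"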