DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF)

Science-calibrated 4‑bit GGUF of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, using an importance matrix computed on the MetaMathQA mathematical reasoning dataset.

This repository provides:

  • A Q4_K_M GGUF of a 70B reasoning model, suitable for llama.cpp and compatible runtimes.
  • The science-domain importance matrix and the MetaMathQA‑based calibration / evaluation text used to drive quantization decisions.
  • Simple perplexity and throughput benchmarks vs. the original BF16 GGUF and a generic (WikiText‑calibrated) Q4_K_M model.

Code & scripts: https://github.com/phangzs/science-calibrated-llm-quantization


1. Model Summary

  • Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B – a 70B Llama‑architecture reasoning model distilled from DeepSeek‑R1, with strong math, coding, and general reasoning performance and a 131k‑token context window.
  • Format: GGUF v3, Q4_K_M (4‑bit K‑quant, “Medium”).
  • Intended runtime: llama.cpp (CUDA), LM Studio, Ollama (after import), and other GGUF‑compatible engines.
  • Domain focus: Math / quantitative reasoning, especially GSM8K/MATH‑style problems with chain‑of‑thought.
  • Goal: Preserve BF16 perplexity on math‑heavy text while significantly reducing memory footprint and increasing throughput.

This model does not change the base model’s instruction format or tokenizer; it only changes the weight representation.


2. Files in this repository

Exact filenames may vary slightly, but this is the intended layout; a short download sketch for the main GGUF follows the list.

  • DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf
    → The science-calibrated Q4_K_M GGUF; this is the main model file you load in llama.cpp.

  • imatrix_science.dat
    → Legacy (.dat) importance matrix produced by llama-imatrix on MetaMathQA text. This is not needed at inference time, but is included for transparency and reproducibility.

  • science_calibration.txt
    → Plaintext calibration corpus used by llama-imatrix to build imatrix_science.dat. Each example is a math question + chain‑of‑thought + answer sampled from MetaMathQA.

  • eval_science.txt
    → Plaintext evaluation corpus (subset of MetaMathQA) used with llama-perplexity to measure perplexity for BF16 and Q4_K_M models.

  • results/ (optional)

    • ppl_bf16.txt
    • ppl_q4_generic.txt
    • ppl_q4_science.txt
    • bench_bf16.txt
    • bench_q4_generic.txt
    • bench_q4_science.txt
      → Raw logs from llama-perplexity and llama-bench.
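
For reference, the main GGUF can also be fetched programmatically. This is a minimal sketch using the huggingface_hub Python library, assuming the repository id and filename shown on this card; adjust both if they differ.

from huggingface_hub import hf_hub_download

# Download (or reuse from cache) the science-calibrated Q4_K_M GGUF.
model_path = hf_hub_download(
    repo_id="ErikFeng/DeepSeek-R1-Distill-Llama-70B-Science-Q4_K_M-GGUF",
    filename="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
)
print(model_path)  # local path to pass to llama.cpp via -m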

3. Quantization & Calibration Details

3.1 Base conversion (HF → BF16 GGUF)

  1. Downloaded the base model:

    huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
      --local-dir models/DeepSeek-R1-Distill-Llama-70B \
      --local-dir-use-symlinks False
    
  2. Converted to BF16 GGUF using llama.cpp’s conversion script:

    python3 convert_hf_to_gguf.py \
      --outtype bf16 \
      --outfile models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
      models/DeepSeek-R1-Distill-Llama-70B
    

The BF16 GGUF is used as the source of truth for both generic and science quantizations.


3.2 Calibration datasets

Generic calibration (for comparison):

  • Source: WikiText‑2 (raw) via the Salesforce/wikitext dataset (wikitext-2-raw-v1 configuration).

  • Used to compute a “generic” imatrix and a generic Q4_K_M model (hosted separately or referenced in this card).

Science calibration (this model):

  • Source: meta-math/MetaMathQA.

    • MetaMathQA is a large mathematical reasoning dataset derived from the GSM8K and MATH training sets, containing question–chain‑of‑thought–answer triples designed to teach math reasoning.
  • A random subset of MetaMathQA examples was formatted as:

    Question: ...
    Reasoning: ...
    Answer: ...
    

    and concatenated into science_calibration.txt. The emphasis is on long, step‑by‑step solutions, not just final answers.
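
A minimal sketch of this preprocessing, assuming the dataset exposes "query" and "response" columns and that responses end with a "The answer is: ..." line (the usual MetaMathQA convention); the subset size and output path are illustrative:

from datasets import load_dataset
import random

ds = load_dataset("meta-math/MetaMathQA", split="train")

random.seed(0)
indices = random.sample(range(len(ds)), k=2000)  # illustrative subset size

with open("data/science_calibration.txt", "w", encoding="utf-8") as f:
    for i in indices:
        example = ds[i]
        response = example["response"]
        # Split the chain-of-thought from the final answer when the marker is present.
        if "The answer is:" in response:
            reasoning, answer = response.rsplit("The answer is:", 1)
        else:
            reasoning, answer = response, ""
        f.write(f"Question: {example['query'].strip()}\n")
        f.write(f"Reasoning: {reasoning.strip()}\n")
        f.write(f"Answer: {answer.strip()}\n\n")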


3.3 Importance matrix (imatrix) computation

The importance matrix is computed with llama.cpp's llama-imatrix tool, which accumulates activation‑based importance statistics for each tensor during forward passes over the calibration text.

Command (run from the project root; paths are relative):

./llama.cpp/build/bin/llama-imatrix \
  -m models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
  -f data/science_calibration.txt \
  --chunk 512 \
  -ngl 40 \
  --save-frequency 50 \
  -o imatrix/imatrix_science.dat

Key settings:

  • --chunk 512
    → Process 512‑token chunks; balances VRAM use and throughput.

  • -ngl 40
    → Offload 40 transformer layers to GPU on an A100 80GB.

  • --save-frequency 50
    → Periodically saves partial imatrix data to avoid losing progress on long runs.

Note: On this model, llama-imatrix initially failed an “inf detected in blk.3.attn_q.weight” check and aborted. For this project, the check was modified to clamp non‑finite entries to 0 instead of aborting, which allowed the imatrix computation to complete. This does not affect inference; it only affects the imatrix statistics used during quantization.


3.4 Q4_K_M quantization

The science‑calibrated Q4_K_M GGUF was produced with:

./llama.cpp/build/bin/llama-quantize \
  --imatrix imatrix/imatrix_science.dat \
  models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
  models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
  Q4_K_M

A separate generic Q4_K_M model was produced similarly, but with the generic imatrix and calibration text (WikiText‑2).
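
As a quick sanity check on the quantized output, the file sizes can be compared and converted to approximate bits per weight (Q4_K_M typically lands near 4.8 bits/weight overall, versus 16 for BF16). A minimal sketch; the local paths and the ~70.6B parameter count are assumptions to adjust for your setup:

import os

paths = {
    "BF16": "models/DeepSeek-R1-Distill-Llama-70B-f16.gguf",
    "Q4_K_M (science)": "models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
}
n_params = 70.6e9  # approximate parameter count of the 70B base model

for name, path in paths.items():
    size_bytes = os.path.getsize(path)
    bits_per_weight = size_bytes * 8 / n_params
    print(f"{name}: {size_bytes / 1e9:.1f} GB (~{bits_per_weight:.2f} bits/weight)")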


4. Evaluation

Perplexity and throughput were measured using llama-perplexity and llama-bench from llama.cpp (commit ec7f3ac9a, build 4514).

4.1 Perplexity on math reasoning text

Evaluation corpus: eval_science.txt, a held‑out subset of the same MetaMathQA distribution (math word problems with chain‑of‑thought and answers).

Measured with:

./llama.cpp/build/bin/llama-perplexity \
  -m <MODEL> \
  -f data/eval_science.txt \
  --chunks -1 \
  -ngl 40

Results (lower is better):

Model                     | Type                | PPL (↓) | Std. error
BF16 baseline             | BF16 GGUF           | 9.74    | ± 0.29
Q4_K_M (generic imatrix)  | Q4_K_M + WikiText‑2 | 9.71    | ± 0.29
Q4_K_M (science imatrix)  | Q4_K_M + MetaMathQA | 9.77    | ± 0.29

Interpretation:

  • All three models are within about 0.6% of one another in perplexity and well inside the ±3% statistical uncertainty of this experiment.

  • In this Q4_K_M regime, domain‑specific imatrix calibration preserves math‑domain perplexity extremely well but does not yield a large, statistically clear improvement over generic calibration on this particular eval slice.
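
If you want to reproduce the comparison from the raw logs in results/, a small parser can extract the final estimates. This is a minimal sketch, assuming each log ends with a line of the form “Final estimate: PPL = 9.7400 +/- 0.2900” as printed by recent llama-perplexity builds, and that the filenames match the listing in section 2:

import re
from pathlib import Path

PPL_RE = re.compile(r"Final estimate:\s*PPL\s*=\s*([\d.]+)\s*\+/-\s*([\d.]+)")

def read_ppl(path):
    # Find the final perplexity estimate and its standard error in a log file.
    match = PPL_RE.search(Path(path).read_text())
    if match is None:
        raise ValueError(f"No final PPL estimate found in {path}")
    return float(match.group(1)), float(match.group(2))

logs = {
    "BF16": "results/ppl_bf16.txt",
    "Q4_K_M generic": "results/ppl_q4_generic.txt",
    "Q4_K_M science": "results/ppl_q4_science.txt",
}

baseline, _ = read_ppl(logs["BF16"])
for name, path in logs.items():
    ppl, err = read_ppl(path)
    delta = 100.0 * (ppl - baseline) / baseline
    print(f"{name:16s} PPL = {ppl:.2f} ± {err:.2f} ({delta:+.2f}% vs BF16)")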


4.2 Throughput on A100 80GB

Prompt‑processing throughput was measured during the perplexity runs (more representative than synthetic microbenchmarks):

  • Command: llama-perplexity ... -ngl 40

  • Context size: 512, batch size 2048, n_seq=4.

Approximate prompt‑eval speeds:

Model                     | Type                | Prompt eval (tokens/s, ↑)
BF16 baseline             | BF16 GGUF           | ~73
Q4_K_M (generic imatrix)  | Q4_K_M + WikiText‑2 | ~164
Q4_K_M (science imatrix)  | Q4_K_M + MetaMathQA | ~152

Interpretation:

  • Q4_K_M achieves roughly 2.1–2.2× faster prompt processing than BF16 on an A100 80GB with -ngl 40.

  • Science‑ and generic‑calibrated Q4_K_M models are similar in speed; minor differences are within normal run‑to‑run variance.


5. Usage

5.1 llama.cpp (CLI)

./llama.cpp/build/bin/llama-cli \
  -m DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
  -ngl 40 \
  -c 4096 \
  -n 256 \
  -p "Solve the following problem step by step.\n\nQuestion: A tank contains 120 liters of water. It is drained at a rate of 5 liters per minute. How long does it take to empty the tank?"

You can adjust:

  • -ngl depending on your GPU memory.

  • -c for context length.

  • -n for number of generated tokens.
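
If you prefer to call the model from Python instead of the CLI, the same GGUF loads with the llama-cpp-python bindings. A minimal sketch, assuming llama-cpp-python is installed (with GPU support if you want offloading) and the GGUF is in the working directory:

from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
    n_gpu_layers=40,  # mirrors -ngl 40; lower this if you run out of VRAM
    n_ctx=4096,       # mirrors -c 4096
)

prompt = (
    "Solve the following problem step by step.\n\n"
    "Question: A tank contains 120 liters of water. It is drained at a rate of "
    "5 liters per minute. How long does it take to empty the tank?"
)

output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])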

5.2 Other runtimes

This is a plain GGUF model; it should work in:

  • LM Studio

  • Ollama (after import)

  • Other GGUF‑aware serving frameworks

as long as they support Q4_K_M and the Llama 70B architecture.


6. Intended Use & Limitations

  • Intended use:

    • Local experimentation with 70B‑scale reasoning on math / quantitative tasks.

    • Research on quantization, importance matrices, and math reasoning.

    • Educational demos of domain‑calibrated quantization.

  • Not intended for:

    • Safety‑critical systems.

    • Deployment without additional filtering / moderation.

    • Use cases where correctness of every answer is guaranteed.

Safety & behavior

DeepSeek‑R1‑family models are known to be strong reasoners but are not strongly safety‑tuned by default, and they can be vulnerable to jailbreak / prompt‑injection attacks.

This quantized variant inherits those behaviors. If you deploy it in user‑facing settings, you should add your own safety layers (prompting, filtering, or external moderation).


7. Acknowledgements

  • Base model: DeepSeek team for DeepSeek-R1-Distill-Llama-70B.

  • Math dataset: MetaMath team for the MetaMathQA dataset and associated paper.

  • Tooling: llama.cpp contributors for the GGUF format, quantization, and imatrix tooling.


8. Citation

If you use this model in academic work, please consider citing:

  • MetaMath / MetaMathQA (for the dataset).

  • DeepSeek‑R1 distillation work (for the base model).

You may additionally cite this model card as:

“DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF), Hugging Face model card, 2025.”
