DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF)

Science-calibrated 4‑bit GGUF of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, using an importance matrix computed on the MetaMathQA mathematical reasoning dataset.

This repository provides:

  • A Q4_K_M GGUF of a 70B reasoning model, suitable for llama.cpp and compatible runtimes.
  • The science-domain importance matrix and the MetaMathQA‑based calibration / evaluation text used to drive quantization decisions.
  • Simple perplexity and throughput benchmarks vs. the original BF16 GGUF and a generic (WikiText‑calibrated) Q4_K_M model.

Code & scripts: https://github.com/phangzs/science-calibrated-llm-quantization


1. Model Summary

  • Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B – a 70B Llama‑architecture reasoning model distilled from DeepSeek‑R1, with strong math, coding, and general reasoning performance and a 131k‑token context window.
  • Format: GGUF v3, Q4_K_M (4‑bit K‑quant, “Medium”).
  • Intended runtime: llama.cpp (CUDA), LM Studio, Ollama (after import), and other GGUF‑compatible engines.
  • Domain focus: Math / quantitative reasoning, especially GSM8K/MATH‑style problems with chain‑of‑thought.
  • Goal: Preserve BF16 perplexity on math‑heavy text while significantly reducing memory footprint and increasing throughput.

This model does not change the base model’s instruction format or tokenizer; it only changes the weight representation.


2. Files in this repository

Exact filenames may vary slightly, but this is the intended layout; a short download sketch for the main GGUF follows the list.

  • DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf
    → The science-calibrated Q4_K_M GGUF; this is the main model file you load in llama.cpp.

  • imatrix_science.dat
    → Legacy (.dat) importance matrix produced by llama-imatrix on MetaMathQA text. This is not needed at inference time, but is included for transparency and reproducibility.

  • science_calibration.txt
    → Plaintext calibration corpus used by llama-imatrix to build imatrix_science.dat. Each example is a math question + chain‑of‑thought + answer sampled from MetaMathQA.

  • eval_science.txt
    → Plaintext evaluation corpus (subset of MetaMathQA) used with llama-perplexity to measure perplexity for BF16 and Q4_K_M models.

  • results/ (optional)

    • ppl_bf16.txt
    • ppl_q4_generic.txt
    • ppl_q4_science.txt
    • bench_bf16.txt
    • bench_q4_generic.txt
    • bench_q4_science.txt
      → Raw logs from llama-perplexity and llama-bench.
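
For reference, the main GGUF can also be fetched programmatically. This is a minimal sketch using the huggingface_hub Python library, assuming the repository id and filename shown on this card; adjust both if they differ.

from huggingface_hub import hf_hub_download

# Download (or reuse from cache) the science-calibrated Q4_K_M GGUF.
model_path = hf_hub_download(
    repo_id="ErikFeng/DeepSeek-R1-Distill-Llama-70B-Science-Q4_K_M-GGUF",
    filename="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
)
print(model_path)  # local path to pass to llama.cpp via -m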

3. Quantization & Calibration Details

3.1 Base conversion (HF → BF16 GGUF)

  1. Downloaded the base model:

    huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
      --local-dir models/DeepSeek-R1-Distill-Llama-70B \
      --local-dir-use-symlinks False
    
  2. Converted to BF16 GGUF using llama.cpp’s conversion script:

    python3 convert_hf_to_gguf.py \
      --outtype bf16 \
      --outfile models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
      models/DeepSeek-R1-Distill-Llama-70B
    

The BF16 GGUF is used as the source of truth for both generic and science quantizations.


3.2 Calibration datasets

Generic calibration (for comparison):

  • Source: WikiText‑2 (raw) via the Salesforce/wikitext dataset (wikitext-2-raw-v1 configuration).

  • Used to compute a “generic” imatrix and a generic Q4_K_M model (hosted separately or referenced in this card).

Science calibration (this model):

  • Source: meta-math/MetaMathQA.

    • MetaMathQA is a large mathematical reasoning dataset derived from the GSM8K and MATH training sets, containing question–chain‑of‑thought–answer triples designed to teach math reasoning.
  • A random subset of MetaMathQA examples was formatted as:

    Question: ...
    Reasoning: ...
    Answer: ...
    

    and concatenated into science_calibration.txt. The emphasis is on long, step‑by‑step solutions, not just final answers.
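
A minimal sketch of this preprocessing, assuming the dataset exposes "query" and "response" columns and that responses end with a "The answer is: ..." line (the usual MetaMathQA convention); the subset size and output path are illustrative:

from datasets import load_dataset
import random

ds = load_dataset("meta-math/MetaMathQA", split="train")

random.seed(0)
indices = random.sample(range(len(ds)), k=2000)  # illustrative subset size

with open("data/science_calibration.txt", "w", encoding="utf-8") as f:
    for i in indices:
        example = ds[i]
        response = example["response"]
        # Split the chain-of-thought from the final answer when the marker is present.
        if "The answer is:" in response:
            reasoning, answer = response.rsplit("The answer is:", 1)
        else:
            reasoning, answer = response, ""
        f.write(f"Question: {example['query'].strip()}\n")
        f.write(f"Reasoning: {reasoning.strip()}\n")
        f.write(f"Answer: {answer.strip()}\n\n")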


3.3 Importance matrix (imatrix) computation

The importance matrix is computed with llama.cpp's llama-imatrix tool, which accumulates activation‑based importance statistics for each tensor during forward passes over the calibration text.

Command (run from the project root; paths are relative):

./llama.cpp/build/bin/llama-imatrix \
  -m models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
  -f data/science_calibration.txt \
  --chunk 512 \
  -ngl 40 \
  --save-frequency 50 \
  -o imatrix/imatrix_science.dat

Key settings:

  • --chunk 512
    → Process 512‑token chunks; balances VRAM use and throughput.

  • -ngl 40
    → Offload 40 transformer layers to GPU on an A100 80GB.

  • --save-frequency 50
    → Periodically saves partial imatrix data to avoid losing progress on long runs.

Note: On this model, llama-imatrix initially failed an “inf detected in blk.3.attn_q.weight” check and aborted. For this project, the check was modified to clamp non‑finite entries to 0 instead of aborting, which allowed the imatrix computation to complete. This does not affect inference; it only affects the imatrix statistics used during quantization.


3.4 Q4_K_M quantization

The science‑calibrated Q4_K_M GGUF was produced with:

./llama.cpp/build/bin/llama-quantize \
  --imatrix imatrix/imatrix_science.dat \
  models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
  models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
  Q4_K_M

A separate generic Q4_K_M model was produced similarly, but with the generic imatrix and calibration text (WikiText‑2).
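
As a quick sanity check on the quantized output, the file sizes can be compared and converted to approximate bits per weight (Q4_K_M typically lands near 4.8 bits/weight overall, versus 16 for BF16). A minimal sketch; the local paths and the ~70.6B parameter count are assumptions to adjust for your setup:

import os

paths = {
    "BF16": "models/DeepSeek-R1-Distill-Llama-70B-f16.gguf",
    "Q4_K_M (science)": "models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
}
n_params = 70.6e9  # approximate parameter count of the 70B base model

for name, path in paths.items():
    size_bytes = os.path.getsize(path)
    bits_per_weight = size_bytes * 8 / n_params
    print(f"{name}: {size_bytes / 1e9:.1f} GB (~{bits_per_weight:.2f} bits/weight)")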


4. Evaluation

Perplexity and throughput were measured using llama-perplexity and llama-bench from llama.cpp (commit ec7f3ac9a, build 4514).

4.1 Perplexity on math reasoning text

Evaluation corpus: eval_science.txt, a held‑out subset of the same MetaMathQA distribution (math word problems with chain‑of‑thought and answers).

Measured with:

./llama.cpp/build/bin/llama-perplexity \
  -m <MODEL> \
  -f data/eval_science.txt \
  --chunks -1 \
  -ngl 40

Results (lower is better):

Model                     | Type                | PPL (↓) | Std. error
BF16 baseline             | BF16 GGUF           | 9.74    | ± 0.29
Q4_K_M (generic imatrix)  | Q4_K_M + WikiText‑2 | 9.71    | ± 0.29
Q4_K_M (science imatrix)  | Q4_K_M + MetaMathQA | 9.77    | ± 0.29

Interpretation:

  • All three models are within about 0.6% of one another in perplexity and well inside the ±3% statistical uncertainty of this experiment.

  • In this Q4_K_M regime, domain‑specific imatrix calibration preserves math‑domain perplexity extremely well but does not yield a large, statistically clear improvement over generic calibration on this particular eval slice.
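
If you want to reproduce the comparison from the raw logs in results/, a small parser can extract the final estimates. This is a minimal sketch, assuming each log ends with a line of the form “Final estimate: PPL = 9.7400 +/- 0.2900” as printed by recent llama-perplexity builds, and that the filenames match the listing in section 2:

import re
from pathlib import Path

PPL_RE = re.compile(r"Final estimate:\s*PPL\s*=\s*([\d.]+)\s*\+/-\s*([\d.]+)")

def read_ppl(path):
    # Find the final perplexity estimate and its standard error in a log file.
    match = PPL_RE.search(Path(path).read_text())
    if match is None:
        raise ValueError(f"No final PPL estimate found in {path}")
    return float(match.group(1)), float(match.group(2))

logs = {
    "BF16": "results/ppl_bf16.txt",
    "Q4_K_M generic": "results/ppl_q4_generic.txt",
    "Q4_K_M science": "results/ppl_q4_science.txt",
}

baseline, _ = read_ppl(logs["BF16"])
for name, path in logs.items():
    ppl, err = read_ppl(path)
    delta = 100.0 * (ppl - baseline) / baseline
    print(f"{name:16s} PPL = {ppl:.2f} ± {err:.2f} ({delta:+.2f}% vs BF16)")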


4.2 Throughput on A100 80GB

Prompt‑processing throughput was measured during the perplexity runs (more representative than synthetic microbenchmarks):

  • Command: llama-perplexity ... -ngl 40

  • Context size: 512, batch size 2048, n_seq=4.

Approximate prompt‑eval speeds:

Model                     | Type                | Prompt eval (tokens/s, ↑)
BF16 baseline             | BF16 GGUF           | ~73
Q4_K_M (generic imatrix)  | Q4_K_M + WikiText‑2 | ~164
Q4_K_M (science imatrix)  | Q4_K_M + MetaMathQA | ~152

Interpretation:

  • Q4_K_M achieves roughly 2.1–2.2× faster prompt processing than BF16 on an A100 80GB with -ngl 40.

  • Science‑ and generic‑calibrated Q4_K_M models are similar in speed; minor differences are within normal run‑to‑run variance.


5. Usage

5.1 llama.cpp (CLI)

./llama.cpp/build/bin/llama-cli \
  -m DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
  -ngl 40 \
  -c 4096 \
  -n 256 \
  -p "Solve the following problem step by step.\n\nQuestion: A tank contains 120 liters of water. It is drained at a rate of 5 liters per minute. How long does it take to empty the tank?"

You can adjust:

  • -ngl depending on your GPU memory.

  • -c for context length.

  • -n for number of generated tokens.
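
If you prefer to call the model from Python instead of the CLI, the same GGUF loads with the llama-cpp-python bindings. A minimal sketch, assuming llama-cpp-python is installed (with GPU support if you want offloading) and the GGUF is in the working directory:

from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
    n_gpu_layers=40,  # mirrors -ngl 40; lower this if you run out of VRAM
    n_ctx=4096,       # mirrors -c 4096
)

prompt = (
    "Solve the following problem step by step.\n\n"
    "Question: A tank contains 120 liters of water. It is drained at a rate of "
    "5 liters per minute. How long does it take to empty the tank?"
)

output = llm(prompt, max_tokens=256)
print(output["choices"][0]["text"])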

5.2 Other runtimes

This is a plain GGUF model; it should work in:

  • LM Studio

  • Ollama (after import)

  • Other GGUF‑aware serving frameworks

as long as they support Q4_K_M and the Llama 70B architecture.


6. Intended Use & Limitations

  • Intended use:

    • Local experimentation with 70B‑scale reasoning on math / quantitative tasks.

    • Research on quantization, importance matrices, and math reasoning.

    • Educational demos of domain‑calibrated quantization.

  • Not intended for:

    • Safety‑critical systems.

    • Deployment without additional filtering / moderation.

    • Use cases where correctness of every answer is guaranteed.

Safety & behavior

DeepSeek‑R1‑family models are known to be strong reasoners but are not strongly safety‑tuned by default, and they can be vulnerable to jailbreak / prompt‑injection attacks.

This quantized variant inherits those behaviors. If you deploy it in user‑facing settings, you should add your own safety layers (prompting, filtering, or external moderation).


7. Acknowledgements

  • Base model: DeepSeek team for DeepSeek-R1-Distill-Llama-70B.

  • Math dataset: MetaMath team for the MetaMathQA dataset and associated paper.

  • Tooling: llama.cpp contributors for the GGUF format, quantization, and imatrix tooling.


8. Citation

If you use this model in academic work, please consider citing:

  • MetaMath / MetaMathQA (for the dataset).

  • DeepSeek‑R1 distillation work (for the base model).

You may additionally cite this model card as:

“DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF), Hugging Face model card, 2025.”
