DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF)
Science-calibrated 4‑bit GGUF of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, using an importance matrix computed on the MetaMathQA mathematical reasoning dataset.
This repository provides:
- A Q4_K_M GGUF of a 70B reasoning model, suitable for llama.cpp and compatible runtimes.
- The science-domain importance matrix and the MetaMathQA-based calibration / evaluation text used to drive quantization decisions.
- Simple perplexity and throughput benchmarks vs. the original BF16 GGUF and a generic (WikiText-calibrated) Q4_K_M model.
Code & scripts: https://github.com/phangzs/science-calibrated-llm-quantization
1. Model Summary
- Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B, a 70B Llama-architecture reasoning model distilled from DeepSeek-R1, with strong math, coding, and general reasoning performance and a 131k-token context window.
- Format: GGUF v3, Q4_K_M (4-bit K-quant, "Medium").
- Intended runtime: llama.cpp (CUDA), LM Studio, Ollama (after import), and other GGUF-compatible engines.
- Domain focus: math / quantitative reasoning, especially GSM8K/MATH-style problems with chain-of-thought.
- Goal: preserve BF16 perplexity on math-heavy text while significantly reducing memory footprint and increasing throughput.
This model does not change the base model’s instruction format or tokenizer; it only changes the weight representation.
2. Files in this repository
Exact filenames may differ slightly, but this is the intended structure.
- DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf
  → The science-calibrated Q4_K_M GGUF; this is the main model file you load in llama.cpp.
- imatrix_science.dat
  → Legacy (.dat) importance matrix produced by llama-imatrix on MetaMathQA text. This is not needed at inference time, but is included for transparency and reproducibility.
- science_calibration.txt
  → Plaintext calibration corpus used by llama-imatrix to build imatrix_science.dat. Each line is a math question + chain-of-thought + answer sampled from MetaMathQA.
- eval_science.txt
  → Plaintext evaluation corpus (subset of MetaMathQA) used with llama-perplexity to measure perplexity for BF16 and Q4_K_M models.
- results/ (optional): ppl_bf16.txt, ppl_q4_generic.txt, ppl_q4_science.txt, bench_bf16.txt, bench_q4_generic.txt, bench_q4_science.txt
  → Raw logs from llama-perplexity and llama-bench.
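If you only need individual files rather than a full clone, a small huggingface_hub sketch like the one below works; the repo id and filenames mirror this card and should be adjusted if they differ.

```python
# Fetch the main GGUF plus the calibration text without cloning the whole repo.
# The repo id and filenames mirror this model card; adjust if they differ.
from huggingface_hub import hf_hub_download

repo_id = "ErikFeng/DeepSeek-R1-Distill-Llama-70B-Science-Q4_K_M-GGUF"

model_path = hf_hub_download(
    repo_id=repo_id,
    filename="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
)
calib_path = hf_hub_download(repo_id=repo_id, filename="science_calibration.txt")
print(model_path, calib_path)
```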
3. Quantization & Calibration Details
3.1 Base conversion (HF → BF16 GGUF)
Downloaded the base model:

huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --local-dir models/DeepSeek-R1-Distill-Llama-70B \
  --local-dir-use-symlinks False

Converted to BF16 GGUF using llama.cpp's conversion script:

python3 convert_hf_to_gguf.py \
  --outtype bf16 \
  --outfile models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
  models/DeepSeek-R1-Distill-Llama-70B
The BF16 GGUF is used as the source of truth for both generic and science quantizations.
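As an optional sanity check before quantizing, the converted file can be inspected with the gguf Python package that ships with llama.cpp (gguf-py); the attribute names sketched below are assumptions and may vary between llama.cpp versions.

```python
# Optional sanity check on the BF16 conversion, using the gguf Python package
# from llama.cpp (gguf-py). Attribute names here are assumptions and may vary
# between llama.cpp versions.
from gguf import GGUFReader

reader = GGUFReader("models/DeepSeek-R1-Distill-Llama-70B-f16.gguf")

print(f"{len(reader.tensors)} tensors, {len(reader.fields)} metadata fields")
for key in list(reader.fields)[:8]:   # a few metadata keys (architecture, context length, ...)
    print("field:", key)
for t in reader.tensors[:4]:          # a few tensor records with their dtypes
    print("tensor:", t.name, t.shape, t.tensor_type)
```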
3.2 Calibration datasets
Generic calibration (for comparison):
- Source: WikiText-2 (raw) via Salesforce/wikitext (wikitext-2-raw-v1 split). (Hugging Face)
- Used to compute a "generic" imatrix and a generic Q4_K_M model (hosted separately or referenced in this card).

Science calibration (this model):
- Source: meta-math/MetaMathQA. (Hugging Face)
- MetaMathQA is a large mathematical reasoning dataset derived from the GSM8K and MATH training sets, containing question–chain-of-thought–answer triples designed to teach math reasoning. (arXiv)
- A random subset of MetaMathQA examples was formatted as "Question: ... Reasoning: ... Answer: ..." and concatenated into science_calibration.txt. The emphasis is on long, step-by-step solutions, not just final answers.
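A minimal sketch of how a file like science_calibration.txt can be assembled is shown below; the "query" / "response" field names and the subset size are assumptions about the MetaMathQA schema, so check the dataset card before reusing it.

```python
# Sketch of how science_calibration.txt can be assembled from MetaMathQA.
# The "query" / "response" field names and the 2000-example subset size are
# assumptions; check the dataset card and adjust as needed.
from datasets import load_dataset

ds = load_dataset("meta-math/MetaMathQA", split="train")
subset = ds.shuffle(seed=42).select(range(2000))   # random calibration subset

with open("data/science_calibration.txt", "w", encoding="utf-8") as f:
    for row in subset:
        # The response already contains the chain-of-thought and the final
        # answer, so it is written under a single "Reasoning:" label here.
        f.write(f"Question: {row['query']}\n")
        f.write(f"Reasoning: {row['response']}\n\n")
```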
3.3 Importance matrix (imatrix) computation
The importance matrix is computed using llama.cpp’s llama-imatrix example, which estimates per‑weight importance via forward passes over calibration text.
Command (run from the project root; paths are relative):
./llama.cpp/build/bin/llama-imatrix \
-m models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
-f data/science_calibration.txt \
--chunk 512 \
-ngl 40 \
--save-frequency 50 \
-o imatrix/imatrix_science.dat
Key settings:
- --chunk 512 → process 512-token chunks; balances VRAM use and throughput.
- -ngl 40 → offload 40 transformer layers to the GPU on an A100 80GB.
- --save-frequency 50 → periodically save partial imatrix data to avoid losing progress on long runs.
Note: On this model, llama-imatrix initially hit an "inf detected in blk.3.attn_q.weight" check. For this project, the check was modified to clamp non-finite entries to 0 instead of aborting, allowing imatrix computation to complete. This does not affect inference; it only affects the imatrix statistics used during quantization.
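The actual change lives in llama-imatrix's C++ check; the snippet below is only an illustration of the clamping rule in NumPy terms, not the patch itself.

```python
# Illustration only: the real fix is in llama-imatrix's C++ code. The rule it
# applies is simply "replace non-finite statistics with 0 instead of aborting".
import numpy as np

def clamp_non_finite(values: np.ndarray) -> np.ndarray:
    """Return a copy with inf/NaN entries set to 0; finite entries are untouched."""
    return np.where(np.isfinite(values), values, 0.0)

stats = np.array([0.12, np.inf, 0.5, np.nan])
print(clamp_non_finite(stats))   # [0.12 0.   0.5  0.  ]
```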
3.4 Q4_K_M quantization
The science‑calibrated Q4_K_M GGUF was produced with:
./llama.cpp/build/bin/llama-quantize \
--imatrix imatrix/imatrix_science.dat \
models/DeepSeek-R1-Distill-Llama-70B-f16.gguf \
models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
Q4_K_M
A separate generic Q4_K_M model was produced similarly, but with the generic imatrix and calibration text (WikiText‑2).
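One way to keep the two variants in sync is to drive llama-quantize from a small script; the sketch below mirrors the commands above, and the generic imatrix filename is a hypothetical placeholder.

```python
# Sketch: produce both Q4_K_M variants from the same BF16 source, one per
# imatrix. Paths mirror the commands above; "imatrix_wikitext.dat" is a
# hypothetical name for the generic imatrix file.
import subprocess

BF16 = "models/DeepSeek-R1-Distill-Llama-70B-f16.gguf"
IMATRICES = {
    "science": "imatrix/imatrix_science.dat",
    "generic": "imatrix/imatrix_wikitext.dat",   # hypothetical filename
}

for tag, imatrix in IMATRICES.items():
    out = f"models/DeepSeek-R1-Distill-Llama-70B-Q4_K_M-{tag}.gguf"
    subprocess.run(
        ["./llama.cpp/build/bin/llama-quantize",
         "--imatrix", imatrix,
         BF16, out, "Q4_K_M"],
        check=True,
    )
```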
4. Evaluation
Perplexity and throughput were measured using llama-perplexity and llama-bench from llama.cpp (commit ec7f3ac9a, build 4514).
4.1 Perplexity on math reasoning text
Evaluation corpus: eval_science.txt, a held‑out subset of the same MetaMathQA distribution (math word problems with chain‑of‑thought and answers).
Measured with:
./llama.cpp/build/bin/llama-perplexity \
-m <MODEL> \
-f data/eval_science.txt \
--chunks -1 \
-ngl 40
Results (lower is better):
| Model | Type | PPL (↓) | Std. error |
|---|---|---|---|
| BF16 baseline | BF16 GGUF | 9.74 | ± 0.29 |
| Q4_K_M (generic imatrix) | Q4_K_M + Wiki | 9.71 | ± 0.29 |
| Q4_K_M (science imatrix) | Q4_K_M + MetaMathQA | 9.77 | ± 0.29 |
Interpretation:
All three models are within ~0.5% of each other and well inside the ±3% statistical uncertainty of this experiment.
In this Q4_K_M regime, domain‑specific imatrix calibration preserves math‑domain perplexity extremely well but does not yield a large, statistically clear improvement over generic calibration on this particular eval slice.
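For reference, the worked numbers behind that interpretation, computed directly from the table:

```python
# Worked numbers behind the interpretation above: relative PPL change vs. the
# BF16 baseline, and the relative size of the reported standard error.
ppl = {"bf16": 9.74, "q4_generic": 9.71, "q4_science": 9.77}
std_err = 0.29

for name in ("q4_generic", "q4_science"):
    delta = 100 * (ppl[name] - ppl["bf16"]) / ppl["bf16"]
    print(f"{name}: {delta:+.2f}% vs BF16")
print(f"std. error: ±{100 * std_err / ppl['bf16']:.1f}% of baseline PPL")

# q4_generic: -0.31% vs BF16
# q4_science: +0.31% vs BF16
# std. error: ±3.0% of baseline PPL
```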
4.2 Throughput on A100 80GB
Prompt‑processing throughput was measured during the perplexity runs (more representative than synthetic microbenchmarks):
Command: llama-perplexity ... -ngl 40
Context size: 512, batch size: 2048, n_seq = 4.
Approximate prompt‑eval speeds:
| Model | Type | Tokens/sec (↑) |
|---|---|---|
| BF16 baseline | BF16 GGUF | ~73 t/s |
| Q4_K_M (generic imatrix) | Q4_K_M + Wiki | ~164 t/s |
| Q4_K_M (science imatrix) | Q4_K_M + MetaMathQA | ~152 t/s |
Interpretation:
Q4_K_M achieves roughly 2× faster prompt processing than BF16 on an A100 80GB with -ngl 40.
Science- and generic-calibrated Q4_K_M models are similar in speed; minor differences are within normal run-to-run variance.
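The "roughly 2×" figure comes straight from the throughput table; the quick calculation below makes it explicit.

```python
# Speedups implied by the throughput table above (prompt eval, A100 80GB).
tps = {"bf16": 73, "q4_generic": 164, "q4_science": 152}

for name in ("q4_generic", "q4_science"):
    print(f"{name}: {tps[name] / tps['bf16']:.1f}x faster than BF16")

# q4_generic: 2.2x faster than BF16
# q4_science: 2.1x faster than BF16
```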
5. Usage
5.1 llama.cpp (CLI)
./llama.cpp/build/bin/llama-cli \
-m DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf \
-ngl 40 \
-c 4096 \
-n 256 \
-p "Solve the following problem step by step.\n\nQuestion: A tank contains 120 liters of water. It is drained at a rate of 5 liters per minute. How long does it take to empty the tank?"
You can adjust:
- -ngl depending on your GPU memory.
- -c for context length.
- -n for the number of generated tokens.
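If you would rather call the model from Python, a minimal sketch with the llama-cpp-python bindings (an alternative not covered above; treat the parameter choices as assumptions) looks like this:

```python
# Minimal sketch with the llama-cpp-python bindings (an alternative to llama-cli;
# not covered above). Parameter names follow that library.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-70B-Q4_K_M-science.gguf",
    n_gpu_layers=40,   # analogous to -ngl 40
    n_ctx=4096,        # analogous to -c 4096
)

prompt = (
    "Solve the following problem step by step.\n\n"
    "Question: A tank contains 120 liters of water. It is drained at a rate of "
    "5 liters per minute. How long does it take to empty the tank?"
)
out = llm(prompt, max_tokens=256)
print(out["choices"][0]["text"])
```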
5.2 Other runtimes
This is a plain GGUF model; it should work in:
- LM Studio
- Ollama (after import)
- other GGUF-aware serving frameworks
as long as they support Q4_K_M and the Llama 70B architecture.
6. Intended Use & Limitations
Intended use:
- Local experimentation with 70B-scale reasoning on math / quantitative tasks.
- Research on quantization, importance matrices, and math reasoning.
- Educational demos of domain-calibrated quantization.
Not intended for:
- Safety-critical systems.
- Deployment without additional filtering / moderation.
- Use cases where correctness of every answer must be guaranteed.
Safety & behavior
DeepSeek‑R1‑family models are known to be strong reasoners but not strongly safety‑tuned by default, and they can be vulnerable to jailbreak / prompt‑injection attacks. (WIRED)
This quantized variant inherits those behaviors. If you deploy it in user‑facing settings, you should add your own safety layers (prompting, filtering, or external moderation).
7. Acknowledgements
- Base model: the DeepSeek team for DeepSeek-R1-Distill-Llama-70B. (Hugging Face)
- Math dataset: the MetaMath team for the MetaMathQA dataset and associated paper. (arXiv)
- Tooling: llama.cpp contributors for the GGUF format, quantization, and imatrix tooling.
8. Citation
If you use this model in academic work, please consider citing:
- MetaMath / MetaMathQA (for the dataset). (arXiv)
- DeepSeek-R1 distillation work (for the base model). (Hugging Face)
You may additionally cite this model card as:
“DeepSeek-R1-Distill-Llama-70B – Science-Calibrated Q4_K_M (GGUF), Hugging Face model card, 2025.”