Instructions to use deepseek-ai/DeepSeek-V3 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use deepseek-ai/DeepSeek-V3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-V3", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use deepseek-ai/DeepSeek-V3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-V3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
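
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `ask` are hypothetical, chosen for illustration. The response is parsed according to the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="deepseek-ai/DeepSeek-V3"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(ask("What is the capital of France?"))
```

For the SGLang server shown further down, passing `base_url="http://localhost:30000"` should be the only change needed.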

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-V3

SGLang

How to use deepseek-ai/DeepSeek-V3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-V3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-V3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use deepseek-ai/DeepSeek-V3 with Docker Model Runner:

```
docker model run hf.co/deepseek-ai/DeepSeek-V3
```

minimum vram?

by CHNtentes - opened Dec 27, 2024

Discussion

CHNtentes

Dec 27, 2024

Not very familiar with MoE models. Does it require 685GB or 37GB of VRAM?

shuimu1337

Dec 27, 2024

You need 10 x A100.

surak

Dec 27, 2024

@CHNtentes it needs about 1 TB of VRAM
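
The figures above can be sanity-checked with a rough calculation. In a mixture-of-experts model every expert has to be resident in memory, because routing can select different experts for each token, so the total parameter count sets the memory footprint while the roughly 37B parameters activated per token mainly affect compute. The sketch below counts weights only (no KV cache, activations, or framework overhead) and assumes 80 GB GPUs with 90% of memory usable for weights; both of those numbers are illustrative assumptions.

```python
import math

TOTAL_PARAMS = 671e9   # total parameters (the 671B figure cited in this thread)
ACTIVE_PARAMS = 37e9   # parameters activated per token


def weight_gb(params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes) at the given precision."""
    return params * bits_per_weight / 8 / 1e9


def gpus_needed(total_gb, gpu_gb=80, usable_fraction=0.9):
    """GPUs required to hold the weights; usable_fraction is an assumption."""
    return math.ceil(total_gb / (gpu_gb * usable_fraction))


for name, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    gb = weight_gb(TOTAL_PARAMS, bits)
    print(f"{name:>5}: {gb:7.1f} GB of weights, at least {gpus_needed(gb)} x 80 GB GPUs")
```

Under these assumptions, 8-bit weights alone come to 671 GB, which is consistent with the ten-GPU estimate given earlier once some headroom is left on each card.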

DeFactOfficial

Dec 27, 2024

What if you have a single GPU with 48GB of VRAM and 1TB of ordinary system RAM? Someone told me that it's possible to separate the layers so that only the active expert (37GB if using a Q8) is in VRAM at any given time, and the rest is in system RAM...

I have no doubt this is possible to do, but would the performance be even close to usable?

bullerwins

Dec 28, 2024

> What if you have a single GPU with 48GB of VRAM and 1TB of ordinary system RAM? Someone told me that it's possible to separate the layers so that only the active expert (37GB if using a Q8) is in VRAM at any given time, and the rest is in system RAM...
>
> I have no doubt this is possible to do, but would the performance be even close to usable?

you could try with vLLM as it has CPU offloading with
--cpu-offload-gb 900
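
As a rough illustration of how that flag might be sized, the sketch below estimates how many GB of weights would not fit on a single 48GB GPU and builds the matching `vllm serve` command line. The helper names are hypothetical, the 90% usable-VRAM figure is an assumption, and the 671 GB weight size assumes 8-bit weights for a 671B-parameter model. The vLLM documentation describes `--cpu-offload-gb` as a per-GPU amount. This only constructs the command; it does not show that the configuration actually loads or runs at a usable speed.

```python
import math
import shlex


def offload_needed_gb(weights_gb, vram_gb, usable_fraction=0.9):
    """GB of weights that do not fit in VRAM (usable_fraction is an assumption)."""
    return max(0, math.ceil(weights_gb - vram_gb * usable_fraction))


def vllm_serve_command(model, cpu_offload_gb, tensor_parallel_size=1):
    """Build a `vllm serve` command line with CPU offloading enabled."""
    args = [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--cpu-offload-gb", str(cpu_offload_gb),
    ]
    return shlex.join(args)


# 671 GB of 8-bit weights on one 48 GB GPU (illustrative numbers).
offload = offload_needed_gb(weights_gb=671, vram_gb=48)
print(vllm_serve_command("deepseek-ai/DeepSeek-V3", cpu_offload_gb=offload))
```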

kingo55

Dec 31, 2024

Is it feasible to run this on only 160GB of VRAM with the right quantization?

bullerwins

Dec 31, 2024

> Is it feasible to run this on only 160GB of VRAM with the right quantization?

I mean, anything can theoretically be run anywhere if you quantize it enough. At least 4bpw/Q4 is usually considered the minimum to retain good quality. So for DeepSeek-V3 that would equal around 380GB of VRAM (with a small context size). Once/if llama.cpp/GGUF is compatible, we can offload some layers to CPU RAM; being a MoE, it has the benefit of still maintaining decent speed even while in RAM.

So I would say a total of 400GB of VRAM+RAM would be necessary; the higher the proportion of VRAM, the better.
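
The point about a MoE keeping decent speed in RAM can be made concrete with a bandwidth-bound estimate: during decoding, each token has to read roughly the active weights once, so tokens per second is at most memory bandwidth divided by the bytes read per token. The sketch below is a simplified upper bound under that assumption; it ignores KV cache reads, compute time, and transfers between RAM and GPU, and the bandwidth figures are illustrative placeholders, not measurements.

```python
TOTAL_PARAMS = 671e9   # total parameters
ACTIVE_PARAMS = 37e9   # parameters activated per token


def decode_tokens_per_second(params_read, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s when decoding is limited by reading weights."""
    bytes_per_token = params_read * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Bandwidth values are assumptions chosen only to show the scaling.
for label, bandwidth in [("system RAM, ~100 GB/s", 100), ("GPU memory, ~2000 GB/s", 2000)]:
    moe = decode_tokens_per_second(ACTIVE_PARAMS, 4, bandwidth)
    dense = decode_tokens_per_second(TOTAL_PARAMS, 4, bandwidth)
    print(f"{label}: MoE <= {moe:.1f} tok/s, "
          f"dense model of the same total size <= {dense:.2f} tok/s")
```

Whatever the bandwidth, the MoE bound is about 18 times higher (671 / 37) than that of a dense model of the same total size, which is why partial offloading to RAM hurts less here than it would for a dense model.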

NikolaSigmoid

Jan 3, 2025 (edited Jan 3, 2025)

> > What if you have a single GPU with 48GB of VRAM and 1TB of ordinary system RAM? Someone told me that it's possible to separate the layers so that only the active expert (37GB if using a Q8) is in VRAM at any given time, and the rest is in system RAM...
> >
> > I have no doubt this is possible to do, but would the performance be even close to usable?
>
> you could try with vLLM as it has CPU offloading with
> --cpu-offload-gb 900

Tried it, but it does not work!

abdurahmanshiine

Jan 5, 2025

Wait, so this can't be run locally on a regular consumer gpu?

melroy89

Jan 8, 2025 (edited Jan 8, 2025)

> Wait, so this can't be run locally on a regular consumer gpu?

I was thinking the same thing. I have just one (pretty decent) GPU. We will see, I guess. Maybe use one of the GGUF quantized versions: https://huggingface.co/bullerwins/DeepSeek-V3-GGUF

But in general I'm afraid this will not work; the 671B model size scared me so much. I'm still in shock.

krustik

Jan 12, 2025 (edited Jan 12, 2025)

> Wait, so this can't be run locally on a regular consumer gpu?

Q5_K_M GGUF on CPU uses exactly 502 GB of RAM; a GPU can help with offloading. It is running right now on 10-year-old server hardware. Consumer motherboards are the main barrier to access.
