Sean Li (PRO)
Hellohal2064 · seanli3
5 followers · 1 following
AI & ML interests
AI Infrastructure Engineer | Dual DGX Sparks (230GB VRAM) | 5-node Docker Swarm | Building AI Coworker systems
Recent Activity
reacted to their post with 🚀 · 14 days ago
🚀 Excited to share: The vLLM container for NVIDIA DGX Spark!

I've been working on getting vLLM to run natively on the new DGX Spark with its GB10 Blackwell GPU (SM121 architecture). The results? 2.5x faster inference compared to llama.cpp!

🚀 Performance Highlights:
• Qwen3-Coder-30B: 44 tok/s (vs 21 tok/s with llama.cpp)
• Qwen3-Next-80B: 45 tok/s (vs 18 tok/s with llama.cpp)

🚧 Technical Challenges Solved:
• Built PyTorch nightly with CUDA 13.1 + SM121 support
• Patched vLLM for the Blackwell architecture
• Created custom MoE expert configs for GB10
• Implemented a TRITON_ATTN backend workaround

🚦 Available now:
• Docker Hub: docker pull hellohal2064/vllm-dgx-spark-gb10:latest
• HuggingFace: huggingface.co/Hellohal2064/vllm-dgx-spark-gb10

The DGX Spark's 119GB of unified memory opens up possibilities for running massive models locally. Happy to connect with others working on the DGX Spark Blackwell!
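A minimal usage sketch for the container follows. The serve entrypoint, model ID, and port are my assumptions rather than details taken from the image docs; vLLM's OpenAI-compatible server listens on port 8000 by default, so adjust to whatever the image actually expects:

# Pull the image and run it, serving a model on the default vLLM port.
# --gpus all and -p 8000:8000 are standard Docker flags; the trailing
# "vllm serve ..." command assumes the image leaves the entrypoint open.
docker pull hellohal2064/vllm-dgx-spark-gb10:latest
docker run --gpus all -p 8000:8000 \
  hellohal2064/vllm-dgx-spark-gb10:latest \
  vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct

# Once the server is up, query the OpenAI-compatible completions endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
       "prompt": "def fibonacci(n):",
       "max_tokens": 64}'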
replied to their post · 14 days ago
replied to their post · 14 days ago
Hellohal2064's Spaces (1)
🚢 Vllm Dgx Spark Gb10 Docker (no application file)