Instructions to use issai/Qolda_GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use issai/Qolda_GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="issai/Qolda_GGUF", filename="BF16/Qolda-BF16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use issai/Qolda_GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf issai/Qolda_GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf issai/Qolda_GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf issai/Qolda_GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf issai/Qolda_GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf issai/Qolda_GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf issai/Qolda_GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf issai/Qolda_GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf issai/Qolda_GGUF:Q4_K_M
Use Docker
docker model run hf.co/issai/Qolda_GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use issai/Qolda_GGUF with Ollama:
ollama run hf.co/issai/Qolda_GGUF:Q4_K_M
- Unsloth Studio new
How to use issai/Qolda_GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for issai/Qolda_GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for issai/Qolda_GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for issai/Qolda_GGUF to start chatting
- Pi new
How to use issai/Qolda_GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf issai/Qolda_GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "issai/Qolda_GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use issai/Qolda_GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf issai/Qolda_GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default issai/Qolda_GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use issai/Qolda_GGUF with Docker Model Runner:
docker model run hf.co/issai/Qolda_GGUF:Q4_K_M
- Lemonade
How to use issai/Qolda_GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull issai/Qolda_GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qolda_GGUF-Q4_K_M
List all available models
lemonade list
YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
GGUF Inference Performance Benchmarks
Original Model: issai/Qolda
Benchmarks conducted using llama.cpp with llama-bench.
About Qolda
Qolda is a 4.3B parameter vision-language model designed for Kazakh, Russian, and English. Built on InternVL3.5 and Qwen3, it combines the InternViT-300M vision encoder with the Qwen3-4B language model. The name reflects both accessibility ("қолда" — in hand) and support ("қолдау" — to support).
Test Configuration
| Parameter | Value |
|---|---|
| Prompt tokens (pp) | 1024 |
| Generation tokens (tg) | 256 |
| Runs per test | 3 |
| Flash Attention | Enabled |
Hardware Configurations
NVIDIA A100-SXM4-40GB
- Backend: CUDA + NVIDIA Tensor Cores (Ampere architecture)
- Memory: 40 GB HBM2e
- Memory Bandwidth: 1,555 GB/s
- Compute Performance (Peak):
- FP16 / BF16 (Tensor-Core): 312 TFLOPS
- FP16 / BF16 w/ Sparsity (Tensor-Core): 624 TFLOPS
- INT8 (Tensor-Core): 624 TOPS
- INT8 w/ Sparsity (Tensor-Core): 1,248 TOPS
- FP32 (CUDA cores): 19.5 TFLOPS
- FP64 (CUDA cores): 9.7 TFLOPS
Apple MacBook Pro M4 Pro
- Backend: Metal + Apple Silicon GPU (unified memory architecture)
- GPU: Apple M4 Pro 16-core GPU
- Memory: 24 GB unified LPDDR5X
- Memory Bandwidth: 273 GB/s
- Compute Performance (Peak):
- FP32: ~7.4 TFLOPS
Qualcomm Snapdragon 8 Gen 3 (OnePlus 13)
- Backend: CPU (ARM NEON / I8MM-enabled)
- CPU: Octa-core (1×3.3 GHz Cortex-X4 + 5×3.2 GHz Cortex-A720 + 2×2.3 GHz Cortex-A520)
- GPU: Adreno 750 (not used in this benchmark)
- Memory: 16 GB LPDDR5X
- Memory Bandwidth: 77 GB/s
- Compute Performance (Peak, GPU):
- FP32: ~2.7 TFLOPS
Qualcomm Snapdragon 8 Elite Gen 5 (2026 Flagship — Projected)
- Backend: CPU (ARM NEON / I8MM-enabled)
- Process: 4 nm (TSMC)
- CPU: Octa-core (2+6 Cluster), 4.6 GHz
- GPU: Adreno 840 (not used in this benchmark)
- Memory: 16 GB LPDDR5X
- Compute Performance (Peak, GPU):
- FP32: ~5.5 TFLOPS
Benchmark Results
Prompt Processing (pp1024) — tokens/second
| Quantization | Size | A100 | M4 | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
|---|---|---|---|---|---|
| F16 | 7.49 GiB | 13,576.25 | 797.20 | 13.72 | 31.55 |
| Q8_0 | 3.98 GiB | 7,731.88 | 726.31 | 14.71 | 33.83 |
| Q6_K | 3.07 GiB | 6,953.67 | 651.65 | 8.59 | 19.76 |
| Q5_K_M | 2.69 GiB | 7,387.72 | 686.72 | 8.84 | 20.33 |
| Q5_K_S | 2.62 GiB | 7,556.95 | 654.02 | 7.72 | 17.76 |
| Q4_K_M | 2.32 GiB | 7,589.05 | 742.13 | 18.07 | 41.56 |
| Q4_K_S | 2.21 GiB | 7,783.02 | 734.80 | 20.70 | 47.61 |
| Q4_1 | 2.41 GiB | 7,751.64 | 792.08 | 11.60 | 26.68 |
| Q4_0 | 2.20 GiB | 7,853.97 | 739.13 | 36.96 | 85.00 |
| IQ4_NL | 2.22 GiB | 7,799.43 | 746.65 | 18.99 | 43.68 |
| IQ4_XS | 2.12 GiB | 8,032.40 | 726.70 | 8.74 | 20.10 |
| Q3_K_M | 1.93 GiB | 6,264.09 | 643.15 | 10.92 | 25.12 |
| Q3_K_S | 1.75 GiB | 5,717.55 | 633.62 | 7.75 | 17.82 |
| Q2_K | 1.55 GiB | 5,154.50 | 660.86 | 8.70 | 20.01 |
| TQ1_0 | 1.01 GiB | — | — | 11.96 | 27.50 |
Text Generation (tg256) — tokens/second
| Quantization | Size | A100 | M4 | SD 8 Gen 3 | SD 8 Elite Gen 5 (proj.) |
|---|---|---|---|---|---|
| F16 | 7.49 GiB | 123.46 | 26.03 | 4.34 | 9.98 |
| Q8_0 | 3.98 GiB | 154.43 | 47.66 | 7.50 | 17.25 |
| Q6_K | 3.07 GiB | 150.04 | 43.91 | 6.31 | 14.52 |
| Q5_K_M | 2.69 GiB | 169.12 | 57.77 | 7.16 | 16.46 |
| Q5_K_S | 2.62 GiB | 174.78 | 58.70 | 6.70 | 15.41 |
| Q4_K_M | 2.32 GiB | 179.73 | 68.92 | 9.73 | 22.38 |
| Q4_K_S | 2.21 GiB | 186.12 | 71.96 | 10.67 | 24.54 |
| Q4_1 | 2.41 GiB | 207.61 | 71.21 | 6.72 | 15.45 |
| Q4_0 | 2.20 GiB | 199.58 | 71.02 | 12.37 | 28.45 |
| IQ4_NL | 2.22 GiB | 194.92 | 69.07 | 10.77 | 24.77 |
| IQ4_XS | 2.12 GiB | 201.83 | 69.76 | 7.40 | 17.02 |
| Q3_K_M | 1.93 GiB | 150.29 | 61.93 | 8.19 | 18.83 |
| Q3_K_S | 1.75 GiB | 138.07 | 58.16 | 6.97 | 16.03 |
| Q2_K | 1.55 GiB | 166.89 | 70.61 | 8.21 | 18.88 |
| TQ1_0 | 1.01 GiB | — | — | 8.82 | 20.28 |
Vision Encoder Benchmarks (InternViT-300M)
The vision encoder processes images separately from the LLM backbone. These benchmarks measure end-to-end image processing latency including encoding and decoding stages.
Image Processing Latency
| Stage | A100 | M4 Pro | SD 8 Gen 3 |
|---|---|---|---|
| Image Slice Encoding | 17 ms | 295 ms | 7,285 ms |
| Image Decoding (total) | 79 ms | 1,375 ms | 9,112 ms |
| Total Processing | 96 ms | 1,670 ms | 16,397 ms |
Memory Usage
| Device | BF16 | Q8_0 |
|---|---|---|
| Desktop (M4 Pro) | 9.6 GB | 1.4 GB |
| Mobile (SD 8 Gen 3) | 1.5 GB | 1.2 GB |
Notes
- A100 Results: Full GPU offload with flash attention enabled
- Macbook M4 Pro Results: Metal backend leveraging unified memory architecture
- Snapdragon Results: CPU-only inference; Q4_0 shows exceptional ARM NEON optimization
- All measurements report mean ± standard deviation across 3 runs
- Higher values indicate better performance
- These benchmarks are for the LLM backbone (Qwen3-4B) component; vision encoder runs separately
License
This model is licensed under the Apache License 2.0.
- Downloads last month
- 112
1-bit
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit