Shunya-0.5B-Instruct

Shunya-0.5B-Instruct is an instruction-tuned version of Shunya-0.5B-Base, fine-tuned with SFT and ORPO on a curated preference dataset mix. It supports multi-turn chat, system prompts, and tool calling via a custom chat template.

Model Details

Field                 Value
--------------------  ----------------------------------
Architecture          LlamaForCausalLM
Parameters            ~503M
Hidden size           1,280
Intermediate size     4,864
Layers                20
Attention heads       10
KV heads (GQA)        2
Head dim              128
Vocab size            40,008
Context window        32,768 tokens
Positional encoding   RoPE (θ=10,000, 4× linear scaling)
Normalization         RMSNorm (ε=1e-6)
Activation            SiLU
Dtype                 bfloat16
Tied embeddings       Yes
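The ~503M figure can be cross-checked against the table above. A rough back-of-the-envelope count for a Llama-style model with tied embeddings and GQA (a sketch that ignores minor tensors, not an exact accounting):

```python
hidden, inter, layers = 1280, 4864, 20
n_heads, n_kv_heads, head_dim = 10, 2, 128
vocab = 40_008

embed = vocab * hidden  # input embeddings, tied with the LM head

# Per-layer attention: Q, K, V, and output projections (GQA shrinks K/V)
attn = (hidden * n_heads * head_dim            # q_proj
        + 2 * hidden * n_kv_heads * head_dim   # k_proj + v_proj
        + n_heads * head_dim * hidden)         # o_proj

# Per-layer SwiGLU MLP: gate, up, and down projections
mlp = 2 * hidden * inter + inter * hidden

norms = 2 * hidden  # two RMSNorm weight vectors per layer

total = layers * (attn + mlp + norms) + embed + hidden  # + final norm
print(f"{total / 1e6:.0f}M parameters")  # → 503M parameters
```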

Special tokens: <|system|>, <|user|>, <|assistant|>, <|endofturn|>, <tool_call>, </tool_call>, <tool_response>, </tool_response>
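The tokenizer ships with the authoritative Jinja chat template; purely for illustration, a turn-based layout consistent with these tokens might look like the sketch below. The exact spacing, newlines, and ordering are assumptions, not the real template:

```python
# Hypothetical rendering of the special tokens above; the real template is
# defined in the tokenizer and may differ in whitespace and details.
def render(messages, add_generation_prompt=True):
    text = "".join(
        f"<|{m['role']}|>{m['content']}<|endofturn|>" for m in messages
    )
    if add_generation_prompt:
        text += "<|assistant|>"  # cue the model to start its reply
    return text

prompt = render([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
])
```

In practice, always use `tokenizer.apply_chat_template` rather than hand-rolling the format.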

Training

  • Base model: vivekmarakana/shunya-0.5b-base
  • Method: SFT (Supervised Fine-Tuning) and ORPO (Odds Ratio Preference Optimization)
  • Fine-tuning datasets:
    • HuggingFaceTB/smoltalk2
    • NousResearch/hermes-function-calling-v1
    • lmsys/lmsys-chat-1m
    • argilla/ultrafeedback-multi-binarized-preferences-cleaned
    • mlabonne/orpo-dpo-mix-40k
  • License: MIT
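ORPO adds a log-odds-ratio penalty to the standard SFT loss, pushing probability mass toward chosen responses and away from rejected ones without a separate reference model. A minimal numeric sketch of the odds-ratio term, using illustrative probabilities rather than real model outputs:

```python
import math

def log_odds(p):
    # odds(p) = p / (1 - p); ORPO compares chosen vs. rejected in log space
    return math.log(p / (1.0 - p))

# Hypothetical average token probabilities under the policy
p_chosen, p_rejected = 0.70, 0.40

ratio = log_odds(p_chosen) - log_odds(p_rejected)
l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)

# Schematic total objective: loss = loss_sft + lam * l_or
```

When the policy prefers the chosen response (ratio > 0), the penalty `l_or` falls below -log(0.5), so minimizing it widens the gap between chosen and rejected responses.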

Usage

Chat

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("vivekmarakana/shunya-0.5b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "vivekmarakana/shunya-0.5b-instruct",
    torch_dtype=torch.bfloat16,  # native training dtype
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Tool Calling

The model supports structured tool calls using <tool_call> / <tool_response> tags. Pass a list of tools to apply_chat_template:

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in Mumbai?"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
# Keep special tokens so the <tool_call> ... </tool_call> tags stay visible
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=False))
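After generation, the <tool_call> span can be extracted and executed client-side. A small parsing sketch, assuming the model emits a JSON payload with "name" and "arguments" keys inside the tags (the exact payload shape is an assumption):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    # Pull every <tool_call>...</tool_call> JSON payload out of the generation
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

sample = (
    '<tool_call>{"name": "get_weather", '
    '"arguments": {"location": "Mumbai"}}</tool_call>'
)
calls = extract_tool_calls(sample)
```

Each parsed call can then be dispatched to the matching function, with the result fed back to the model wrapped in <tool_response> tags.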

Benchmarks

All evaluations were run with lm-evaluation-harness. Results use normalized accuracy (acc_norm) for completion tasks (ARC, PIQA, HellaSwag) and acc for classification tasks.

Benchmark      Shots  Metric    Shunya-0.5B-Instruct  Qwen3-0.6B-Instruct  Gemma3-1B-IT
-------------  -----  --------  --------------------  -------------------  ------------
ARC-Challenge  25     acc_norm  25.17                 30.12                38.23
ARC-Easy       0      acc_norm  40.49                 34.68                47.60
HellaSwag      10     acc_norm  34.94                 38.42                41.22
MMLU           5      acc       24.38                 22.95                29.08
WinoGrande     0      acc       52.49                 53.51                55.25
BoolQ          0      acc       48.04                 37.83                74.19
PIQA           0      acc_norm  64.85                 64.96                68.88
Social IQA     0      acc       37.46                 37.21                42.43
GPQA Main      5      acc       24.33                 21.43                25.45
GPQA Diamond   5      acc       28.79                 20.20                26.77
AGIEval EN     5      acc       16.80                 17.68                18.43

Qwen3-0.6B-Instruct and Gemma3-1B-IT are included as reference points; both have more parameters and/or larger training budgets.

Notable results:

  • On GPQA Diamond (graduate-level science), Shunya-0.5B-Instruct (28.79%) outperforms both Qwen3-0.6B-Instruct (20.20%) and Gemma3-1B-IT (26.77%).
  • On ARC-Easy and BoolQ, Shunya-0.5B-Instruct outperforms Qwen3-0.6B-Instruct.

License

MIT
