# Shunya-0.5B-Instruct
Shunya-0.5B-Instruct is an instruction-tuned version of Shunya-0.5B-Base, trained with supervised fine-tuning followed by ORPO on a curated mix of chat and preference datasets. It supports multi-turn chat, system prompts, and tool calling via a custom chat template.
## Model Details
| Field | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~503M |
| Hidden size | 1,280 |
| Intermediate size | 4,864 |
| Layers | 20 |
| Attention heads | 10 |
| KV heads (GQA) | 2 |
| Head dim | 128 |
| Vocab size | 40,008 |
| Context window | 32,768 tokens |
| Positional encoding | RoPE (θ=10,000, 4× linear scaling) |
| Normalization | RMSNorm (ε=1e-6) |
| Activation | SiLU |
| Dtype | bfloat16 |
| Tied embeddings | Yes |
Special tokens: `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endofturn|>`, `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>`
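As a sanity check, the ~503M parameter count follows directly from the architecture numbers in the table (GQA attention projections, a SiLU-gated MLP, and tied input/output embeddings). A quick back-of-the-envelope recomputation:

```python
# Recompute the ~503M figure from the config values in the table above.
vocab, hidden, inter = 40_008, 1_280, 4_864
layers, heads, kv_heads, head_dim = 20, 10, 2, 128

embed = vocab * hidden                      # shared with the LM head (tied embeddings)
attn = (hidden * heads * head_dim) * 2      # q_proj and o_proj
attn += (hidden * kv_heads * head_dim) * 2  # k_proj and v_proj (GQA: 2 KV heads)
mlp = (hidden * inter) * 3                  # gate, up, and down projections
norms = 2 * hidden                          # two RMSNorms per layer
total = embed + layers * (attn + mlp + norms) + hidden  # "+ hidden" = final norm

print(f"{total / 1e6:.0f}M")  # prints "503M"
```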
## Training
- Base model: vivekmarakana/shunya-0.5b-base
- Method: SFT (Supervised Fine-Tuning) followed by ORPO (Odds Ratio Preference Optimization)
- Fine-tuning datasets:
  - HuggingFaceTB/smoltalk2
  - NousResearch/hermes-function-calling-v1
  - lmsys/lmsys-chat-1m
  - argilla/ultrafeedback-multi-binarized-preferences-cleaned
  - mlabonne/orpo-dpo-mix-40k
- License: MIT
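For intuition on the preference stage: ORPO adds an odds-ratio penalty on top of the ordinary SFT loss, rewarding the model when it assigns higher (length-normalized) likelihood to the chosen response than to the rejected one. A minimal sketch of that penalty term (illustrative only, not the actual training code; `odds_ratio_loss` is a name chosen here):

```python
import math

def odds_ratio_loss(chosen_logprob_mean: float, rejected_logprob_mean: float) -> float:
    """ORPO's odds-ratio term for one preference pair.

    Inputs are mean per-token log-probabilities of the chosen and
    rejected responses under the model being trained.
    """
    def log_odds(logp: float) -> float:
        p = math.exp(logp)          # length-normalized sequence probability
        return math.log(p / (1.0 - p))

    ratio = log_odds(chosen_logprob_mean) - log_odds(rejected_logprob_mean)
    # -log sigmoid(ratio): small when the model already prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))
```

When the model prefers the chosen response (e.g. mean log-probs of -0.5 vs. -2.0) the penalty is small; with the preference reversed it grows, pushing probability mass toward the chosen response during training.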
## Usage

### Chat
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("vivekmarakana/shunya-0.5b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "vivekmarakana/shunya-0.5b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### Tool Calling

The model supports structured tool calls using `<tool_call>` / `<tool_response>` tags. Pass a list of tools to `apply_chat_template`:
```python
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Mumbai?"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Keep special tokens so the <tool_call> tags are visible in the output
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=False))
```
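A generated tool call arrives as a JSON payload between `<tool_call>` tags, so the caller needs to extract and parse it before dispatching to the actual function. One minimal way to do that (the helper name and the sample string below are illustrative, not part of the model card):

```python
import json
import re

def extract_tool_calls(text: str) -> list[dict]:
    """Pull JSON payloads out of <tool_call>...</tool_call> spans."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(m) for m in pattern.findall(text)]

# Illustrative output shape, not an actual generation:
sample = '<tool_call>{"name": "get_weather", "arguments": {"location": "Mumbai"}}</tool_call>'
calls = extract_tool_calls(sample)
# calls[0]["name"] == "get_weather"; calls[0]["arguments"]["location"] == "Mumbai"
```

The tool's result would then be fed back to the model in a `<tool_response>` turn for the final answer.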
## Benchmarks

All evaluations were run with `lm-evaluation-harness`. Results use normalized accuracy (`acc_norm`) for completion tasks (ARC, PIQA, HellaSwag) and `acc` for classification tasks.
| Benchmark | Shots | Metric | Shunya-0.5B-Instruct | Qwen3-0.6B-Instruct | Gemma3-1B-IT |
|---|---|---|---|---|---|
| ARC-Challenge | 25 | acc_norm | 25.17 | 30.12 | 38.23 |
| ARC-Easy | 0 | acc_norm | 40.49 | 34.68 | 47.60 |
| HellaSwag | 10 | acc_norm | 34.94 | 38.42 | 41.22 |
| MMLU | 5 | acc | 24.38 | 22.95 | 29.08 |
| WinoGrande | 0 | acc | 52.49 | 53.51 | 55.25 |
| BoolQ | 0 | acc | 48.04 | 37.83 | 74.19 |
| PIQA | 0 | acc_norm | 64.85 | 64.96 | 68.88 |
| Social IQA | 0 | acc | 37.46 | 37.21 | 42.43 |
| GPQA Main | 5 | acc | 24.33 | 21.43 | 25.45 |
| GPQA Diamond | 5 | acc | 28.79 | 20.20 | 26.77 |
| AGIEval EN | 5 | acc | 16.80 | 17.68 | 18.43 |
Qwen3-0.6B-Instruct and Gemma3-1B-IT are included as reference points; both have more parameters and/or larger training budgets.
Notable results:
- On GPQA Diamond (graduate-level science), Shunya-0.5B-Instruct (28.79%) outperforms both Qwen3-0.6B-Instruct (20.20%) and Gemma3-1B-IT (26.77%).
- On ARC-Easy and BoolQ, Shunya-0.5B-Instruct outperforms Qwen3-0.6B-Instruct.
## License

MIT