SmolLM3-3B-instruct-customerservice

This model is a QLoRA fine-tuned version of HuggingFaceTB/SmolLM3-3B-Instruct, trained on a context-summarized multi-turn customer-service QA dataset of banking-domain conversations.

Model Description

This is a QLoRA (Quantized Low-Rank Adaptation) fine-tuned version of SmolLM3-3B-Instruct optimized for multi-turn customer-service question answering with context summarization. The model was trained on synthetic banking customer-service conversations with history summarization to preserve essential conversational context while maintaining dialogue continuity.

Base Model: HuggingFaceTB/SmolLM3-3B-Instruct
Parameters: ~3 billion
Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
Domain: Customer Service (Banking)
Task: Context-Summarized Multi-Turn Question Answering
Note: Reasoning (extended thinking) was disabled during training and inference (no thinking tags)
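
Because the adapter was trained without thinking traces, inference should also be run with extended thinking turned off. A minimal sketch of prompt construction, assuming the /no_think system-prompt flag described in the base SmolLM3 documentation (verify against the current base model card):

# Hedged sketch: /no_think follows the base SmolLM3 documentation (assumption);
# the fine-tuned adapter itself was trained without thinking tags.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Instruct")

messages = [
    {"role": "system", "content": "/no_think You are a banking customer-service agent."},
    {"role": "user", "content": "Can I dispute a charge on my statement?"},
]

# Render the prompt without a reasoning segment; generation then proceeds as usual
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)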

Intended Uses & Limitations

Intended Uses

  • Multi-turn customer service conversations in banking domain
  • Context-aware response generation with dialogue continuity
  • Real-time customer support automation
  • Efficient deployment on resource-constrained hardware
  • Privacy-preserving on-premise deployment

Limitations

  • Primarily trained on banking domain data; may require adaptation for other sectors
  • Performance based on synthetic data; real-world variability may differ
  • Requires context summarization for optimal performance
  • Maximum sequence length: 512 tokens
  • Lower task performance than other ~3B models (LLaMA, Qwen, Phi) in the accompanying comparative evaluation
  • Struggles with dialogue continuity and contextual alignment

Training Data

Dataset: Synthetic context-summarized multi-turn customer-service QA dataset
Source: Derived from TalkMap Banking Conversation Corpus
Size: 128,335 training instances, 18,333 validation instances
Conversation Turns: 2-53 turns per conversation (avg: 10.06)
Context Strategy: History summarization using GPT-4o-mini
Response Refinement: GPT-4.1-based response quality enhancement
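
To make the data format concrete, each training instance pairs the summarized conversation history with the latest client question and a refined agent response. The record below is illustrative only; the field names are hypothetical, not the dataset's actual schema:

# Illustrative shape of a single context-summarized training instance
# (field names and text are examples, not verbatim dataset content)
example_instance = {
    "instruction": "You are a professional call-center customer service agent ...",
    "history_summary": "The client contacted the bank about an unexpected charge; "
                       "the agent explained the dispute process ...",
    "client_question": "What happens if I'm not satisfied with the investigation?",
    "agent_response": "If you're not happy with the outcome, we can escalate the "
                      "dispute for further review and keep you updated throughout.",
}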

Training Procedure

Training Configuration

  • Framework: Unsloth + Hugging Face Transformers
  • Fine-tuning Method: QLoRA (4-bit quantization)
  • Hardware: NVIDIA A100 40 GB GPU
  • Training Time: 5-14 hours

Training Hyperparameters

  • Max Sequence Length: 512 tokens
  • Quantization: 4-bit precision
  • LoRA Rank (r): 16
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • LoRA Target Modules: All attention and feed-forward projection layers
  • Epochs: 3
  • Optimizer: AdamW 8-bit
  • Learning Rate: 2e-5
  • Weight Decay: 0.01
  • Warmup Ratio: 0.05
  • LR Scheduler: Cosine
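
Training was run with Unsloth, and the hyperparameters above correspond to a standard QLoRA setup. The sketch below approximates that configuration with plain PEFT + bitsandbytes rather than the authors' exact Unsloth script; the batch size, gradient accumulation, and target-module names are assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA adapters on attention and feed-forward projections (module names assumed)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Optimization settings mirroring the card; batch sizes are assumptions
training_args = TrainingArguments(
    output_dir="smollm3-customerservice-qlora",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",               # 8-bit AdamW
    per_device_train_batch_size=4,        # assumption
    gradient_accumulation_steps=4,        # assumption
    bf16=True,
)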

Inference Parameters

generation_config = {
    "max_new_tokens": 128,
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 0.95,
    "top_k": 50,
}

Usage Example

Installation

pip install unsloth transformers peft torch

Loading the Model

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Lakshan2003/SmolLM3-3B-instruct-customerservice")

# Merge adapter (optional, for deployment)
model = model.merge_and_unload()
model.eval()
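
If GPU memory is limited, the base model can instead be loaded in 4-bit (matching the training-time quantization) before attaching the adapter. This is a sketch and requires the bitsandbytes package; note that merging the adapter is best done on a non-quantized base model:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 loading to reduce memory (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model_4bit = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter on top of the quantized base; keep it unmerged
model = PeftModel.from_pretrained(base_model_4bit, "Lakshan2003/SmolLM3-3B-instruct-customerservice")
model.eval()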

Inference

# Prompt template using SmolLM3's ChatML-style chat tags
prompt_template = """<|im_start|>system
{instruction}<|im_end|>
<|im_start|>user
Conversation History:
{history}

Client Question:
{client_question}<|im_end|>
<|im_start|>assistant
"""

# Example conversation
instruction = "You are a professional call-center customer service agent working at Optimal Financial Partners. Review the conversation history and any provided context (if available). Make sure your response is consistent with the conversation history (names, issues, and actions already taken). If no history is given, treat the client’s message as the start of the conversation. Continue the dialogue as the agent by giving a clear, helpful, and professional response. Responses should sound natural and human-like, like a real phone call, and usually be few short sentences. Provide more detail when the client’s request clearly requires it."
history = "Kathrine has contacted Almira from Optimal Financial Partners regarding unexpected charges on her statement and her rights as a consumer. Almira confirmed that as a customer, Kathrine has the right to dispute any unauthorized or incorrect charges. Almira offered to investigate any charges Kathrine believes are incorrect. No specific charges, amounts, or account identifiers have been mentioned, and no verification steps have been completed or are pending at this time. The conversation is currently focused on explaining consumer rights and the process for disputing charges."
client_question = "That's great to know. What if I'm not satisfied with the outcome of the investigation?"

# Format input
input_text = prompt_template.format(
    instruction=instruction,
    history=history,
    client_question=client_question
)

# Tokenize
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode response
input_length = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()
print(response)
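
As an alternative to hand-writing the prompt string, the same input can be built with the tokenizer's chat template, folding the summarized history and the new question into the user turn. This sketch reuses the variables defined above and assumes the adapter responds well to the base model's default template:

# Build the same prompt via the chat template instead of manual string formatting
messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": f"Conversation History:\n{history}\n\nClient Question:\n{client_question}"},
]

chat_inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    chat_outputs = model.generate(
        chat_inputs,
        max_new_tokens=128,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens and decode only the newly generated reply
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[1]:], skip_special_tokens=True).strip())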

Framework Versions

  • PEFT: 0.14.0
  • Transformers: 4.47.0
  • PyTorch: 2.5.1+cu121
  • Unsloth: Latest (training framework)

Citation

If you use this model, please cite:

@article{cooray2026small,
  title={Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation},
  author={Cooray, Lakshan and Sumanathilaka, Deshan and Raju, Pattigadapa Venkatesh},
  journal={arXiv preprint arXiv:2602.00665},
  year={2026}
}

Model Card Contact

Author: Lakshan Cooray
Institution: Informatics Institute of Technology, Colombo, Sri Lanka
Email: [email protected]

License

This model inherits the license of the base HuggingFaceTB/SmolLM3-3B-Instruct model. Please refer to the base model card on Hugging Face for the applicable license terms.

Ethical Considerations

  • Model trained on synthetic banking data to preserve privacy
  • Should be used with human oversight in production environments
  • May require domain adaptation for non-banking customer service
  • Performance may vary on real-world data with different distributions
  • Comparatively lower performance suggests the need for careful evaluation before deployment
  • Consider alternative models for production customer-service applications