SmolLM3-3B-instruct-customerservice

This model is a QLoRA fine-tuned version of HuggingFaceTB/SmolLM3-3B-Instruct, trained on a context-summarized multi-turn customer-service QA dataset of banking-domain conversations.

Model Description

This is a QLoRA (Quantized Low-Rank Adaptation) fine-tuned version of SmolLM3-3B-Instruct optimized for multi-turn customer-service question answering with context summarization. The model was trained on synthetic banking customer-service conversations with history summarization to preserve essential conversational context while maintaining dialogue continuity.

Base Model: HuggingFaceTB/SmolLM3-3B-Instruct
Parameters: ~3 billion
Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
Domain: Customer Service (Banking)
Task: Context-Summarized Multi-Turn Question Answering
Note: Reasoning (extended thinking) was disabled during training and inference (no thinking tags)
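
Because the adapter was trained without thinking traces, inference should also be run with extended thinking turned off. A minimal sketch of prompt construction, assuming the /no_think system-prompt flag described in the base SmolLM3 documentation (verify against the current base model card):

# Hedged sketch: /no_think follows the base SmolLM3 documentation (assumption);
# the fine-tuned adapter itself was trained without thinking tags.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Instruct")

messages = [
    {"role": "system", "content": "/no_think You are a banking customer-service agent."},
    {"role": "user", "content": "Can I dispute a charge on my statement?"},
]

# Render the prompt without a reasoning segment; generation then proceeds as usual
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)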

Intended Uses & Limitations

Intended Uses

  • Multi-turn customer service conversations in banking domain
  • Context-aware response generation with dialogue continuity
  • Real-time customer support automation
  • Efficient deployment on resource-constrained hardware
  • Privacy-preserving on-premise deployment

Limitations

  • Primarily trained on banking domain data; may require adaptation for other sectors
  • Performance based on synthetic data; real-world variability may differ
  • Requires context summarization for optimal performance
  • Maximum sequence length: 512 tokens
  • Lower task performance than other ~3B models (LLaMA, Qwen, Phi) in the accompanying comparative evaluation
  • Struggles with dialogue continuity and contextual alignment

Training Data

Dataset: Synthetic context-summarized multi-turn customer-service QA dataset
Source: Derived from TalkMap Banking Conversation Corpus
Size: 128,335 training instances, 18,333 validation instances
Conversation Turns: 2-53 turns per conversation (avg: 10.06)
Context Strategy: History summarization using GPT-4o-mini
Response Refinement: GPT-4.1-based response quality enhancement
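
To make the data format concrete, each training instance pairs the summarized conversation history with the latest client question and a refined agent response. The record below is illustrative only; the field names are hypothetical, not the dataset's actual schema:

# Illustrative shape of a single context-summarized training instance
# (field names and text are examples, not verbatim dataset content)
example_instance = {
    "instruction": "You are a professional call-center customer service agent ...",
    "history_summary": "The client contacted the bank about an unexpected charge; "
                       "the agent explained the dispute process ...",
    "client_question": "What happens if I'm not satisfied with the investigation?",
    "agent_response": "If you're not happy with the outcome, we can escalate the "
                      "dispute for further review and keep you updated throughout.",
}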

Training Procedure

Training Configuration

  • Framework: Unsloth + Hugging Face Transformers
  • Fine-tuning Method: QLoRA (4-bit quantization)
  • Hardware: NVIDIA A100 40 GB GPU
  • Training Time: 5-14 hours

Training Hyperparameters

  • Max Sequence Length: 512 tokens
  • Quantization: 4-bit precision
  • LoRA Rank (r): 16
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • LoRA Target Modules: All attention and feed-forward projection layers
  • Epochs: 3
  • Optimizer: AdamW 8-bit
  • Learning Rate: 2e-5
  • Weight Decay: 0.01
  • Warmup Ratio: 0.05
  • LR Scheduler: Cosine
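
Training was run with Unsloth, and the hyperparameters above correspond to a standard QLoRA setup. The sketch below approximates that configuration with plain PEFT + bitsandbytes rather than the authors' exact Unsloth script; the batch size, gradient accumulation, and target-module names are assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# LoRA adapters on attention and feed-forward projections (module names assumed)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Optimization settings mirroring the card; batch sizes are assumptions
training_args = TrainingArguments(
    output_dir="smollm3-customerservice-qlora",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",               # 8-bit AdamW
    per_device_train_batch_size=4,        # assumption
    gradient_accumulation_steps=4,        # assumption
    bf16=True,
)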

Inference Parameters

generation_config = {
    "max_new_tokens": 128,
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 0.95,
    "top_k": 50,
}

Usage Example

Installation

pip install unsloth transformers peft torch

Loading the Model

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Lakshan2003/SmolLM3-3B-instruct-customerservice")

# Merge adapter (optional, for deployment)
model = model.merge_and_unload()
model.eval()
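
If GPU memory is limited, the base model can instead be loaded in 4-bit (matching the training-time quantization) before attaching the adapter. This is a sketch and requires the bitsandbytes package; note that merging the adapter is best done on a non-quantized base model:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 loading to reduce memory (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model_4bit = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter on top of the quantized base; keep it unmerged
model = PeftModel.from_pretrained(base_model_4bit, "Lakshan2003/SmolLM3-3B-instruct-customerservice")
model.eval()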

Inference

# Prompt template using SmolLM3's ChatML-style chat tags
prompt_template = """<|im_start|>system
{instruction}<|im_end|>
<|im_start|>user
Conversation History:
{history}

Client Question:
{client_question}<|im_end|>
<|im_start|>assistant
"""

# Example conversation
instruction = "You are a professional call-center customer service agent working at Optimal Financial Partners. Review the conversation history and any provided context (if available). Make sure your response is consistent with the conversation history (names, issues, and actions already taken). If no history is given, treat the client’s message as the start of the conversation. Continue the dialogue as the agent by giving a clear, helpful, and professional response. Responses should sound natural and human-like, like a real phone call, and usually be few short sentences. Provide more detail when the client’s request clearly requires it."
history = "Kathrine has contacted Almira from Optimal Financial Partners regarding unexpected charges on her statement and her rights as a consumer. Almira confirmed that as a customer, Kathrine has the right to dispute any unauthorized or incorrect charges. Almira offered to investigate any charges Kathrine believes are incorrect. No specific charges, amounts, or account identifiers have been mentioned, and no verification steps have been completed or are pending at this time. The conversation is currently focused on explaining consumer rights and the process for disputing charges."
client_question = "That's great to know. What if I'm not satisfied with the outcome of the investigation?"

# Format input
input_text = prompt_template.format(
    instruction=instruction,
    history=history,
    client_question=client_question
)

# Tokenize
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode response
input_length = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()
print(response)
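
As an alternative to hand-writing the prompt string, the same input can be built with the tokenizer's chat template, folding the summarized history and the new question into the user turn. This sketch reuses the variables defined above and assumes the adapter responds well to the base model's default template:

# Build the same prompt via the chat template instead of manual string formatting
messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": f"Conversation History:\n{history}\n\nClient Question:\n{client_question}"},
]

chat_inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    chat_outputs = model.generate(
        chat_inputs,
        max_new_tokens=128,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens and decode only the newly generated reply
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[1]:], skip_special_tokens=True).strip())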

Framework Versions

  • PEFT: 0.14.0
  • Transformers: 4.47.0
  • PyTorch: 2.5.1+cu121
  • Unsloth: Latest (training framework)

Citation

If you use this model, please cite:

@article{cooray2026small,
  title={Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation},
  author={Cooray, Lakshan and Sumanathilaka, Deshan and Raju, Pattigadapa Venkatesh},
  journal={arXiv preprint arXiv:2602.00665},
  year={2026}
}

Model Card Contact

Author: Lakshan Cooray
Institution: Informatics Institute of Technology, Colombo, Sri Lanka
Email: [email protected]

License

This model inherits the license of the base HuggingFaceTB/SmolLM3-3B-Instruct model. Please refer to the base model card on Hugging Face for the applicable license terms.

Ethical Considerations

  • Model trained on synthetic banking data to preserve privacy
  • Should be used with human oversight in production environments
  • May require domain adaptation for non-banking customer service
  • Performance may vary on real-world data with different distributions
  • Comparatively lower performance suggests the need for careful evaluation before deployment
  • Consider alternative models for production customer-service applications