Korean SBERT 384-dim

A model that reduces the embedding dimension of jhgan/ko-sbert-multitask from 768 to 384.

Performance Metrics

  • πŸ“ 차원: 768 β†’ 384 (50% κ°μ†Œ)
  • 🎯 μœ μ‚¬λ„ 보쑴율: 99.77%
  • πŸ“Š μ΅œμ’… 손싀: 0.000205
  • πŸ”’ ν•™μŠ΅ μƒ˜ν”Œ: 800개

Data Sources

  • KorNLI (μžμ—°μ–΄ μΆ”λ‘ )
  • KorSTS (의미 μœ μ‚¬λ„)
  • NSMC (μ˜ν™” 리뷰)
  • KorQuAD (μ§ˆμ˜μ‘λ‹΅)

Usage

import torch
from transformers import AutoModel, AutoTokenizer

class DimensionReducer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(768, 384)
        self.layer_norm = torch.nn.LayerNorm(384)
    
    def forward(self, x):
        return self.layer_norm(self.linear(x))

# λͺ¨λΈ λ‘œλ“œ
model = AutoModel.from_pretrained("kimseongsan/ko-sbert-384")
tokenizer = AutoTokenizer.from_pretrained("kimseongsan/ko-sbert-384")

# Load the trained dimension reducer
reducer = DimensionReducer()
reducer.load_state_dict(torch.load("reducer.pt", map_location="cpu"))
reducer.eval()

def encode(sentences):
    if isinstance(sentences, str):
        sentences = [sentences]
    
    inputs = tokenizer(sentences, padding=True, truncation=True, 
                      max_length=128, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs)
        attention_mask = inputs['attention_mask']
        token_embeddings = outputs.last_hidden_state
        
        # Mean pooling over non-padding tokens
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        embeddings = sum_embeddings / sum_mask
        
        # Project the 768-dim sentence embedding down to 384
        reduced = reducer(embeddings)
    
    return reduced

# Example
sentences = ["μ•ˆλ…•ν•˜μ„Έμš”", "λ°˜κ°‘μŠ΅λ‹ˆλ‹€"]
embeddings = encode(sentences)
print(f"Shape: {embeddings.shape}")  # torch.Size([2, 384])

Training Details

  • Optimizer: Adam (lr=0.001)
  • Loss: MSE on similarity matrices
  • Epochs: 100
  • Batch Size: 32
  • Device: cuda:0

λ‹€μŒ 단계

이 λͺ¨λΈμ„ INT8 μ–‘μžν™”ν•˜λ €λ©΄:

python quantize_model.py --model kimseongsan/ko-sbert-384
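The contents of quantize_model.py are not shown in this card. If you only need INT8 weights for the Linear layers, PyTorch's built-in dynamic quantization is one option; the following is a sketch under that assumption, not necessarily what the script does.

import torch

# Dynamic INT8 quantization of the Linear layers
# (model and reducer as loaded in the Usage section)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_reducer = torch.quantization.quantize_dynamic(
    reducer, {torch.nn.Linear}, dtype=torch.qint8
)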