# Korean SBERT 384-dim
This model reduces the embedding dimension of jhgan/ko-sbert-multitask from 768 to 384 using a learned linear projection with layer normalization, applied on top of mean-pooled sentence embeddings.
## Performance metrics

- Dimension: 768 → 384 (50% reduction)
- Similarity preservation: 99.77% (one way to measure this is sketched after this list)
- Final loss: 0.000205
- Training samples: 800
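The card does not spell out how "similarity preservation" is computed. One plausible reading, and an assumption here, is the agreement between pairwise cosine similarities produced by the original 768-dim embeddings and by the reduced 384-dim embeddings. A minimal measurement sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def similarity_preservation(emb_768: torch.Tensor, emb_384: torch.Tensor) -> float:
    """Correlation between pairwise cosine-similarity matrices before and after reduction.

    emb_768: (N, 768) embeddings from jhgan/ko-sbert-multitask
    emb_384: (N, 384) embeddings after the DimensionReducer
    This is one plausible definition, not necessarily the one behind the 99.77% figure.
    """
    sim_a = F.cosine_similarity(emb_768.unsqueeze(1), emb_768.unsqueeze(0), dim=-1)
    sim_b = F.cosine_similarity(emb_384.unsqueeze(1), emb_384.unsqueeze(0), dim=-1)
    mask = ~torch.eye(sim_a.size(0), dtype=torch.bool)  # drop self-similarities
    corr = torch.corrcoef(torch.stack([sim_a[mask], sim_b[mask]]))[0, 1]
    return corr.item()
```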
## Data sources

- KorNLI (natural language inference)
- KorSTS (semantic textual similarity)
- NSMC (movie reviews)
- KorQuAD (question answering)
## Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer
class DimensionReducer(torch.nn.Module):
    """Projects 768-dim SBERT sentence embeddings down to 384 dimensions."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(768, 384)
        self.layer_norm = torch.nn.LayerNorm(384)

    def forward(self, x):
        return self.layer_norm(self.linear(x))
# Load the base model and tokenizer
model = AutoModel.from_pretrained("kimseongsan/ko-sbert-384")
tokenizer = AutoTokenizer.from_pretrained("kimseongsan/ko-sbert-384")
# Load the dimension reducer
reducer = DimensionReducer()
reducer.load_state_dict(torch.load("reducer.pt"))
reducer.eval()
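# Note: "reducer.pt" above is assumed to be a local file. If the weights are
# hosted in the model repo, they could be downloaded first, e.g. (hypothetical
# filename, adjust to whatever the repo actually contains):
#   from huggingface_hub import hf_hub_download
#   reducer_path = hf_hub_download("kimseongsan/ko-sbert-384", "reducer.pt")
#   reducer.load_state_dict(torch.load(reducer_path))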
def encode(sentences):
    if isinstance(sentences, str):
        sentences = [sentences]

    inputs = tokenizer(sentences, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    attention_mask = inputs["attention_mask"]
    token_embeddings = outputs.last_hidden_state

    # Mean pooling over non-padding tokens
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    embeddings = sum_embeddings / sum_mask

    # Dimension reduction: 768 -> 384
    reduced = reducer(embeddings)
    return reduced
# Example
sentences = ["안녕하세요", "반갑습니다"]  # "Hello", "Nice to meet you"
embeddings = encode(sentences)
print(f"Shape: {embeddings.shape}")  # torch.Size([2, 384])
```
## Training details

- Optimizer: Adam (lr=0.001)
- Loss: MSE on similarity matrices (a training-loop sketch follows this list)
- Epochs: 100
- Batch size: 32
- Device: cuda:0
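The training script itself is not part of this card, so the following is only a minimal sketch of how "MSE on similarity matrices" could be implemented: the reducer is fit so that pairwise cosine-similarity matrices of the reduced embeddings match those of the original 768-dim embeddings. `teacher_embeddings` is a placeholder for an (N, 768) tensor of mean-pooled base-model embeddings of the training sentences.

```python
import torch
import torch.nn.functional as F

def sim_matrix(x):
    x = F.normalize(x, dim=-1)
    return x @ x.T  # pairwise cosine similarities

# Hypothetical distillation loop matching the hyperparameters listed above
reducer = DimensionReducer()
optimizer = torch.optim.Adam(reducer.parameters(), lr=0.001)

for epoch in range(100):
    perm = torch.randperm(teacher_embeddings.size(0))
    for i in range(0, len(perm), 32):  # batch size 32
        batch = teacher_embeddings[perm[i:i + 32]]
        loss = F.mse_loss(sim_matrix(reducer(batch)), sim_matrix(batch))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```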
## Next steps

To quantize this model to INT8:

```bash
python quantize_model.py --model kimseongsan/ko-sbert-384
```
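quantize_model.py is not included in this card. As a rough alternative, and only as a sketch, the projection head can be quantized to INT8 with PyTorch dynamic quantization; quantizing the transformer body itself would typically go through a separate export path (e.g. ONNX/optimum) and is not covered here.

```python
import torch

# Dynamically quantize the Linear layer of the reducer to INT8.
# This covers only the 768 -> 384 projection, not the transformer encoder.
reducer_int8 = torch.ao.quantization.quantize_dynamic(
    reducer, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(reducer_int8.state_dict(), "reducer_int8.pt")
```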