Korean Sentence-BERT 384d (Dimension-Reduced)

์ด ๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์„ ์œ„ํ•œ ์ฐจ์› ์ถ•์†Œ Sentence-BERT ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. 768์ฐจ์›์„ 384์ฐจ์›์œผ๋กœ ์ถ•์†Œํ•˜์—ฌ 50% ๋” ๋น ๋ฅด๊ณ  ๊ฐ€๋ฒผ์šด ์ž„๋ฒ ๋”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ชจ๋ธ ์ •๋ณด

  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: jhgan/ko-sbert-multitask
  • ์ž„๋ฒ ๋”ฉ ์ฐจ์›: 384 (์›๋ณธ: 768)
  • ์ฐจ์› ์ถ•์†Œ์œจ: 50%
  • ์œ ์‚ฌ๋„ ๋ณด์กด์œจ: 99.25%
  • ์ตœ์ข… ์†์‹ค: 0.000301
  • ์ตœ๋Œ€ ์‹œํ€€์Šค ๊ธธ์ด: 128
  • ์–ธ์–ด: ํ•œ๊ตญ์–ด (Korean)
  • ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ: sentence-transformers

์„ฑ๋Šฅ ํŠน์ง•

โœจ ๋น ๋ฅธ ์†๋„: 768์ฐจ์› ๋Œ€๋น„ ์•ฝ 2๋ฐฐ ๋น ๋ฅธ ์ฒ˜๋ฆฌ
๐Ÿ’พ ์ ์€ ๋ฉ”๋ชจ๋ฆฌ: 50% ๊ฐ์†Œ๋œ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
๐ŸŽฏ ๋†’์€ ์ •ํ™•๋„: 99.25% ์œ ์‚ฌ๋„ ๋ณด์กด
๐Ÿ“ฆ ์™„๋ฒฝํ•œ ํ˜ธํ™˜: SentenceTransformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์™„์ „ ์ง€์›

์‚ฌ์šฉ ๋ฐฉ๋ฒ•

์„ค์น˜

pip install sentence-transformers

๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•

from sentence_transformers import SentenceTransformer

# ๋ชจ๋ธ ๋กœ๋“œ
model = SentenceTransformer('YOUR_USERNAME/ko-sbert-384-reduced')

# ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ
sentences = ['์•ˆ๋…•ํ•˜์„ธ์š”', '๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค', '์ข‹์€ ์•„์นจ์ž…๋‹ˆ๋‹ค']
embeddings = model.encode(sentences)

print(f"์ž„๋ฒ ๋”ฉ shape: {embeddings.shape}")
# ์ถœ๋ ฅ: ์ž„๋ฒ ๋”ฉ shape: (3, 384)

๋ฌธ์žฅ ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('YOUR_USERNAME/ko-sbert-384-reduced')

# ๋ฌธ์žฅ ์Œ
sentences = [
    '๋‚ ์”จ๊ฐ€ ์ข‹์Šต๋‹ˆ๋‹ค',
    '์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ •๋ง ์ข‹๋„ค์š”',
    'ํŒŒ์ด์ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ฐฐ์šฐ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค'
]

# ์ž„๋ฒ ๋”ฉ ์ƒ์„ฑ
embeddings = model.encode(sentences, convert_to_tensor=True)

# ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ
cosine_scores = util.cos_sim(embeddings, embeddings)

print("๋ฌธ์žฅ ์œ ์‚ฌ๋„ ํ–‰๋ ฌ:")
print(cosine_scores)

์˜๋ฏธ ๊ฒ€์ƒ‰ (Semantic Search)

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('YOUR_USERNAME/ko-sbert-384-reduced')

# ๋ฌธ์„œ ์ปฌ๋ ‰์…˜
documents = [
    '์ธ๊ณต์ง€๋Šฅ์€ ์ปดํ“จํ„ฐ ๊ณผํ•™์˜ ํ•œ ๋ถ„์•ผ์ž…๋‹ˆ๋‹ค',
    '๋จธ์‹ ๋Ÿฌ๋‹์€ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜๋Š” ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค',
    '์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋Š” ์ธ๊ฐ„์˜ ์–ธ์–ด๋ฅผ ๋‹ค๋ฃน๋‹ˆ๋‹ค',
    '๋”ฅ๋Ÿฌ๋‹์€ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค',
    'ํ•œ๊ตญ์–ด๋Š” ๊ต์ฐฉ์–ด์ž…๋‹ˆ๋‹ค'
]

# ์ฟผ๋ฆฌ
query = '์ธ๊ณต์ง€๋Šฅ์— ๋Œ€ํ•ด ์•Œ๋ ค์ฃผ์„ธ์š”'

# ์ž„๋ฒ ๋”ฉ
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# ์ƒ์œ„ ๊ฒฐ๊ณผ ์ •๋ ฌ
top_results = scores.argsort(descending=True)

print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for idx in top_results[:3]:
    print(f"  ์ ์ˆ˜: {scores[idx]:.4f} - {documents[idx]}")

๋ฐฐ์น˜ ์ฒ˜๋ฆฌ

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('YOUR_USERNAME/ko-sbert-384-reduced')

# ๋Œ€๋Ÿ‰์˜ ๋ฌธ์žฅ ์ฒ˜๋ฆฌ
large_corpus = ['๋ฌธ์žฅ 1', '๋ฌธ์žฅ 2', ...] # ์ˆ˜์ฒœ~์ˆ˜๋งŒ ๊ฐœ

# ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ๋กœ ๋น ๋ฅด๊ฒŒ ์ž„๋ฒ ๋”ฉ
embeddings = model.encode(
    large_corpus,
    batch_size=64,
    show_progress_bar=True,
    convert_to_tensor=True
)

ํ™œ์šฉ ์‚ฌ๋ก€

  1. ๋ฌธ์žฅ ์œ ์‚ฌ๋„ ์ธก์ •: ๋‘ ๋ฌธ์žฅ์ด ์–ผ๋งˆ๋‚˜ ๋น„์Šทํ•œ์ง€ ๊ณ„์‚ฐ
  2. ์˜๋ฏธ ๊ฒ€์ƒ‰ (Semantic Search): ์ฟผ๋ฆฌ์™€ ๊ฐ€์žฅ ๊ด€๋ จ ์žˆ๋Š” ๋ฌธ์„œ ์ฐพ๊ธฐ
  3. ํ…์ŠคํŠธ ํด๋Ÿฌ์Šคํ„ฐ๋ง: ๋น„์Šทํ•œ ๋ฌธ์žฅ๋“ค์„ ๊ทธ๋ฃนํ™”
  4. ์ค‘๋ณต ํƒ์ง€: ์œ ์‚ฌํ•œ ํ…์ŠคํŠธ ์ฐพ๊ธฐ
  5. ์ถ”์ฒœ ์‹œ์Šคํ…œ: ์ฝ˜ํ…์ธ  ๊ธฐ๋ฐ˜ ์ถ”์ฒœ
  6. ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ: ์งˆ๋ฌธ๊ณผ ๊ด€๋ จ๋œ ๋‹ต๋ณ€ ์ฐพ๊ธฐ
  7. ๋ฌธ์„œ ๋ถ„๋ฅ˜: ํ…์ŠคํŠธ ์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„๋ฅ˜

๊ธฐ์ˆ  ์ƒ์„ธ

์•„ํ‚คํ…์ฒ˜

Input Text (ํ•œ๊ตญ์–ด ๋ฌธ์žฅ)
    โ†“
Tokenization (ํ† ํฐํ™”)
    โ†“
BERT Encoder (768์ฐจ์›)
    โ†“
Mean Pooling (ํ‰๊ท  ํ’€๋ง)
    โ†“
Linear Layer (768 โ†’ 384)
    โ†“
Layer Normalization
    โ†“
L2 Normalization
    โ†“
Output Embedding (384์ฐจ์›)
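The reduction head at the bottom of this pipeline can be sketched as a small PyTorch module. This is a hypothetical reconstruction for illustration; the released model's actual module names and weights may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReductionHead(nn.Module):
    """Projects mean-pooled BERT embeddings from 768 to 384 dims."""
    def __init__(self, in_dim=768, out_dim=384):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)   # 768 -> 384
        self.norm = nn.LayerNorm(out_dim)          # layer normalization

    def forward(self, x):
        x = self.norm(self.linear(x))
        return F.normalize(x, p=2, dim=-1)         # L2 normalization (unit length)

pooled = torch.randn(3, 768)   # stand-in for mean-pooled BERT outputs
out = ReductionHead()(pooled)
print(out.shape)               # torch.Size([3, 384])
```

The final L2 normalization means dot products of outputs equal cosine similarities, which is what the distillation loss below operates on.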

ํ•™์Šต ์ •๋ณด

  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: jhgan/ko-sbert-multitask
  • ํ•™์Šต ๋ฐฉ๋ฒ•: Similarity Matrix Distillation
  • ์†์‹ค ํ•จ์ˆ˜: MSE on Cosine Similarity Matrices
  • ์ตœ์ ํ™”: Adam (lr=0.001)
  • ์ •๊ทœํ™”: L2 Normalization
  • ์ตœ๋Œ€ ๊ธธ์ด: 128 tokens

์„ฑ๋Šฅ ๋น„๊ต

์ฐจ์› ์†๋„ ๋ฉ”๋ชจ๋ฆฌ ์œ ์‚ฌ๋„ ๋ณด์กด
768 (์›๋ณธ) 1.0x 100% 100%
384 (๋ณธ ๋ชจ๋ธ) ~2.0x 50% 99.2%

๋ฐ์ดํ„ฐ์…‹

๋‹ค์Œ ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • KorNLI (์ž์—ฐ์–ด ์ถ”๋ก )
  • KorSTS (์˜๋ฏธ ์œ ์‚ฌ๋„)
  • NSMC (์˜ํ™” ๋ฆฌ๋ทฐ)
  • KorQuAD (์งˆ์˜์‘๋‹ต)
  • ์ถ”๊ฐ€ ํ•œ๊ตญ์–ด ๋Œ€ํ™” ๋ฐ์ดํ„ฐ

ํ˜ธํ™˜์„ฑ

์ง€์› ๋ฒ„์ „

  • โœ… sentence-transformers >= 2.0.0
  • โœ… transformers >= 4.0.0
  • โœ… PyTorch >= 1.6.0

๋ชจ๋“  ํ‘œ์ค€ ๋ฉ”์„œ๋“œ ์ง€์›

model = SentenceTransformer('YOUR_USERNAME/ko-sbert-384-reduced')

# ๋‹ค์–‘ํ•œ ์˜ต์…˜ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
embeddings = model.encode(sentences)
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings = model.encode(sentences, show_progress_bar=True)
embeddings = model.encode(sentences, batch_size=32)
embeddings = model.encode(sentences, normalize_embeddings=True)

๋ผ์ด์„ ์Šค

Apache 2.0 License

์ธ์šฉ

์ด ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์‹œ๋Š” ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”:

@misc{ko-sbert-384-reduced,
  author = {Your Name},
  title = {Korean Sentence-BERT 384d (Dimension-Reduced)},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/ko-sbert-384-reduced}}
}

๋ฌธ์˜

๋ฌธ์ œ๋‚˜ ์ œ์•ˆ์‚ฌํ•ญ์ด ์žˆ์œผ์‹œ๋ฉด ๋ชจ๋ธ ๋ ˆํฌ์ง€ํ† ๋ฆฌ์— ์ด์Šˆ๋ฅผ ๋‚จ๊ฒจ์ฃผ์„ธ์š”.


์ƒ์„ฑ์ผ: 2025-10-15
ํ”„๋ ˆ์ž„์›Œํฌ: PyTorch
๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ: sentence-transformers
์ฐจ์› ์ถ•์†Œ: 768 โ†’ 384 (99.25% ๋ณด์กด)
