Spicy Motivator - PPO
A model that rewrites Korean motivational quotes into sarcastic sentences (trained with PPO/REINFORCE).
Model Description
- Base Model: meta-llama/Llama-3.1-8B
- Training Method: PPO (REINFORCE with a heuristic reward)
- LoRA: r=16, alpha=32
- Reward Function: heuristic based on sarcasm keywords + output length + diversity (see the sketch after this list)
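
The exact reward implementation is not published in this card; the following is a minimal sketch of the kind of sarcasm-keyword + length + diversity heuristic described above. The placeholder keywords, length band, and equal weighting are assumptions for illustration only, not the values used to train this model.

```python
# Sketch of a sarcasm-keyword + length + diversity heuristic reward.
# Keyword list, length band, and weighting are assumptions.
def heuristic_reward(text: str) -> float:
    # Placeholder markers; the actual sarcasm keyword list is not published here.
    sarcasm_keywords = ["<sarcasm-marker-1>", "<sarcasm-marker-2>", "<sarcasm-marker-3>"]

    # Keyword term: count sarcasm markers appearing in the output.
    keyword_score = float(sum(kw in text for kw in sarcasm_keywords))

    # Length term: favor outputs in a moderate character range.
    length_score = 1.0 if 10 <= len(text) <= 80 else -0.5

    # Diversity term: ratio of unique words, penalizing repetition.
    words = text.split()
    diversity_score = len(set(words)) / max(len(words), 1)

    return keyword_score + length_score + diversity_score
```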
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Guardrium/spicy-motivator-ppo")

# Generate
# Prompt format: "### Quote: Effort does not betray.\n### Sarcastic answer:"
prompt = "### 명언: 노력은 배신하지 않는다.\n### 비꼬는 답변:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
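
As background on the training method listed above, the sketch below shows a minimal REINFORCE-style update against a heuristic reward, reusing `model` and `tokenizer` from the usage example and the hypothetical `heuristic_reward` from the earlier sketch. It illustrates the general technique only; the learning rate, single-sample update, and absence of a baseline or KL penalty are assumptions, not the project's actual training script.

```python
import torch
from torch.optim import AdamW

# Only the LoRA adapter weights should be trainable here
# (assumes the adapter was loaded with is_trainable=True).
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-5)  # lr is an assumption

prompt = "### 명언: 노력은 배신하지 않는다.\n### 비꼬는 답변:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[1]

# 1. Sample a response from the current policy.
generated = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response_text = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)

# 2. Score the sampled response with the (hypothetical) heuristic reward.
reward = heuristic_reward(response_text)

# 3. Log-probability of the sampled response tokens under the current policy.
logits = model(generated).logits[:, :-1, :]
log_probs = torch.log_softmax(logits.float(), dim=-1)
token_log_probs = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
response_log_prob = token_log_probs[:, prompt_len - 1:].sum()

# 4. REINFORCE objective: maximize reward * log-prob, i.e. minimize its negative.
loss = -reward * response_log_prob
loss.backward()
optimizer.step()
optimizer.zero_grad()
```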
Project Information
- Team project for a reinforcement learning course at Chungnam National University
- Comparative study of PPO vs DPO