Instructions to use athirdpath/Llama-3-15b-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use athirdpath/Llama-3-15b-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="athirdpath/Llama-3-15b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("athirdpath/Llama-3-15b-Instruct")
model = AutoModelForCausalLM.from_pretrained("athirdpath/Llama-3-15b-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
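
Before loading the model directly, it can help to check whether the weights will fit in memory. The sketch below is a rough estimate only: it assumes a nominal 15 billion parameters (taken from the model name, not measured) and counts weight storage alone, ignoring activations and the KV cache. The helper name `weights_gib` is hypothetical.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weights_gib(n_params, dtype):
    """Approximate size of the weights alone, in GiB (no activations, no KV cache)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: 15e9 is the nominal size implied by "15b" in the model name.
NOMINAL_PARAMS = 15e9

for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype}: about {weights_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the weights need roughly twice as much memory in float32 as in bfloat16, so the dtype chosen at load time decides whether the model fits on a given GPU.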

Local Apps

vLLM

How to use athirdpath/Llama-3-15b-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "athirdpath/Llama-3-15b-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "athirdpath/Llama-3-15b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
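
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style response shape (`choices[0].message.content`). The function names `build_chat_request` and `chat` are hypothetical helpers, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "athirdpath/Llama-3-15b-Instruct"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text (needs a running server)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the server answers in the OpenAI chat completions format.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the reply as a string. Passing `base_url="http://localhost:30000"` points the same helper at a server on a different port.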

Use Docker

docker model run hf.co/athirdpath/Llama-3-15b-Instruct

SGLang

How to use athirdpath/Llama-3-15b-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "athirdpath/Llama-3-15b-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "athirdpath/Llama-3-15b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "athirdpath/Llama-3-15b-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "athirdpath/Llama-3-15b-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use athirdpath/Llama-3-15b-Instruct with Docker Model Runner:

```
docker model run hf.co/athirdpath/Llama-3-15b-Instruct
```

Does the model actually work?

by rombodawg - opened May 4, 2024

Discussion

rombodawg

May 4, 2024

Does it even inference correctly?
I believe that even models meant to be further finetuned after merging like this are effective if they properly inference after the merge like my model here:
https://huggingface.co/Replete-AI/Llama-3-11.5B-Instruct-V2

athirdpath

Owner May 4, 2024

There are a few post-merge finetunes on my page already from the last few days. I'd suggest using this for new finetunes and using those for inference.

athirdpath/Llama-3-15b-Instruct-GLUED for most purposes, athirdpath/Llama-3-15b-Instruct-GLUED-Plus for eRP

rombodawg

May 4, 2024 • edited May 4, 2024

@athirdpath Those models seem like they are already trained. You are missing the point. What I'm saying is the model would perform much better after training if it can be inferenced immediately after the passthrough merge without issues. Meaning it retains much of its intelligence and would be a much more successful finetune. The model I linked can fully be used without the need for finetuning even though it went through passthrough, meaning that after finetuning, it would perform much better than even this 15b model despite my model being 11.5b size.
At least that is my working theory

rombodawg

May 4, 2024

So if you can figure out how to make a 15b parameter passthrough merge that can be properly inferenced without loss after passthrough, then the finetune afterwards would be much more successful.

athirdpath

Owner May 4, 2024

Ah, I understand.

This does work, but like the early stages of my Iambe models, it has a few oddities. I arranged the layers such that there's no way it could ever maintain all of its smarts, but that's not my goal with this line.

These 15b models exist to test a theory of mine, that by manipulating mostly the 2/5 and 3/5 of the model, you can adjust how it analyzes narratives and chats.

My 11b model does however work great out of the box, it can even do math. All my tests with expanding L3 the classical way to >11b didn't demonstrate any performance improvement but had higher compute costs.
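
For readers unfamiliar with the term, a passthrough merge stacks copies of layer ranges from a base model into a deeper model without averaging any weights. The sketch below only illustrates that bookkeeping and a parameter estimate. The shape constants are the published Llama-3-8B config values; the slice lists are hypothetical examples chosen to reach 48 and 64 layers, and are not the recipes used for any model in this thread.

```python
# Llama-3-8B shape values (from its published config).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
VOCAB = 128256         # vocabulary size
BASE_LAYERS = 32       # decoder layers in the base model


def stack_layers(slices):
    """Return the base-model layer index used at each position of the merged model."""
    order = []
    for start, end in slices:
        if not 0 <= start < end <= BASE_LAYERS:
            raise ValueError(f"invalid slice ({start}, {end})")
        order.extend(range(start, end))
    return order


def estimate_params(n_layers):
    """Estimate total parameters for a Llama-3-8B-shaped model with n_layers layers."""
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o and k, v projections
    mlp = 3 * HIDDEN * INTERMEDIATE                        # gate, up, down projections
    norms = 2 * HIDDEN                                     # two RMSNorm weights per layer
    non_layer = 2 * VOCAB * HIDDEN + HIDDEN                # embeddings, lm_head, final norm
    return n_layers * (attention + mlp + norms) + non_layer


# Hypothetical slice lists, for illustration only.
example_48 = stack_layers([(0, 24), (8, 32)])
example_64 = stack_layers([(0, 24), (8, 24), (8, 32)])

for name, order in (("48-layer", example_48), ("64-layer", example_64)):
    print(f"{name}: {len(order)} layers, ~{estimate_params(len(order)) / 1e9:.1f}B params")
```

Under these assumptions, 48 and 64 layers come out near 11.5B and 15.0B parameters. That is consistent with the model names discussed above, but the thread does not confirm the actual layer counts or which ranges were repeated.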
