Does the model actually work?
Does it even inference correctly?
I believe that even models meant to be finetuned further after a merge like this are effective if they inference properly right after the merge, like my model here:
https://huggingface.co/Replete-AI/Llama-3-11.5B-Instruct-V2
There are already a few post-merge finetunes on my page from the last few days. I'd suggest using this model for new finetunes and those for inference.
athirdpath/Llama-3-15b-Instruct-GLUED for most purposes, athirdpath/Llama-3-15b-Instruct-GLUED-Plus for eRP
@athirdpath Those models seem like they are already trained. You are missing the point. What I'm saying is that the model would perform much better after training if it can be inferenced immediately after the passthrough merge without issues, because that means it retains much of its intelligence and would make for a much more successful finetune. The model I linked can be used fully without any finetuning even though it went through a passthrough merge, so after finetuning it should perform much better than even this 15b model, despite being only 11.5b.
At least that is my working theory.
So if you can figure out how to make a 15b parameter passthrough merge that can be properly inferenced without loss after passthrough, then the finetune afterwards would be much more successful.
Ah, I understand.
This does work, but like the early stages of my Iambe models, it has a few oddities. I arranged the layers such that there's no way it could ever maintain all of its smarts, but that's not my goal with this line.
These 15b models exist to test a theory of mine: that by manipulating mostly the second and third fifths of the model's layers, you can adjust how it analyzes narratives and chats.
My 11b model does, however, work great out of the box; it can even do math. All my tests with expanding L3 the classical way to >11b didn't demonstrate any performance improvement, but they did have higher compute costs.
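For anyone following along, here is a rough sketch of where sizes like 11.5b and 15b come from when you stack duplicated Llama 3 8B layers in a passthrough merge. The dimensions are the published Llama 3 8B config values. The slice plans are purely hypothetical and are not the recipes used for either model in this thread; `LlamaDims`, `params_per_layer`, and `merged_params` are illustrative names, and slices are treated as half-open `[start, end)` ranges of source layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlamaDims:
    # Llama 3 8B configuration values.
    hidden: int = 4096
    intermediate: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab: int = 128256
    n_layers: int = 32


def params_per_layer(d: LlamaDims) -> int:
    """Parameter count of one decoder layer (grouped-query attention + gated MLP)."""
    head_dim = d.hidden // d.n_heads
    kv_dim = head_dim * d.n_kv_heads
    attn = 2 * d.hidden * d.hidden + 2 * d.hidden * kv_dim  # q, o + k, v
    mlp = 3 * d.hidden * d.intermediate                     # gate, up, down
    norms = 2 * d.hidden                                    # two RMSNorm weights
    return attn + mlp + norms


def merged_params(slices, d: LlamaDims = LlamaDims()):
    """Return (layer_count, total_params) for a stack of [start, end) layer slices."""
    for start, end in slices:
        if not 0 <= start < end <= d.n_layers:
            raise ValueError(f"bad slice ({start}, {end})")
    n_layers = sum(end - start for start, end in slices)
    # Input embeddings and output head are untied, plus the final norm.
    shared = 2 * d.vocab * d.hidden + d.hidden
    return n_layers, n_layers * params_per_layer(d) + shared


if __name__ == "__main__":
    plans = {
        "base 8B": [(0, 32)],
        "hypothetical 48-layer stack": [(0, 24), (8, 32)],
        "hypothetical 64-layer stack": [(0, 16), (8, 24), (8, 24), (16, 32)],
    }
    for name, slices in plans.items():
        layers, total = merged_params(slices)
        print(f"{name}: {layers} layers, {total / 1e9:.2f}B params")
```

Under these assumptions, 48 stacked layers land at about 11.5B parameters and 64 layers at about 15B. Every duplicated layer adds roughly 218M parameters that run on each forward pass, which is where the higher compute cost comes from regardless of whether quality improves.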