Instructions to use meta-llama/Meta-Llama-3-8B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use meta-llama/Meta-Llama-3-8B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use meta-llama/Meta-Llama-3-8B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meta-llama/Meta-Llama-3-8B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Meta-Llama-3-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/meta-llama/Meta-Llama-3-8B-Instruct

SGLang

How to use meta-llama/Meta-Llama-3-8B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meta-llama/Meta-Llama-3-8B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Meta-Llama-3-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meta-llama/Meta-Llama-3-8B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meta-llama/Meta-Llama-3-8B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Warning: The attention mask and the pad token id were not set..

#40

by Stephen-SMJ - opened Apr 20, 2024

Discussion

Stephen-SMJ

Apr 20, 2024

Hi, when I run inference with llama3-8b-instruct, I get a warning:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.

Although it doesn't impact the result, I still want to fix the warning. Any idea?
Thanks for answering.

rumAndRegression

Apr 20, 2024

When calling model.generate, setting pad_token_id=tokenizer.eos_token_id seems to remove the warning.

https://stackoverflow.com/questions/69609401/suppress-huggingface-logging-warning-setting-pad-token-id-to-eos-token-id

It still seems to work fine with that, although some explanation and reasoning wouldn't hurt. Maybe I saw one when searching for the solution, but I forget. :)
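
On the reasoning: the sketch below is a plain-Python illustration, not the actual transformers implementation, of the kind of heuristic a library can use to guess an attention mask from token ids alone. The function name infer_attention_mask and the toy token ids are assumptions made for illustration; only the id 128001 comes from the warning text above. Padding can be recognised by its id only when that id is not also the end-of-sequence id. When the two are the same, the only safe fallback is to attend to every position, which is wrong for padded batches.

```python
def infer_attention_mask(input_ids, pad_token_id, eos_token_id):
    """Guess an attention mask (1 = attend, 0 = ignore) from token ids alone."""
    # The guess is only trustworthy when padding has its own, distinct id.
    can_infer = pad_token_id is not None and pad_token_id != eos_token_id
    if can_infer:
        return [[int(tok != pad_token_id) for tok in row] for row in input_ids]
    # Ambiguous case: a pad id that equals the eos id could be padding or a
    # real end-of-sequence token, so fall back to attending to everything.
    return [[1] * len(row) for row in input_ids]


EOS = 128001  # eos id reported in the warning text

# Distinct pad id (0 here, a made-up value): padding is masked out correctly.
distinct = infer_attention_mask([[0, 0, 5, 6], [7, 8, 9, 10]],
                                pad_token_id=0, eos_token_id=EOS)

# pad id == eos id: the left padding in the first row is NOT masked out.
shared = infer_attention_mask([[EOS, EOS, 5, 6], [7, 8, 9, EOS]],
                              pad_token_id=EOS, eos_token_id=EOS)

print(distinct)
print(shared)
```

For a single unpadded prompt the all-ones fallback happens to be the right mask, which would explain why the results look unaffected. Passing pad_token_id to generate only addresses the "Setting pad_token_id to eos_token_id" message; passing the tokenizer's own attention_mask removes the need to guess at all.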

LAKSERS

Apr 21, 2024

Just change the call to model.generate(**encoded_input, pad_token_id=tokenizer.eos_token_id) and it will be OK.

nnilayy

Aug 26, 2024

edited Aug 26, 2024

Hey there, I was also facing the same warning, and after tinkering with the tokenizer for a bit I found that inputs = tokenizer.encode(prompt) returns just the input_ids but not the attention mask, whereas inputs = tokenizer(prompt) returns both the input_ids and the attention mask.

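So if you replace this code

inputs = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(inputs)

with this

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs)

the warning goes away. (return_tensors="pt" is needed in both versions so that generate receives PyTorch tensors rather than plain Python lists.) Hope that helps 😊.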
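
Putting that together, a small helper might look like the sketch below. The name generate_with_mask and its max_new_tokens default are hypothetical; model and tokenizer are assumed to be an already-loaded AutoModelForCausalLM and AutoTokenizer. It also passes pad_token_id explicitly, as suggested earlier in the thread, so that neither half of the warning should be triggered.

```python
def generate_with_mask(model, tokenizer, prompt, max_new_tokens=40):
    """Generate a continuation while passing the attention mask explicitly."""
    # tokenizer(...) returns both input_ids and attention_mask, whereas
    # tokenizer.encode(...) would return only the ids.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output_ids = model.generate(
        **inputs,  # unpacks input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # avoids the pad-token message
    )

    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:],
                            skip_special_tokens=True)
```

It would be called as generate_with_mask(model, tokenizer, "Who are you?") once the model and tokenizer are loaded.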

Stephen-SMJ changed discussion status to closed Sep 14, 2024

ernestyalumni

Oct 3, 2024

edited Oct 3, 2024

I had the same warning as well. It took quite a bit of reading through the Hugging Face transformers code, but I was able to come to a solution.
Original warning: The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.

If you use apply_chat_template() with your tokenizer (an instance of PreTrainedTokenizer, for instance), set return_dict=True, e.g.

return_output = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True).to(model.device)

I call to(model.device) at the end to move the tensors (PyTorch tensors, since return_tensors="pt") onto the model's device (e.g. "cuda").

return_output is a dict-like object with 2 keys: "input_ids" and "attention_mask". Pass both when you run generate(..), for example:

        return model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=generation_configuration.max_new_tokens,
            do_sample=do_sample,
            top_k=generation_configuration.top_k,
            top_p=generation_configuration.top_p,
            temperature=generation_configuration.temperature,
            eos_token_id=eos_token_id,
            streamer=streamer)

and that gets called in my function:

        with torch.no_grad():
            generate_output = run_model_generate(
                input_ids=return_output["input_ids"],
                model=model,
                streamer=streamer,
                eos_token_id=generation_configuration.eos_token_id,
                generation_configuration=generation_configuration,
                attention_mask=return_output["attention_mask"])

My code here: https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreTransformers/executable_scripts/terminal_only_infinite_loop_instruct.py (I'm trying to build out my own library so I'm calling a number of wrappers I made)
