Instructions to use mistralai/Mistral-7B-Instruct-v0.2 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use mistralai/Mistral-7B-Instruct-v0.2 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use mistralai/Mistral-7B-Instruct-v0.2 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Install mistral-common:
pip install --upgrade mistral-common
# Start the vLLM server:
vllm serve "mistralai/Mistral-7B-Instruct-v0.2" --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
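
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are illustrative and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(user_text,
                       base_url="http://localhost:8000",
                       model="mistralai/Mistral-7B-Instruct-v0.2"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   request = build_chat_request("What is the capital of France?")
#   print(send_chat_request(request))
```

Passing `base_url="http://localhost:30000"` would point the same helper at a server on a different port.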

Use Docker

docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2

SGLang

How to use mistralai/Mistral-7B-Instruct-v0.2 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mistralai/Mistral-7B-Instruct-v0.2" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mistralai/Mistral-7B-Instruct-v0.2",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mistralai/Mistral-7B-Instruct-v0.2 with Docker Model Runner:
```
docker model run hf.co/mistralai/Mistral-7B-Instruct-v0.2
```

Mistral-7B-Instruct-v0.2 loopy text generation with custom chat template

#68

by ercanucan - opened Mar 18, 2024

Discussion

ercanucan

Mar 18, 2024

edited Mar 18, 2024

Dear community,

We are using Mistral-7B-Instruct-v0.2 (off the shelf, no fine-tuning) with a chat template so that it accepts system prompts (as user input) from various GUI chat clients such as ChatBox (https://github.com/Bin-Huang/chatbox) and ChatGPT-lite (https://github.com/blrchen/chatgpt-lite).

Problem:

The issue we observe with this chat template together with Mistral is the following: often (after a couple of chat turns), Mistral starts generating repetitive responses and goes on for a very long time, as if it does not know when to stop. It tends to show this behavior especially on basic questions such as "hi", "who are you?", and "tell me about yourself". Has anyone experienced such behavior? Do you spot any potential issue with the template we are using? Any hints here would be highly appreciated!

The chat template is below. It is the same as https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja, adapted only to remove the redundant newlines:

{% if messages[0]['role'] == 'system' -%}
    {% set loop_messages = messages[1:] -%}
    {% set system_message = messages[0]['content'].strip() + '\n\n' -%}
{% else -%}
    {% set loop_messages = messages -%}
    {% set system_message = '' -%}
{% endif -%}
{{ bos_token -}}
{% for message in loop_messages -%}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif -%}
    {% if loop.index0 == 0 -%}
        {% set content = system_message + message['content'] -%}
    {% else -%}
        {% set content = message['content'] -%}
    {% endif -%}
    {% if message['role'] == 'user' -%}
        {{ '[INST] ' + content.strip() + ' [/INST]' -}}
    {% elif message['role'] == 'assistant' -%}
        {{ ' '  + content.strip() + ' ' + eos_token -}}
    {% endif -%}
{% endfor -%}
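
To check exactly which string a template like the one above feeds to the model, it can help to mirror its logic in plain Python and print the result. The sketch below is such a mirror, not the template itself. The function name `render_mistral_prompt` is hypothetical, and `<s>` and `</s>` are assumed as the BOS and EOS strings.

```python
def render_mistral_prompt(messages, bos_token="<s>", eos_token="</s>"):
    """Plain-Python mirror of the Jinja chat template above (illustrative only)."""
    # A leading system message is folded into the first user turn.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip() + "\n\n"
        loop_messages = messages[1:]
    else:
        system_message = ""
        loop_messages = messages

    parts = [bos_token]
    for index, message in enumerate(loop_messages):
        # Even positions must be user turns, odd positions assistant turns.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        content = message["content"]
        if index == 0:
            content = system_message + content
        if message["role"] == "user":
            parts.append("[INST] " + content.strip() + " [/INST]")
        elif message["role"] == "assistant":
            # Same as the template: a space is emitted before the EOS token.
            parts.append(" " + content.strip() + " " + eos_token)
    return "".join(parts)


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "who are you?"},
]
print(repr(render_mistral_prompt(chat)))
# → '<s>[INST] Be brief.\n\nhi [/INST] Hello! </s>[INST] who are you? [/INST]'
```

Comparing this string with the output of the template bundled with the tokenizer for the same messages shows any whitespace differences around `[/INST]` and the EOS token. Whether such a difference explains the looping is not established here.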

Many thanks for your help!

OPPEYRADY

Apr 3, 2024

edited Apr 5, 2024

Similar issues with a fine-tuned model as well: it will sometimes work, but other times it generates massive amounts of information until it hits max tokens.

UPDATE: Ran a test, Instruct v1 does not seem to have this issue.

Goldenblood56

Apr 4, 2024

Right, I've run into this. It feels like it got worse with Ooba 2-3 months ago, but maybe that's just me? I think a certain update made this issue worse.

GuMMYY

Apr 12, 2024

Sliding window attention matters

> Similar issues with a fine-tuned model as well: it will sometimes work, but other times it generates massive amounts of information until it hits max tokens.
>
> UPDATE: Ran a test, Instruct v1 does not seem to have this issue.

Goldenblood56

May 1, 2024

> Sliding window attention matters
>
> > Similar issues with a fine-tuned model as well: it will sometimes work, but other times it generates massive amounts of information until it hits max tokens.
> >
> > UPDATE: Ran a test, Instruct v1 does not seem to have this issue.

What do you mean? Do I want it on or off? How do I tell? How do I change it? Is it baked into the quant, or is it a setting? Thanks.
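
On the "how do I tell?" part: for a Transformers-format checkpoint, the value is stored in the model's `config.json` under the `sliding_window` key, which `AutoConfig.from_pretrained(...)` exposes as `config.sliding_window`. The sketch below reads that key from an already-loaded config dictionary. The helper name `describe_sliding_window` and the two sample configs are hypothetical, and the sketch says nothing about how a particular quantized file or runtime treats the value.

```python
import json


def describe_sliding_window(config):
    """Summarize the `sliding_window` entry of a Transformers-style config dict."""
    window = config.get("sliding_window")
    if window is None:
        return "no sliding window configured"
    return f"sliding window of {window} tokens"


# Hypothetical config fragments, for illustration only.
with_window = json.loads('{"sliding_window": 4096}')
without_window = json.loads('{"sliding_window": null}')

print(describe_sliding_window(with_window))     # sliding window of 4096 tokens
print(describe_sliding_window(without_window))  # no sliding window configured
```

For a real checkpoint, the dictionary would come from `json.load` on the downloaded `config.json` file.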
