Mistral-7B-Instruct-v0.2 loopy text generation with custom chat template
Dear community,
We are using Mistral-7B-Instruct-v0.2 (off-the-shelf, no fine-tuning etc.) with a custom chat template so that it can accept system prompts (folded into the first user message) from various GUI chat clients such as ChatBox (https://github.com/Bin-Huang/chatbox) and ChatGPT-lite (https://github.com/blrchen/chatgpt-lite).
Problem:
The issue we observe with this chat template together with Mistral is the following: often (after a couple of chat turns), Mistral starts generating repetitive responses and goes on for a very long time, as if it does not know when to stop. It tends to show this behavior especially on basic questions such as "hi", "who are you?", and "tell me about yourself". Has anyone experienced such behavior? Do you spot any potential issue with the template we are using? Any hints would be highly appreciated!
The chat template is below. It is the same as https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja, adapted only to remove the redundant newlines:
{% if messages[0]['role'] == 'system' -%}
{% set loop_messages = messages[1:] -%}
{% set system_message = messages[0]['content'].strip() + '\n\n' -%}
{% else -%}
{% set loop_messages = messages -%}
{% set system_message = '' -%}
{% endif -%}
{{ bos_token -}}
{% for message in loop_messages -%}
{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif -%}
{% if loop.index0 == 0 -%}
{% set content = system_message + message['content'] -%}
{% else -%}
{% set content = message['content'] -%}
{% endif -%}
{% if message['role'] == 'user' -%}
{{ '[INST] ' + content.strip() + ' [/INST]' -}}
{% elif message['role'] == 'assistant' -%}
{{ ' ' + content.strip() + ' ' + eos_token -}}
{% endif -%}
{% endfor -%}
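For anyone who wants to inspect the exact string the model receives, here is a minimal plain-Python sketch that mirrors the logic of the template above. It is not part of the original setup: the function name `render_prompt` is hypothetical, the default `bos_token`/`eos_token` values (`<s>` and `</s>`) are assumptions, and it assumes the Jinja template renders with no stray whitespace between blocks.

```python
from typing import Dict, List


def render_prompt(
    messages: List[Dict[str, str]],
    bos_token: str = "<s>",   # assumed BOS token string
    eos_token: str = "</s>",  # assumed EOS token string
) -> str:
    """Plain-Python mirror of the chat template above (illustrative only)."""
    # Fold an optional leading system message into the first user turn.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip() + "\n\n"
        loop_messages = messages[1:]
    else:
        system_message = ""
        loop_messages = messages

    parts = [bos_token]
    for index, message in enumerate(loop_messages):
        # Roles must alternate user/assistant/..., starting with user.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        content = message["content"]
        if index == 0:
            content = system_message + content
        if message["role"] == "user":
            parts.append("[INST] " + content.strip() + " [/INST]")
        elif message["role"] == "assistant":
            # Note the space before the EOS token, as in the template.
            parts.append(" " + content.strip() + " " + eos_token)
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "who are you?"},
]
prompt = render_prompt(messages)
print(repr(prompt))  # repr() makes spaces and newlines visible
```

Printing the `repr` makes it easier to check where spaces and the EOS token land after each assistant turn, which may be worth comparing against the output of the model's stock template.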
Many thanks for your help!
Similar issues with a fine-tuned model as well: sometimes it works, but other times it generates massive walls of text until it hits max tokens.
UPDATE: Ran a test, Instruct v1 does not seem to have this issue.
Right, I've run into this. It feels like with Ooba it got worse 2-3 months ago. But maybe that's just me? I think a certain update made this issue worse.
Sliding window attention matters
What do you mean? Do I want it on or off? How do I tell? How do I change it? Is it baked into the quant or is it a setting? Thanks.