Anyone running this on an M4 Max with 128 GB? How does it compare to the 4-bit quantization?
Thanks for pushing this model. With the 4-bit quantization, which would be my standard go-to version, I have seen that MiniMax M2.1 is too large to fit in 128 GB and that it still seems to have issues with the thinking templates. Has anyone successfully run this with 128 GB of shared memory, and were there any issues? Also, what tokens/s are you getting, and with which infrastructure?
Unsloth says they made some fixes to the chat template. Their Jinja template is at https://huggingface.co/unsloth/MiniMax-M2.1/blob/main/chat_template.jinja
Could you test with their template to see whether it solves your issue, please?
This model is 100 GB in size.
You might have to type something like sudo sysctl iogpu.wired_limit_mb=117760 in the terminal to tell macOS you want to use 115 GB for the GPU memory. You could even try 122880 for 120 GB?
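For anyone who wants to try a different limit, here is a small sketch of the arithmetic behind those numbers. It assumes 1 GB = 1024 MB, which is what the two values above imply; the helper names wired_limit_mb and sysctl_command are made up for illustration and are not part of any tool. It only prints the command and does not change any system setting.

```python
def wired_limit_mb(gigabytes: int) -> int:
    """Convert a GPU memory budget in GB to MB, assuming 1 GB = 1024 MB."""
    return gigabytes * 1024


def sysctl_command(gigabytes: int) -> str:
    """Build the sysctl command string for the given budget (hypothetical helper)."""
    return f"sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(gigabytes)}"


# Print the commands for the two budgets mentioned in this thread.
for gb in (115, 120):
    print(sysctl_command(gb))
# prints:
# sudo sysctl iogpu.wired_limit_mb=117760
# sudo sysctl iogpu.wired_limit_mb=122880
```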
Thanks, I will do that once I have a decent internet connection. I am traveling at the moment and a 100 GB download won’t work 😉 I’ll report back after the test.
Perhaps retry it with the updated tokenizer_config.json from the 4-bit version. This was updated yesterday. For more information, see the discussion at https://huggingface.co/mlx-community/MiniMax-M2.1-4bit/discussions/3
Thanks, I have tried the 3-bit version with the Unsloth template and it works. I get 20 t/s at the beginning, dropping to 5 t/s for longer prompts. It’s usable but slow, at least for continued development. I’ll try the other template and report. Overall, the memory footprint is 97.6 GB after loading, and the KV cache adds another 12-13 GB after a few exchanges.
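As a rough sanity check on those figures, the sketch below adds the reported weights and KV cache sizes and compares them with the 115 GB GPU limit suggested earlier in the thread. It assumes all figures use the same GB unit and ignores any other memory the system or runtime needs; headroom_gb is a hypothetical helper, not a measurement tool.

```python
def headroom_gb(limit_gb: float, weights_gb: float, kv_cache_gb: float) -> float:
    """Return the memory left under the GPU limit after weights and KV cache."""
    return limit_gb - (weights_gb + kv_cache_gb)


LIMIT_GB = 115.0    # from iogpu.wired_limit_mb=117760, as suggested above
WEIGHTS_GB = 97.6   # reported footprint after loading

# The reported KV cache range after a few exchanges is 12-13 GB.
for kv in (12.0, 13.0):
    left = headroom_gb(LIMIT_GB, WEIGHTS_GB, kv)
    print(f"KV cache {kv:.0f} GB -> headroom {left:.1f} GB")
# prints:
# KV cache 12 GB -> headroom 5.4 GB
# KV cache 13 GB -> headroom 4.4 GB
```

By this estimate only about 4-5 GB is left under a 115 GB limit, so longer conversations could get tight.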
Very good.
Thank you for the feedback!