How to use Irfanuruchi/gemma-3-4b-it-mlx-5bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Irfanuruchi/gemma-3-4b-it-mlx-5bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
How to use Irfanuruchi/gemma-3-4b-it-mlx-5bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "Irfanuruchi/gemma-3-4b-it-mlx-5bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server on port 8000 (mlx_lm.server otherwise defaults to port 8080)
mlx_lm.server --model "Irfanuruchi/gemma-3-4b-it-mlx-5bit" --port 8000

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Irfanuruchi/gemma-3-4b-it-mlx-5bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
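The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server is reachable at the address used in the curl example and that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`). The helper names `build_request` and `chat` are illustrative and not part of MLX LM.

```python
import json
import urllib.request

MODEL_ID = "Irfanuruchi/gemma-3-4b-it-mlx-5bit"


def build_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build the POST request for /v1/chat/completions (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the reply text.

    Assumes an OpenAI-style response body; adjust if the server differs.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Pass `base_url` if the server listens on a different host or port.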
Gemma 3 4B Instruct – MLX 5-bit (Apple Silicon)
This repository provides a 5-bit quantized MLX version of Gemma 3 4B Instruct, optimized for efficient local inference on Apple Silicon (M1–M5).
Highlights
- 5-bit quantization (intended to retain more quality than standard 4-bit, at a somewhat larger size)
- Fast inference on Apple Silicon (MLX backend)
- Good reasoning and instruction-following
- Low memory usage (~2.7 GB peak)
Performance (M3 Pro, 18 GB RAM)
- Generation speed: ~46 tokens/sec
- Peak memory: ~2.7 GB
- Model size: ~2.5 GB
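As a back-of-the-envelope check on the model-size figure, the sketch below multiplies a nominal parameter count by the bit width. The 4 billion parameter count is an assumption taken from the "4B" in the model name, not a measured value, and the estimate ignores quantization scales and any layers kept at higher precision.

```python
def nominal_size_gb(num_params, bits_per_weight):
    """Nominal weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed: ~4e9 parameters at 5 bits each
print(nominal_size_gb(4e9, 5))  # 2.5
```

The same function gives 2.0 GB for 4-bit weights, which illustrates the size cost of the extra bit.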
Usage
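Install MLX and MLX LM
pip install mlx mlx-lm
Generate text from the command line (this assumes the model has been downloaded to ./gemma-3-4b-it-mlx-5bit; the Hub ID Irfanuruchi/gemma-3-4b-it-mlx-5bit can be passed to --model instead)
mlx_lm.generate \
  --model ./gemma-3-4b-it-mlx-5bit \
  --prompt "Explain HVAC airflow calculation in simple terms."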
Example
Input
A room needs 12000 BTU/h cooling. If a system uses about 400 CFM per ton, estimate the airflow needed.
Expected reasoning:
12000 BTU/h = 1 ton → airflow ≈ 400 CFM
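The expected reasoning above can be checked with a few lines of arithmetic. This is a small sketch: the function name `airflow_cfm` is illustrative, and the 400 CFM per ton figure is the rule of thumb stated in the prompt.

```python
BTU_PER_HOUR_PER_TON = 12_000  # 1 ton of cooling = 12,000 BTU/h


def airflow_cfm(cooling_btu_per_hour, cfm_per_ton=400):
    """Estimate airflow in CFM from a cooling load in BTU/h."""
    tons = cooling_btu_per_hour / BTU_PER_HOUR_PER_TON
    return tons * cfm_per_ton


print(airflow_cfm(12_000))  # 400.0
```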
License and Attribution
This model is a derivative work based on:
Google Gemma 3 4B Instruct
Original model: https://huggingface.co/google/gemma-3-4b-it
License: Gemma Terms of Use (https://ai.google.dev/gemma/terms)
Modifications
This repository includes the following modifications:
- Converted to MLX format
- Quantized to 5-bit precision
- Optimized for Apple Silicon inference
Notice
Gemma is provided under and subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms
Disclaimer
This is an independently modified version of the original model. Google is not responsible for this version or its outputs.
Credits
- Google, for the Gemma model
- MLX team, for the Apple Silicon inference framework