load_in_8bit fine-tuning requires more memory than this notebook
I found this example and was using it before I learned about load_in_8bit. It worked, and I was able to fine-tune the model on Colab.
After fine-tuning and calling save_pretrained, I realised that I could not load the fine-tuned model in another notebook with from_pretrained because of version issues with PyTorch and transformers.
I've been trying to fine-tune with load_in_8bit instead, but it fills the GPU memory and crashes as soon as the training loop starts.
What's the difference between this notebook and load_in_8bit?
Is it LoRA, and how could this be implemented with load_in_8bit?
Thanks
TL;DR
- load_in_8bit runs the forward pass faster, especially for small batches, because it computes directly with the quantized weights. This implementation is slower because it has to de-quantize the weights first.
- load_in_8bit currently requires a Turing GPU or newer (e.g. a Colab T4 or a 2080 is fine; a Colab K80 or a 1080 Ti is not). This implementation works with any GPU or on CPU.
- load_in_8bit currently supports only the forward pass, i.e. no fine-tuning, BUT they are working on a LoRA implementation there and will post an update in a few weeks.
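To make the hardware point concrete, here is a minimal sketch that checks whether a GPU is "Turing or newer" by its CUDA compute capability. Turing cards such as the T4 and the 2080 report capability 7.5. The function name `supports_load_in_8bit` and the lookup table are hypothetical and only for illustration; they are not part of transformers or bitsandbytes.

```python
# Compute capability of Turing cards (e.g. Tesla T4, RTX 2080).
TURING = (7, 5)


def supports_load_in_8bit(capability):
    """Return True if a (major, minor) compute capability is Turing or newer.

    Hypothetical helper: it only encodes the "Turing or newer" rule from the
    list above and does not query any library.
    """
    return tuple(capability) >= TURING


# Compute capabilities of the cards mentioned above.
known_gpus = {
    "Tesla K80": (3, 7),
    "GTX 1080 Ti": (6, 1),
    "Tesla T4": (7, 5),
    "RTX 2080": (7, 5),
}

for name, capability in known_gpus.items():
    print(f"{name}: {supports_load_in_8bit(capability)}")
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns the same kind of `(major, minor)` tuple, so its result can be passed straight to the helper.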
> Is it LoRA, and how could this be implemented with load_in_8bit?
Currently, it requires some coding:
- please install the latest bitsandbytes (i.e. this week's version)
- write a LoRA wrapper around bnb.nn.Linear8bitLt
-- in this wrapper, make sure you pass has_fp16_weights=True and memory_efficient_backward=True (see example test)
- use your wrapped layer instead of standard bnb.nn.Linear8bitLt
Or wait for a couple of weeks till bnb and HF guys do that for you ;)
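For anyone who wants to try the coding route, the wrapper from the steps above could look roughly like the sketch below. It is an illustration, not the bitsandbytes or HF implementation: the class name `LoRAWrapper`, its arguments, and the default rank and alpha are all assumptions, and a plain `nn.Linear` stands in for `bnb.nn.Linear8bitLt` so that the snippet runs on CPU. Constructing the 8-bit layer itself and matching the adapter dtype to fp16 inputs are left out.

```python
import math

import torch
import torch.nn as nn


class LoRAWrapper(nn.Module):
    """Adds a trainable low-rank update to a frozen linear-like layer.

    Hypothetical sketch: `base` is any module mapping (..., in_features)
    to (..., out_features), e.g. a quantized linear layer.
    """

    def __init__(self, base, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        # Freeze the wrapped layer; only the adapter matrices are trained.
        for param in self.base.parameters():
            param.requires_grad_(False)
        # Low-rank factors: A is randomly initialised, B starts at zero,
        # so the wrapper initially reproduces the base layer exactly.
        self.lora_a = nn.Parameter(torch.empty(rank, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update x A^T B^T.
        update = (x @ self.lora_a.t()) @ self.lora_b.t()
        return self.base(x) + self.scaling * update


# Stand-in for the quantized layer so the sketch runs without a GPU.
base = nn.Linear(16, 32)
layer = LoRAWrapper(base, in_features=16, out_features=32, rank=4)

x = torch.randn(2, 16)
out = layer(x)
out.sum().backward()  # gradients reach only lora_a and lora_b
```

Because only `lora_a` and `lora_b` require gradients, the optimizer state covers just the small adapter matrices, which is where the memory saving during fine-tuning comes from.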