load_in_8bit fine-tuning requires more memory than this notebook

by petermills - opened Sep 23, 2022

I found this example and was using it before I learned about load_in_8bit. It worked, and I was able to fine-tune the model on Colab.
After fine-tuning and calling save_pretrained, I realised that I could not load the fine-tuned model in another notebook with from_pretrained; it turned out there were version issues with pytorch and transformers.
I've been trying to fine-tune with load_in_8bit instead, but it fills the GPU memory and crashes as soon as the training loop starts.
What's the difference between this notebook and load_in_8bit?
Is it LoRA, and how could this be implemented with load_in_8bit?

Thanks

justheuristic

hivemind org Sep 26, 2022

edited Sep 26, 2022

TL;DR

load_in_8bit runs the forward pass faster, especially for small batches || this implementation is slower because it has to de-quantize the weights first, while load_in_8bit runs the forward pass directly on the quantized weights
load_in_8bit currently requires Turing GPUs or newer (e.g. Colab T4 or RTX 2080 are fine; Colab K80 or GTX 1080 Ti are not) || this implementation works with any GPU or CPU
load_in_8bit currently supports only the forward pass, i.e. no fine-tuning, BUT the bnb and HF developers are working on a LoRA implementation there and will post an update in a few weeks.
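
To make the first point concrete, here is a minimal sketch of a "de-quantize, then multiply" forward pass. It assumes a simple per-row absmax quantization that I made up for illustration; it is not the exact scheme used by this notebook or by load_in_8bit, and the function names are hypothetical.

```python
import torch

def quantize_absmax(weight):
    # Assumption: one scale per output row, int8 codes in [-127, 127].
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    codes = torch.round(weight / scale).to(torch.int8)
    return codes, scale

def dequantize(codes, scale):
    # Rebuild an approximate full-precision weight from the int8 codes.
    return codes.to(scale.dtype) * scale

def forward_with_dequantization(x, codes, scale):
    # The weight is rebuilt on every call; this is the extra cost
    # compared with multiplying by the quantized weights directly.
    weight = dequantize(codes, scale)
    return x @ weight.t()

torch.manual_seed(0)
weight = torch.randn(16, 32)   # toy layer: 32 inputs, 16 outputs
x = torch.randn(4, 32)         # small batch

codes, scale = quantize_absmax(weight)
y_ref = x @ weight.t()
y_q = forward_with_dequantization(x, codes, scale)
max_error = (y_ref - y_q).abs().max().item()
print(f"max abs error after the 8-bit round trip: {max_error:.4f}")
```

Only the int8 codes and the scales need to be stored, which is where the memory saving comes from; the price is the dequantize call on every forward pass.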

Is it LoRA, and how could this be implemented with load_in_8bit?

Currently, it requires some coding:

install the latest bitsandbytes (i.e. this week's version, late September 2022)
write a LoRA wrapper around bnb.nn.Linear8bitLt
-- in this wrapper, make sure you pass has_fp16_weights=False and memory_efficient_backward=True, so the frozen weights stay in int8 while gradients still flow back to the inputs (see example test)
use your wrapped layer instead of standard bnb.nn.Linear8bitLt

Or wait for a couple of weeks till bnb and HF guys do that for you ;)
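
If you want to try it yourself, the wrapper from the steps above could look roughly like this. This is only a sketch: LoRAWrapper, rank and alpha are names I made up, and a plain nn.Linear stands in for bnb.nn.Linear8bitLt so that it runs without a GPU. With the real 8-bit layer you would construct it with the flags mentioned above and keep the adapter weights in the same dtype as the activations.

```python
import math
import torch
from torch import nn

class LoRAWrapper(nn.Module):
    """Hypothetical wrapper: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base_layer, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base_layer = base_layer
        # Freeze the base weights; only the adapters are trained.
        for param in self.base_layer.parameters():
            param.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        # Zero init, so the wrapped layer starts out identical to the base layer.
        nn.init.zeros_(self.lora_b.weight)
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base_layer(x) + self.lora_b(self.lora_a(x)) * self.scaling

torch.manual_seed(0)
base = nn.Linear(64, 32)   # stand-in for bnb.nn.Linear8bitLt
layer = LoRAWrapper(base, in_features=64, out_features=32, rank=4)

x = torch.randn(2, 64)
out = layer(x)
out.sum().backward()

trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(trainable)
```

The optimizer then only needs the adapter parameters, e.g. torch.optim.Adam(p for p in layer.parameters() if p.requires_grad), so the optimizer state stays small compared with full fine-tuning.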
