How did you get it to train with Axolotl on a Pro 6000?

#1 opened by OwenArli

I can't seem to avoid OOM errors even on 2x Pro 6000 when trying to train with Axolotl. Curious how this model was trained, if you're willing to share. Thanks!

My dataset is about 2 million tokens in total. With these settings, training peaks at about 78GB of VRAM.

```yaml
base_model: zai-org/GLM-4.5-Air
load_in_4bit: true
load_in_8bit: false
bnb_4bit_use_double_quant: false
qlora_sharded_model_loading: false

datasets:
  - path: /dataset.json
    type: alpaca

val_set_size: 0.1
output_dir: ./outputs/lora-out
adapter: qlora
lora_model_dir:
sequence_len: 16384
sample_packing: false
eval_sample_packing: false
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_clipping: 1.0
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_8bit
lr_scheduler: constant
learning_rate: 0.000008
bf16: auto
tf32: false
gradient_checkpointing: true
activation_offloading: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: false
sdp_attention: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
dtype: bfloat16
low_cpu_mem_usage: true

special_tokens:
  pad_token: "<|end_of_text|>"

save_first_step: true
```
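For anyone copying this config: with Axolotl it's typically launched with `accelerate launch -m axolotl.cli.train config.yml` (or `axolotl train config.yml` on newer versions); the exact invocation may vary depending on your Axolotl version.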

Oh nice! Thanks for sharing. I eventually figured out a config that works for me too: I can apparently train at 32K context with two cards and DeepSpeed ZeRO-2.
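For anyone curious what that kind of setup looks like, here is a minimal sketch of the changes on top of the config above, assuming the stock `deepspeed_configs/zero2.json` that ships with Axolotl; the exact values (32K sequence length, batch sizes) are illustrative, not the actual settings used in this thread.

```yaml
# Sketch only: deltas on top of the QLoRA config above for a 2-GPU DeepSpeed ZeRO-2 run.
# All values here are illustrative assumptions, not the settings actually used in this thread.
sequence_len: 32768                        # 32K context as mentioned above
deepspeed: deepspeed_configs/zero2.json    # ZeRO-2 config bundled with Axolotl
micro_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
```

ZeRO-2 shards optimizer states and gradients across the two GPUs, which is usually what frees up enough memory to push the context length higher.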
