OOM During Finetuning on A100 GPU

#47

by jashton1315 - opened May 23, 2024

May 23, 2024

Hi Noelia,

Thank you very much for sharing this tool! I'm looking forward to applying it to my own projects.

I'm trying to finetune your model on a set of ~2000 sequences. However, even on a 40 GB A100 GPU with a batch size of 1, I get an OOM error.

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file asyn_train.txt --validation_file asyn_val.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir finetune --learning_rate 1e-06 --per_device_train_batch_size 1 --low_cpu_mem_usage True

RuntimeError: CUDA error: an illegal memory access was encountered

Any help you can provide would be greatly appreciated. Thanks again!
Jonathan

jashton1315

May 23, 2024

Hi Noelia,

I was actually able to fix this issue. For those running finetuning on an HPC cluster, increasing --ntasks resolved the OOM error.
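
For anyone hitting the same error, a rough back-of-the-envelope estimate can help judge whether the GPU itself is likely to be the bottleneck. The sketch below is only an approximation under stated assumptions: the parameter count (about 738 million) comes from the ProtGPT2 model card rather than from this thread, training is assumed to be full fp32 with an Adam-style optimizer, and activation memory (which grows with sequence length and batch size) is ignored. The function name is made up for illustration.

```python
def static_training_memory_gb(n_params, bytes_per_param=4, optimizer_states=2):
    """Rough memory (in GB) for weights, gradients, and optimizer state.

    Activations, CUDA context, and framework overhead are NOT included,
    so real usage will be higher than this figure.
    """
    # One copy of the weights, one of the gradients, plus the optimizer's
    # per-parameter buffers (Adam keeps two moment estimates).
    copies = 1 + 1 + optimizer_states
    return n_params * bytes_per_param * copies / 1e9


# Assumption: ~738M parameters, taken from the model card, not from this thread.
N_PARAMS = 738_000_000

if __name__ == "__main__":
    estimate = static_training_memory_gb(N_PARAMS)
    print(f"Static training memory (fp32, Adam): about {estimate:.1f} GB")
```

If the figure comes out well below the 40 GB on the card, it may be worth checking what the job scheduler actually allocated (CPU cores, host memory) before shrinking the model or the inputs, which would fit with the --ntasks change helping here.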

Thanks!
Jonathan

jashton1315 changed discussion status to closed May 23, 2024

nferruz

Owner May 28, 2024

Fantastic, happy to hear!
