Minimal number of sequences for fine tuning

by ShanGao - opened Aug 9, 2022

Aug 9, 2022

Dear Authors,

Thanks for your excellent work! I plan to fine-tune the model. Could you please advise on the minimum number of protein sequences needed for fine-tuning? Thanks a lot!

nferruz

Owner Aug 9, 2022

Hi Yuanji Zhang,

Thanks for your interest! I am afraid I do not have a rule of thumb yet. I tried to fine-tune a model with 500 sequences, and it did not work very well. However, I know someone who fine-tuned with 900 sequences; the training curves looked fine and the model produced the expected results. So I guess you will have to try :). I am happy to assist if you need any help!

Noelia
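Since the advice above comes down to trying it and watching the training curves, one way to make those curves informative on a small dataset is to hold out part of the sequences for validation. The snippet below is only a minimal sketch: the function name `split_sequences`, the 10% validation fraction, and the toy sequence list are assumptions for illustration and are not part of the ProtGPT2 code.

```python
import random


def split_sequences(sequences, val_fraction=0.1, seed=0):
    """Shuffle and split sequences into (train, validation) lists.

    Hypothetical helper: the 10% default is an arbitrary choice, not a
    recommendation from the model authors.
    """
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(sequences)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sequence for validation, even for tiny datasets.
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy stand-ins for real protein sequences.
toy = [f"SEQ{i}" for i in range(900)]
train, val = split_sequences(toy)
print(len(train), len(val))
```

If the validation loss keeps falling together with the training loss, the dataset is probably large enough to learn from; if it rises early while the training loss drops, the model is likely memorizing the few sequences it has.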

ShanGao

Aug 23, 2022

Hi Noelia,

I tested the example code from Zenodo as follows:

from transformers import pipeline
protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
sequences = protgpt2("M", max_length=100, min_length=80, ...)

The actual length of the generated protein sequences ranges from 239 to 298. Do "max_length" and "min_length" actually refer to the number of tokens?

nferruz

Owner Aug 23, 2022

Yes, min_length and max_length correspond to the number of tokens.
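To see the difference between token count and sequence length for a given output, the two can be counted side by side. This is a rough sketch, not the authors' code: `length_report` and `toy_encode` are hypothetical names, and the toy encoder (one token per chunk of up to four characters) only stands in for a real tokenizer so that the example runs without downloading the model.

```python
def length_report(sequence, encode):
    """Return (n_tokens, n_chars) for one generated sequence.

    `encode` is any callable that maps text to a list of tokens or ids.
    Whitespace is dropped from the character count in case the generated
    text contains line breaks.
    """
    n_chars = len("".join(sequence.split()))
    return len(encode(sequence)), n_chars


def toy_encode(text):
    """Illustrative stand-in for a tokenizer: chunks of up to 4 characters."""
    return [text[i:i + 4] for i in range(0, len(text), 4)]


n_tokens, n_chars = length_report("MKVLAAGIVGLLLA", toy_encode)
print(n_tokens, n_chars)
```

With the real model, the `encode` method of the tokenizer loaded via `AutoTokenizer.from_pretrained("nferruz/ProtGPT2")` could be passed in place of `toy_encode`, and the function applied to the `generated_text` field of each pipeline output.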
