Image captioning with pre-trained vision and text models
For this project, a pre-trained image model like ViT can be used as the encoder, and a pre-trained text model like BERT or GPT2 can be used as the decoder.
Model
Pre-trained ViT, BERT, and GPT2 models can be found on the model hub.
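As a rough sketch, recent transformers releases expose a vision-encoder-decoder class that ties such checkpoints together. The checkpoint names below are illustrative, and the multilingual BERT decoder is an assumption for French captions:

```python
# A minimal sketch, assuming the FlaxVisionEncoderDecoderModel API from recent
# transformers releases. Checkpoint names are illustrative; the multilingual
# BERT decoder is an assumption for French.
from transformers import AutoTokenizer, FlaxVisionEncoderDecoderModel, ViTFeatureExtractor

model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # image encoder
    "bert-base-multilingual-uncased",     # text decoder (cross-attention is added for you)
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
```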
Datasets
The WIT dataset can be used for this task. It contains around 1M image-text examples for the French language.
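A hedged sketch of pulling the French subset with the datasets library is shown below; the dataset id (google/wit) and the column names (language, image_url, caption_reference_description) are assumptions about how WIT is hosted and may need adjusting:

```python
# A hedged sketch, assuming WIT is available on the hub under an id like
# "google/wit" and exposes "language", "image_url", and
# "caption_reference_description" columns; adjust to the actual layout.
from datasets import load_dataset

wit = load_dataset("google/wit", split="train", streaming=True)
wit_fr = wit.filter(
    lambda ex: ex["language"] == "fr" and ex["caption_reference_description"]
)
for example in wit_fr.take(3):
    print(example["image_url"], "->", example["caption_reference_description"])
```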
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used, with some modifications, to train it.
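The main modification is in preprocessing: the encoder inputs become pixel_values produced by a ViT feature extractor instead of tokenized text, while the captions are tokenized as decoder labels. A hypothetical sketch, in which the column names and checkpoints are assumptions:

```python
# A hypothetical sketch of the main change to run_summarization_flax.py:
# images are turned into pixel_values with a ViT feature extractor, and the
# captions are tokenized as labels. Column names ("image_path", "caption")
# and checkpoints are assumptions about the dataset format.
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

def preprocess_function(examples):
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    model_inputs = feature_extractor(images=images, return_tensors="np")
    labels = tokenizer(
        examples["caption"], max_length=64, padding="max_length", truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```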
(Optional) Desired project outcome
The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train a captioning model for French. The result can be showcased directly with a Streamlit or Gradio app.
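A minimal Gradio sketch of such a demo is shown below; generate_caption is a hypothetical helper that would wrap feature extraction, model.generate, and tokenizer decoding:

```python
# A minimal Gradio sketch; generate_caption is a hypothetical helper standing
# in for feature extraction, model.generate, and tokenizer decoding.
import gradio as gr

def generate_caption(image):
    # Hypothetical: run the trained ViT + BERT/GPT2 captioner on a PIL image
    # and return the decoded French caption.
    return "légende générée par le modèle"  # placeholder output

demo = gr.Interface(fn=generate_caption, inputs=gr.Image(type="pil"), outputs="text")
demo.launch()
```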
(Optional) Challenges
This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
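In transformers config terms, this amounts to setting is_decoder and add_cross_attention on the text model, which inserts randomly initialized cross-attention layers; the encoder-decoder helper classes set these flags automatically when given a decoder checkpoint. A hedged sketch with an assumed BERT checkpoint:

```python
# A hedged sketch of the modification: enabling is_decoder and
# add_cross_attention adds randomly initialized cross-attention layers to the
# pre-trained text model so it can attend to the ViT encoder outputs.
from transformers import BertConfig, FlaxBertForCausalLM

decoder_config = BertConfig.from_pretrained(
    "bert-base-multilingual-uncased",  # assumed decoder checkpoint
    is_decoder=True,
    add_cross_attention=True,
)
decoder = FlaxBertForCausalLM.from_pretrained(
    "bert-base-multilingual-uncased", config=decoder_config
)
```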
(Optional) Links to read upon
This Keras example shows how an image encoder and a Transformer can be combined for image captioning.