Image captioning with pre-trained vision and text models
For this project, a pre-trained image model like ViT can be used as the encoder, and a pre-trained text model like BERT or GPT2 can be used as the decoder.
Model
Pre-trained ViT, BERT, and GPT2 models can be found on the model hub.
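As a rough sketch, recent transformers releases expose a vision-encoder-decoder class that ties such checkpoints together. The checkpoint names below are illustrative, and the multilingual BERT decoder is an assumption for French captions:

```python
# A minimal sketch, assuming the FlaxVisionEncoderDecoderModel API from recent
# transformers releases. Checkpoint names are illustrative; the multilingual
# BERT decoder is an assumption for French.
from transformers import AutoTokenizer, FlaxVisionEncoderDecoderModel, ViTFeatureExtractor

model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # image encoder
    "bert-base-multilingual-uncased",     # text decoder (cross-attention is added for you)
)
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
```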
Datasets
The WIT dataset can be used for this task. It contains around 1M image-text examples for the French language.
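A hedged sketch of pulling the French subset with the datasets library is shown below; the dataset id (google/wit) and the column names (language, image_url, caption_reference_description) are assumptions about how WIT is hosted and may need adjusting:

```python
# A hedged sketch, assuming WIT is available on the hub under an id like
# "google/wit" and exposes "language", "image_url", and
# "caption_reference_description" columns; adjust to the actual layout.
from datasets import load_dataset

wit = load_dataset("google/wit", split="train", streaming=True)
wit_fr = wit.filter(
    lambda ex: ex["language"] == "fr" and ex["caption_reference_description"]
)
for example in wit_fr.take(3):
    print(example["image_url"], "->", example["caption_reference_description"])
```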
Available training scripts
As this will be a Seq2Seq model, the run_summarization_flax.py script can be used, with some modifications, to train it.
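The main modification is in preprocessing: the encoder inputs become pixel_values produced by a ViT feature extractor instead of tokenized text, while the captions are tokenized as decoder labels. A hypothetical sketch, in which the column names and checkpoints are assumptions:

```python
# A hypothetical sketch of the main change to run_summarization_flax.py:
# images are turned into pixel_values with a ViT feature extractor, and the
# captions are tokenized as labels. Column names ("image_path", "caption")
# and checkpoints are assumptions about the dataset format.
from PIL import Image
from transformers import AutoTokenizer, ViTFeatureExtractor

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

def preprocess_function(examples):
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    model_inputs = feature_extractor(images=images, return_tensors="np")
    labels = tokenizer(
        examples["caption"], max_length=64, padding="max_length", truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```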
(Optional) Desired project outcome
The desired outcome is to see whether pre-trained vision and text models can be leveraged for image captioning, and to train a captioning model for French. The result can be showcased directly with a Streamlit or Gradio app.
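A minimal Gradio sketch of such a demo is shown below; generate_caption is a hypothetical helper that would wrap feature extraction, model.generate, and tokenizer decoding:

```python
# A minimal Gradio sketch; generate_caption is a hypothetical helper standing
# in for feature extraction, model.generate, and tokenizer decoding.
import gradio as gr

def generate_caption(image):
    # Hypothetical: run the trained ViT + BERT/GPT2 captioner on a PIL image
    # and return the decoded French caption.
    return "légende générée par le modèle"  # placeholder output

demo = gr.Interface(fn=generate_caption, inputs=gr.Image(type="pil"), outputs="text")
demo.launch()
```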
(Optional) Challenges
This model will require some modifications to the existing text models. Specifically, as this will be a seq2seq model, we’ll need to add a randomly initialized cross-attention layer in BERT or GPT2 to use it as a decoder in the encoder-decoder setting.
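In transformers config terms, this amounts to setting is_decoder and add_cross_attention on the text model, which inserts randomly initialized cross-attention layers; the encoder-decoder helper classes set these flags automatically when given a decoder checkpoint. A hedged sketch with an assumed BERT checkpoint:

```python
# A hedged sketch of the modification: enabling is_decoder and
# add_cross_attention adds randomly initialized cross-attention layers to the
# pre-trained text model so it can attend to the ViT encoder outputs.
from transformers import BertConfig, FlaxBertForCausalLM

decoder_config = BertConfig.from_pretrained(
    "bert-base-multilingual-uncased",  # assumed decoder checkpoint
    is_decoder=True,
    add_cross_attention=True,
)
decoder = FlaxBertForCausalLM.from_pretrained(
    "bert-base-multilingual-uncased", config=decoder_config
)
```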
(Optional) Links to read upon
This Keras example shows how an image encoder and a Transformer can be combined for image captioning.