How to use DAMO-NLP-SG/VideoRefer-7B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("visual-question-answering", model="DAMO-NLP-SG/VideoRefer-7B")

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("DAMO-NLP-SG/VideoRefer-7B", dtype="auto")
```
metadata
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: visual-question-answering
tags:
- multimodal large language model
- large video-language model
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
If you like our project, please give us a star ⭐ on GitHub to get the latest updates.
Model Zoo
| Model Name | Visual Encoder | Language Decoder | # Training Frames |
|---|---|---|---|
| VideoRefer-7B | siglip-so400m-patch14-384 | Qwen2-7B-Instruct | 16 |
| VideoRefer-7B-stage2 | siglip-so400m-patch14-384 | Qwen2-7B-Instruct | 16 |
| VideoRefer-7B-stage2.5 | siglip-so400m-patch14-384 | Qwen2-7B-Instruct | 16 |
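The table lists 16 training frames for every checkpoint, but this card does not say how those frames are chosen from a video. As a rough illustration only, the sketch below assumes uniform sampling across the clip, which is a common choice for video LLMs. `sample_frame_indices` is a hypothetical helper written for this example and is not part of the VideoRefer code base.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick up to `num_samples` frame indices spread evenly across a video.

    Assumption: uniform sampling. The model card only states the frame
    count (16), not the sampling strategy.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Short clips: keep every frame rather than repeating any.
    if total_frames <= num_samples:
        return list(range(total_frames))
    # Split the video into `num_samples` equal-width segments and
    # take the frame at the midpoint of each segment.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 160-frame clip is reduced to 16 evenly spaced indices.
    print(sample_frame_indices(160))
```

The midpoint rule avoids always picking the very first frame of each segment; other conventions (for example, including the first and last frame) are equally plausible, so check the project's own preprocessing before relying on this.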
Citation
If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
@article{yuan2024videorefersuite,
  title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
  journal = {arXiv},
  year = {2024},
  url = {}
}