
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

Resolving the Triplet of "When, Who, and What is Said" in Dialogue-Centric Videos

arXiv HuggingFace Project Page GitHub Code License


πŸ”₯ News

  • [2026/02/10] πŸš€ D-ORCA inference code and the model checkpoint are released! We achieve SOTA results on dialogue-centric video understanding among open-source models.
  • [2026/02/08] πŸ“„ Our paper is available on arXiv.

πŸ“– Introduction

We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. Unlike existing models that struggle with speaker attribution and temporal alignment, D-ORCA is designed to accurately resolve when, who, and what is said in the video.

Most open-source multimodal LLMs fail to produce accurate, dialogue-aware captions (see Figure 1). D-ORCA addresses this by:

  1. Constructing DVD-Train: A large-scale (40k videos) bilingual dataset tailored for dialogue scenarios.
  2. Novel RL Optimization: Adopting GRPO with three specialized rewards:
    • 🎯 Speaker Attribution Accuracy
    • πŸ“ Global Speech Content Accuracy
    • ⏱️ Sentence-level Temporal Boundary Alignment
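The paper defines the exact reward formulations; as a minimal illustrative sketch (not the official implementation), assuming each caption is parsed into per-sentence segments with a speaker label and a `[start, end]` timestamp span, the attribution and temporal rewards could look like:

```python
# Illustrative sketch of two of the GRPO reward signals; function names and
# the per-sentence averaging are assumptions, not the paper's exact recipe.

def speaker_attribution_reward(ref_speakers, hyp_speakers):
    """Fraction of sentences whose predicted speaker matches the reference."""
    if not ref_speakers:
        return 0.0
    correct = sum(r == h for r, h in zip(ref_speakers, hyp_speakers))
    return correct / len(ref_speakers)

def temporal_iou_reward(ref_spans, hyp_spans):
    """Mean per-sentence IoU between reference and predicted [start, end] spans."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    if not ref_spans:
        return 0.0
    return sum(iou(r, h) for r, h in zip(ref_spans, hyp_spans)) / len(ref_spans)
```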

Despite having only 8B parameters, D-ORCA outperforms existing open-source models and remains competitive with significantly larger models on general benchmarks.

Figure 1: Comparison of D-ORCA with other models. D-ORCA accurately identifies speakers, recognizes speech, and aligns timestamps.

πŸ† Performance

D-ORCA achieves state-of-the-art performance on our curated DVD-Bench.

| Model | Acc% ↑ (En) | WER% ↓ (En) | IoU% ↑ (En) | Acc% ↑ (Zh) | WER% ↓ (Zh) | IoU% ↑ (Zh) |
|---|---|---|---|---|---|---|
| ARC-Qwen-Video-Narrator (7B) | 66.4 | 65.0 | 23.0 | 63.2 | 53.6 | 10.1 |
| Qwen2.5-Omni (7B) | 62.7 | 83.6 | - | 55.7 | 69.4 | - |
| video-SALMONN 2+ (7B) | 66.6 | 94.0 | - | 59.9 | - | - |
| AVoCaDO (7B) | 72.9 | 17.9 | - | 69.3 | - | - |
| Qwen3-Omni-Instruct (30B) | 67.8 | 91.3 | - | 63.5 | 60.6 | - |
| **D-ORCA (8B)** | **81.1** | **16.6** | **57.1** | **78.0** | **17.5** | **37.8** |
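WER% in the table above is the standard word error rate: word-level Levenshtein (edit) distance normalized by the reference length. A minimal reference implementation for sanity-checking outputs against transcripts (assuming whitespace tokenization; the benchmark's official scorer may normalize text differently):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    r, h = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between r[:i-1] and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            sub = prev[j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)  # deletion, insertion
        prev = cur
    return prev[len(h)] / max(len(r), 1)
```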

D-ORCA also achieves competitive results on general audio-visual benchmarks.

| Model | Video-MME | WorldSense | AVUT | Video-Holmes | DailyOmni | AV-SpeakerBench |
|---|---|---|---|---|---|---|
| OmniVinci (7B) | 68.6 | 48.2 | - | - | 66.5 | - |
| ARC-Qwen-Video-Narrator (7B) | 62.4 | 45.1 | - | 43.2 | 63.2 | 40.2 |
| Qwen2.5-Omni (7B) | 64.3 | 47.8 | 66.3 | 43.7 | 62.7 | 42.3 |
| video-SALMONN 2+ (7B) | 73.4 | 50.9 | 69.5 | 46.9 | 71.8 | 51.6 |
| AVoCaDO (7B) | 65.9 | 49.9 | 70.0 | 47.2 | 69.8 | 51.6 |
| Qwen3-Omni-Instruct (30B-A3B) | 70.5 | 54.0 | 72.0 | 54.1 | 69.8 | 54.1 |
| **D-ORCA (8B)** | **72.9** | **53.7** | **76.1** | **48.5** | **78.5** | **55.0** |

πŸ› οΈ Quick Start

Model Zoo

| Model | Base LLM | Params | HuggingFace |
|---|---|---|---|
| D-ORCA-8B-0210 | Qwen3-VL-8B | 8B | Download |

Installation

For key library versions, please refer to `new_req.txt`. Install via:

```shell
pip install -r new_req.txt
```

Inference

For evaluation on a whole dataset:

  1. Prepare the dataset following `scripts/example_data.json`.
  2. Modify the parameters in `scripts/direct_test.sh`.
  3. Run `bash scripts/direct_test.sh`.

For setting up a CLI demo:

  1. Modify the parameters in `scripts/direct_demo.sh`.
  2. Run `bash scripts/direct_demo.sh`.
  3. Modify `scripts/demo_config.yaml` to control input.

πŸ“… Roadmap

  • [x] Release D-ORCA 8B model checkpoints.
  • [x] Release inference code.
  • [ ] Release DVD-Bench evaluation data.
  • [ ] Release training code (SFT, pre-DPO, GRPO).
  • [ ] Release DVD-Train dataset annotations.

πŸ–ŠοΈ Citation

If you find D-ORCA useful for your research, please cite our paper:

```bibtex
@article{tang2026dorca,
  title={{D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning}},
  author={Changli Tang and Tianyi Wang and Fengyun Rao and Jing LYU and Chao Zhang},
  journal={arXiv preprint arXiv:2602.07960},
  year={2026}
}
```

πŸ“„ License

This project is licensed under the Apache 2.0 License.
