arxiv:2512.03514

M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Published on Dec 3
· Submitted by Adithya S K on Dec 8

Abstract

M3DR is a multilingual multimodal document retrieval framework using contrastive training to achieve robust cross-lingual and cross-modal alignment across diverse languages and document types.

AI-generated summary

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
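The two retrieval paradigms named in the abstract differ mainly in how a query is scored against a document. The sketch below illustrates that difference on toy vectors: a single dense vector is typically compared by cosine similarity, while ColBERT-style late interaction sums, over query tokens, the best match among the document's vectors (often called MaxSim). The function names and the toy data are assumptions made for illustration; this is not the authors' implementation, and it says nothing about how NetraEmbed or ColNetraEmbed compute their embeddings.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dense_score(query_vec, doc_vec):
    """Single-vector retrieval: cosine similarity of one query and one doc vector."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens, doc_vectors):
    """Late interaction: each query token keeps only its best-matching
    document vector, and those per-token maxima are summed."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(p) for p in doc_vectors]
    return sum(max(dot(t, p) for p in d) for t in q)


# Toy 2-D example (hypothetical values, not model output).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # contains an exact match per token
doc_b = [[-1.0, 0.0], [0.0, -1.0]]             # points away from the query tokens

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
ranking = sorted(["doc_a", "doc_b"],
                 key=lambda name: {"doc_a": score_a, "doc_b": score_b}[name],
                 reverse=True)
```

In this toy case doc_a ranks first because every query token finds an exact match in it. The multi-vector approach stores one vector per token or patch, so it usually costs more storage than a single dense vector in exchange for finer-grained matching.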

Community


Can we build universal document retrievers that maintain strong results across typologically diverse languages without losing English performance?

This question led us to design synthetic training data and multilingual benchmarks to teach a model to match documents across scripts and formats.

We are excited to launch NetraEmbed, our SoTA model for multimodal multilingual document retrieval, along with the paper "M3DR: Towards Universal Multilingual Multimodal Document Retrieval". The release includes the NetraEmbed model, which produces a single dense embedding with Matryoshka support at 768, 1536, and 2560 dimensions, and the ColNetraEmbed model, which produces patch-level multi-vector embeddings. Both models are fine-tuned from Gemma3-4B-it and improve on baselines by ~150%. To measure progress, we also built the NayanaIR Benchmark, with 22 multilingual datasets and 1 cross-lingual dataset, and documented the full framework in the M3DR paper.
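Matryoshka-style embeddings are usually consumed by keeping only the first k dimensions of the full vector and renormalizing, which trades some accuracy for a smaller index. The sketch below shows that step for the three sizes named in the post. The helper name truncate_embedding and the placeholder vector are assumptions for illustration; whether the released NetraEmbed code exposes its own option for this is not stated here.

```python
import math

# Embedding sizes named in the announcement.
MATRYOSHKA_DIMS = (768, 1536, 2560)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Hypothetical helper: it assumes the model was trained so that
    leading prefixes of the embedding are usable on their own.
    """
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}")
    if len(vec) < dim:
        raise ValueError("embedding is shorter than the requested dim")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Placeholder 2560-d vector standing in for real model output.
full = [float((i % 7) - 3) for i in range(2560)]

small = truncate_embedding(full, 768)
medium = truncate_embedding(full, 1536)
```

Queries and documents need to be truncated to the same size before they are compared, since a 768-dimensional query cannot be scored against a 2560-dimensional document vector.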

Links
Blog: https://www.cognitivelab.in/blog/introducing-netraembed

