Why are LVLMs bad at picking up on hints? Probing the Grounding Gap in Vision-Language Models

Published November 16, 2025


Human communication is inherently contextual; for example, exclaiming “Hey, this is a disaster!” upon seeing a cluttered room conveys frustration or exaggeration rather than referring to an actual catastrophe. Without surrounding cues, textual dialogues can be ambiguous, making it difficult for models to accurately capture intent and nuance.

This is a basic human skill, part of what researchers call Multimodal Theory of Mind (ToM). We blend language, vision, and shared context to infer each other's intentions.

For today's powerful Large Vision-Language Models (LVLMs), this is still a massive challenge.

While today's models are great at answering direct questions ("Where is the red car in the image?"), they often fail on indirect, ambiguous, human-like prompts.

Previous benchmarks, like VAGUE (ICCV 2025, https://arxiv.org/abs/2411.14137), have shown that models struggle with this. But they only tell us that the model gets the final answer wrong. They don't tell us why.

Does the model fail because it's "hallucinating"? Or does it fail because it isn't even looking at the right object in the first place?

We wanted to answer that question. Our work, "Beyond VAGUE," introduces a new dataset and a probing framework to move past simple accuracy scores and analyze a model's internal reasoning process. We're essentially building a diagnostic tool to see what the model is "thinking."


A New "Microscope": The VAGUE-Ground Dataset

To see if a model is looking at the right object, you first need to know what the "right object" is.

We created VAGUE-Ground, a new benchmark dataset built on top of VAGUE. We meticulously annotated the target object for each ambiguous prompt with a high-fidelity, pixel-perfect segmentation mask. This mask is our "ground truth" for where the model should be paying attention.

Creating this dataset was a multi-stage, semi-automated process:

Propose: We first used a Grounding-DINO model to generate initial bounding box proposals for the target object (e.g., "cup").

Verify (Human-in-the-Loop): Human annotators rigorously checked every single box, filtering out bad proposals or ambiguous, scene-level targets (like "the room").

Segment: We fed these human-verified boxes into the Segment Anything Model (SAM) to generate high-quality, pixel-level masks.

Verify Again: Our annotators did a final quality check on the masks to ensure their accuracy.

The final dataset includes 1,301 instances (from VCR and Ego4D), each pairing an ambiguous expression with a human-verified segmentation mask of its intended target.
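
To make the pipeline concrete, here is a minimal sketch of the propose and segment stages using the Hugging Face transformers implementations of Grounding DINO and SAM. The checkpoint names, thresholds, and the "take the first box" shortcut are illustrative assumptions rather than our exact configuration (argument names such as box_threshold also vary slightly across transformers versions), and the human verification steps sit between and after the two model calls.

```python
# Illustrative propose-then-segment sketch (not the exact annotation pipeline).
import torch
from PIL import Image
from transformers import (
    AutoModelForZeroShotObjectDetection,
    AutoProcessor,
    SamModel,
    SamProcessor,
)

image = Image.open("scene.jpg").convert("RGB")

# --- Stage 1: Grounding DINO proposes boxes for the target noun (e.g., "cup") ---
dino_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
dino_model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base"
)
dino_inputs = dino_processor(images=image, text="a cup.", return_tensors="pt")
with torch.no_grad():
    dino_outputs = dino_model(**dino_inputs)
detections = dino_processor.post_process_grounded_object_detection(
    dino_outputs,
    dino_inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
boxes = detections["boxes"]  # proposals; human annotators verify/filter these

# --- Stage 2: SAM turns a verified box into a pixel-level mask ---
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam_model = SamModel.from_pretrained("facebook/sam-vit-base")
box = boxes[0].tolist()  # assume this box survived human verification
sam_inputs = sam_processor(image, input_boxes=[[box]], return_tensors="pt")
with torch.no_grad():
    sam_outputs = sam_model(**sam_inputs)
masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks.cpu(),
    sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu(),
)
# SAM returns several mask candidates per box; annotators re-check the chosen one.
target_mask = masks[0][0, 0]  # boolean (H, W) mask
```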


The Experiment: Do Models Really Ground Ambiguity?

With our new dataset, we could finally run the experiment. We wanted to compare how a model's internal attention changes when it's given an easy prompt versus an ambiguous one.

We used the popular LLaVA-1.5 (7B) model. Our process had three stages:

Find the "Grounding Layer": First, we had to find the part of the model that's responsible for "looking" at objects. We fed the model the direct prompt (e.g., "the cup") and analyzed all its attention layers. We identified the specific layers and heads (e.g., layer 20, head 7) that were best at "lighting up" on the correct object, matching our ground-truth masks. This gives us our baseline.

Run the "Ambiguous" Test: Now, we showed the model the same image but with the indirect, ambiguous prompt (e.g., "I need that one").

Compare the Focus: We then measured the Intersection over Union (IoU) between the model's attention map and our ground-truth mask. In simple terms: How much did the model's "focus" overlap with the correct object?
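
The sketch below shows what this probing step can look like in code, assuming the attention tensors come from a transformers-style forward pass with output_attentions=True, that LLaVA-1.5's 576 image tokens (a 24×24 patch grid) sit in one contiguous block of the sequence, and that the attention map is binarized with a simple top-quantile threshold before computing IoU. The attention_iou helper, the default layer/head indices, and the thresholding rule are illustrative choices, not our exact implementation.

```python
# Illustrative attention-vs-mask IoU probe for a single layer/head.
import numpy as np
import torch
import torch.nn.functional as F

def attention_iou(attentions, image_token_start, gt_mask,
                  layer=20, head=7, grid=24, keep_ratio=0.1):
    """IoU between a thresholded attention map and a binary ground-truth mask.

    attentions: tuple of per-layer tensors, each (batch, heads, seq, seq),
                as returned by a forward pass with output_attentions=True.
    image_token_start: index of the first image token in the sequence
                (assumed to start a contiguous block of grid*grid tokens).
    gt_mask: (H, W) binary numpy array, e.g. a VAGUE-Ground mask.
    """
    num_image_tokens = grid * grid  # 576 for LLaVA-1.5 (336px image, 14px patches)
    attn = attentions[layer][0, head]  # (seq, seq)
    # Attention from the final prompt token onto the image-token block.
    img_attn = attn[-1, image_token_start:image_token_start + num_image_tokens]
    attn_map = img_attn.reshape(1, 1, grid, grid).float()
    # Upsample the 24x24 patch map to the resolution of the ground-truth mask.
    attn_map = F.interpolate(attn_map, size=gt_mask.shape,
                             mode="bilinear", align_corners=False)[0, 0]
    attn_map = attn_map.detach().cpu().numpy()
    # Binarize by keeping the highest-attention pixels (top keep_ratio fraction).
    pred = attn_map >= np.quantile(attn_map, 1.0 - keep_ratio)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union > 0 else 0.0
```

For stage 1, the same function can be swept over every layer and head using the direct prompts; the combination that best matches the ground-truth masks then serves as the fixed probe when we switch to the ambiguous prompts.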


What We Found: "Looking Harder, But Seeing Less"

The results were fascinating and confirmed our hypothesis.

1. Models are "Superficial" Word-Matchers

Qualitatively, we found that the model's attention is overwhelmingly driven by explicit words in the prompt.

When the prompt says "the cup", the model's attention snaps right to the cup.

When the prompt says "that one," the model's attention scatters. It gets confused and just looks for any salient object, proving it didn't understand the speaker's intent.

This confirms that the "superficial understanding" reported by VAGUE can be directly observed in the model's internal mechanisms.

2. Models "Know" They're Confused

We also measured the total amount of attention the model paid to the image (versus the text).

We found that when given the indirect prompt, the model's total image attention consistently increased.

This suggests the model "knows" the text isn't very helpful, so it tries to compensate by "looking harder" at the visual information. But looking harder doesn't mean looking smarter.
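
One simple way to compute such a signal is the fraction of a query token's attention mass that lands on image-token positions, averaged over layers and heads. The image_attention_fraction helper below and the contiguous image-token assumption are illustrative, not our exact metric.

```python
# Illustrative "how much of the attention budget goes to the image?" measure.
import torch

def image_attention_fraction(attentions, image_token_start, num_image_tokens=576):
    fractions = []
    for layer_attn in attentions:                 # (batch, heads, seq, seq)
        last_token = layer_attn[0, :, -1, :]      # (heads, seq): final prompt token
        image_mass = last_token[
            :, image_token_start:image_token_start + num_image_tokens
        ].sum(dim=-1)
        # Attention rows sum to 1, so this is the share spent on image tokens.
        fractions.append(image_mass.mean())
    return torch.stack(fractions).mean().item()
```

Comparing this value between the direct and indirect versions of the same prompt is what reveals the "looking harder" effect.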

3. The Grounding Gap: A 26% Drop

Here's the key number.

Even though the model was "looking harder," its ability to find the correct object plummeted.

Direct Prompt IoU: 0.058 (Baseline)

Indirect Prompt IoU: 0.043

This represents a 25.86% relative degradation in localization performance ((0.058 − 0.043) / 0.058 ≈ 0.259). The model is paying more attention to the image but is more confused about what to look at.


Why This Matters

This work moves beyond just testing for the right or wrong answer. It provides the essential tools to verify, analyze, and quantify the visual grounding capabilities that underpin true human-AI understanding.

If we ever want to build AI agents that can fluidly collaborate with us—robots that can "pass me that one" in a cluttered workshop or software that can "find the other examples" in a complex design file—we first need to bridge this grounding gap.

Our VAGUE-Ground benchmark and probing methodology offer a clear path for researchers to diagnose why their models fail at multimodal reasoning and, ultimately, to build systems that don't just match patterns, but actually understand our intent.


Link to the VAGUE-Ground HF Dataset : https://huggingface.co/datasets/HazelNam/vague_mask

Link to the original VAGUE (VQA-style) Dataset : https://huggingface.co/datasets/HazelNam/vague-bench

The extended version (Beyond VAGUE: Attention Analysis for Probing How VLMs Ground Ambiguity) is under review at the AAAI MMToM workshop.

Link to the original VAGUE paper : https://arxiv.org/abs/2411.14137
