LLMs · Medium impact · For Dev · arXiv LLMs · June 12, 2026

Gaze Heads: How VLMs Look at What They Describe

Researchers identify specific attention heads, called gaze heads, in vision-language models that track and control the described image regions, allowing targeted steering of model output without retraining.
Signal strength: 3.4/5 · arXiv LLMs

What happened

The study found that a small subset of attention heads in the language backbone of vision-language models attend to the image region currently being described. The authors call these gaze heads. By intervening on these heads, the researchers steered model descriptions toward chosen visual regions in both comics and natural images, controlling behavior without retraining. The mechanism was consistent across model sizes and architectures.
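
To make the idea concrete, here is a toy sketch of what "finding" and "steering" such a head could look like. This summary does not say how the authors identify gaze heads or what form their intervention takes, so everything here is an assumption for illustration: the selection rule in `find_gaze_heads`, the `threshold` and `boost` values, and the additive-bias steering in `steer_head` are hypothetical, and the attention scores are hand-made rather than taken from a real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def region_mass(attn_row, region):
    """Fraction of one head's attention that lands on the given token positions."""
    return sum(attn_row[i] for i in region)

def find_gaze_heads(attn_by_head, described_region, threshold=0.5):
    """Hypothetical selection rule (not from the paper): flag heads that put
    at least `threshold` of their attention on the region being described."""
    return [h for h, row in enumerate(attn_by_head)
            if region_mass(row, described_region) >= threshold]

def steer_head(logits, target_region, boost=4.0):
    """Hypothetical intervention (not from the paper): add a constant bias to
    the pre-softmax scores of the target region's tokens, then renormalize."""
    biased = [x + boost if i in target_region else x
              for i, x in enumerate(logits)]
    return softmax(biased)

# Toy setup: 6 key positions. 0-2 are image region A, 3-4 are image region B,
# and 5 is a text token. Scores are invented for the example.
REGION_A = {0, 1, 2}
REGION_B = {3, 4}
head_logits = [
    [3.0, 3.0, 3.0, 0.0, 0.0, 0.0],  # head 0: looks at region A
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.0],  # head 1: looks at the text token
]

attn = [softmax(row) for row in head_logits]
gaze_heads = find_gaze_heads(attn, REGION_A)    # heads tracking region A
steered = steer_head(head_logits[0], REGION_B)  # push head 0 toward region B
```

In this toy, only head 0 is flagged while region A is being described, and after the bias most of its attention mass moves to region B. In a real model the same idea would be applied through hooks on the selected heads during generation, with the selection rule and bias strength tuned empirically.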

Why it matters

This work exposes an interpretable, mechanistic lever inside VLMs: a small set of heads whose attention can be redirected to control which image region the model describes. That advances understanding of model internals and could make multimodal systems more controllable and easier to explain.
