Vision–language models (VLMs) excel at tasks like image classification and captioning, but they often struggle with spatial and compositional reasoning. Spatial reasoning means understanding geometry and object layouts (e.g., “Is the red cube left of the blue sphere?”), while compositional reasoning means correctly interpreting how attributes and relationships combine (e.g., distinguishing “dog chasing cat” from “cat chasing dog”). These capabilities are crucial for applications such as robotics, navigation, and scene understanding, yet studies show that even state-of-the-art VLMs fail at them. For example, Wu et al. note that large VLMs “often struggle with spatial reasoning” despite strong basic vision abilities, and Mishra et al. show models confusing simple compositions (dog chasing cat vs. cat chasing dog). In SpatialEval (NeurIPS 2024), researchers find that spatial tasks are so challenging that VLMs can perform worse than chance and even underperform pure language models, falling back on text when visual detail is needed (https://neurips.cc/virtual/2024/poster/94371). In short, without special training, VLMs tend to get geometry and multi-step object relations wrong.

Spatial Reasoning (challenges and failure modes)

Spatial reasoning involves understanding object positions, distances, and orientations, and how they change across a sequence of frames. Current VLMs often lose geometric information when converting images to text. For instance, Wu et al. point out that standard methods treat reasoning “purely through text,” so precise geometric understanding is lost.
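To make this concrete, here is a minimal sketch (using a hypothetical detector output with made-up labels and coordinates, not taken from any cited work) of why a purely textual scene summary cannot support a precise left/right judgment, while keeping the detector’s coordinates makes the same question trivial:

```python
# Hypothetical detector output: class labels plus box centers in pixel coordinates.
scene = [
    {"label": "red cube",    "cx": 120, "cy": 340},
    {"label": "blue sphere", "cx": 410, "cy": 335},
]

# A purely textual summary drops the layout entirely.
caption = "The image shows a " + " and a ".join(obj["label"] for obj in scene) + "."
print(caption)  # "The image shows a red cube and a blue sphere." -- no geometry left

# With coordinates retained, "Is the red cube left of the blue sphere?" is a comparison.
red, blue = scene[0], scene[1]
print("red cube left of blue sphere:", red["cx"] < blue["cx"])  # True
```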


SpatialEval further demonstrates this: tasks like top-down map navigation or counting objects by region can stump VLMs, and models frequently ignore image pixels when textual hints suffice. The authors emphasize that VLMs lack true 3D understanding (they cannot reliably judge distance or size without explicit 3D cues) and attribute this to training data that lacks metric depth.

In practice, this means that even when a model identifies all the objects, it may fail to relate them in space or track their motion. Benchmarks reveal common failure modes such as confusing relative directions (left/right/up/down) and miscounting when objects are arranged unusually.

Key difficulties in spatial reasoning include:

- Loss of geometric detail when images are flattened into text descriptions.
- Confusion of relative directions (left/right/up/down) between objects.
- Miscounting when objects are arranged unusually or must be counted by region.
- Lack of true 3D understanding, so distance and size judgments are unreliable without explicit depth cues.
- Failure to track how object positions change across a sequence of frames.

A minimal probe for the direction-confusion failure is sketched below.
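This illustrative harness (our own sketch, not SpatialEval’s actual protocol; `ask_model` stands in for any VLM query function) generates left/right questions from ground-truth box centers and scores free-form answers by simple substring matching; an always-“yes” model shows what chance-level behavior looks like:

```python
# Illustrative spatial-relation probe: build "is A left of B?" questions from
# ground-truth box centers and score a model's text replies against them.

def make_questions(objects):
    questions = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            questions.append({
                "question": f"Is the {a['label']} left of the {b['label']}? Answer yes or no.",
                "answer": "yes" if a["cx"] < b["cx"] else "no",
            })
    return questions

def accuracy(questions, ask_model):
    """ask_model: any callable mapping a question string to the model's text reply."""
    hits = sum(q["answer"] in ask_model(q["question"]).lower() for q in questions)
    return hits / len(questions)

# Hypothetical ground-truth layout (pixel coordinates).
objects = [
    {"label": "red cube",    "cx": 120, "cy": 340},
    {"label": "blue sphere", "cx": 410, "cy": 335},
    {"label": "green cone",  "cx": 260, "cy": 150},
]

# Chance-level reference: a model that always answers "yes".
print(accuracy(make_questions(objects), ask_model=lambda q: "yes"))
# A real evaluation would pass the rendered image plus the question to the VLM instead.
```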

Compositional Reasoning (challenges and failure modes)

Compositional reasoning involves understanding how objects, attributes, and relations compose to form a scene. For example, identifying “a red square on a white circle” requires binding the color “red” to the square and “white” to the circle. Research shows VLMs are weak here: Mishra et al. observe that even GPT-4V and similar models often misinterpret simple compositions, such as confusing the subject and object of a verb.

Zeng et al. introduced the ARPGrounding benchmark and found that VLMs perform well on ordinary grounding tasks but show marked deficiencies in compositional reasoning when multiple attributes or relations combine. In other words, a model might recognize “dog” and “cat” in an image but fail to link “chasing” to the right pair.
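One common way such relation-binding failures are surfaced is with swapped caption pairs: the model should score the true caption higher than the version with subject and object exchanged. The sketch below uses an off-the-shelf CLIP checkpoint from Hugging Face as the image-text scorer purely for illustration; the image path and the choice of scorer are assumptions, not details of the cited benchmarks.

```python
# Caption-swap test for relation binding: does the model prefer the correct caption
# over its subject/object swap for the same image?
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_chasing_cat.jpg")              # hypothetical test image
captions = ["a dog chasing a cat", "a cat chasing a dog"]  # correct caption first

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image[0]           # similarity of the image to each caption
print("model prefers correct caption:", bool(logits[0] > logits[1]))
```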

Moreover, existing benchmarks have been found to be biased: a Princeton analysis (CompGPT) shows that about 60% of caption pairs in one dataset were trivial, meaning models could succeed through language priors rather than vision. Even after such biases are corrected, VLMs generally reach only moderate accuracy on compositional tests.
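A simple way to detect such language-prior shortcuts is a “blind” baseline: score each caption with a text-only language model and no image at all; if the blind model already prefers the correct caption, the pair can be solved without vision. The following is a hedged sketch of that idea (GPT-2 is used only as a convenient stand-in for a text-only scorer, and the caption pair is the running example from this section, not an item from any cited dataset):

```python
# "Blind" (text-only) check for language-prior bias: compare caption likelihoods
# under a language model that never sees the image.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def per_token_log_likelihood(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)   # loss = mean cross-entropy over predicted tokens
    return -out.loss.item()         # higher = more likely per token under the LM

pair = ("a dog chasing a cat", "a cat chasing a dog")  # correct caption first
blind_prefers_correct = per_token_log_likelihood(pair[0]) > per_token_log_likelihood(pair[1])
print("solvable from language priors alone:", blind_prefers_correct)
```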

Common compositional failures include:

- Attribute binding errors, e.g., attaching “red” to the wrong object in “a red square on a white circle”.
- Swapping the subject and object of a relation, e.g., reading “dog chasing cat” as “cat chasing dog”.
- Recognizing the individual objects but failing to link the relation (e.g., “chasing”) to the right pair.
- Relying on language priors instead of visual evidence, so biased caption pairs are answered without looking at the image.