Vision–language models (VLMs) excel at tasks like image classification and captioning, but they often struggle with spatial and compositional reasoning. Spatial reasoning means understanding geometry and object layouts (e.g., “Is the red cube left of the blue sphere?”), while compositional reasoning means correctly interpreting how attributes and relationships combine (e.g., distinguishing “dog chasing cat” from “cat chasing dog”). These capabilities are crucial for applications such as robotics, navigation, and scene understanding, yet studies show that even state-of-the-art VLMs fail at them. For example, Wu et al. note that large VLMs “often struggle with spatial reasoning” despite strong basic vision abilities, and Mishra et al. show models confusing simple compositions (dog chasing cat vs. cat chasing dog). In SpatialEval (NeurIPS 2024), researchers find that spatial tasks are so challenging that VLMs can perform worse than chance and even underperform pure language models, falling back on text when visual detail is needed (https://neurips.cc/virtual/2024/poster/94371). In short, without special training, VLMs tend to get geometry and multi-step object relations wrong.

Spatial Reasoning (challenges and failure modes)

Spatial reasoning involves understanding object positions, distances, and orientations, and how they change across a sequence of frames. Current VLMs often lose geometric information when converting images to text. For instance, Wu et al. point out that standard methods treat reasoning “purely through text,” so precise geometric understanding is lost.
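To make this concrete, here is a minimal sketch (using a hypothetical detector output with made-up labels and coordinates, not taken from any cited work) of why a purely textual scene summary cannot support a precise left/right judgment, while keeping the detector’s coordinates makes the same question trivial:

```python
# Hypothetical detector output: class labels plus box centers in pixel coordinates.
scene = [
    {"label": "red cube",    "cx": 120, "cy": 340},
    {"label": "blue sphere", "cx": 410, "cy": 335},
]

# A purely textual summary drops the layout entirely.
caption = "The image shows a " + " and a ".join(obj["label"] for obj in scene) + "."
print(caption)  # "The image shows a red cube and a blue sphere." -- no geometry left

# With coordinates retained, "Is the red cube left of the blue sphere?" is a comparison.
red, blue = scene[0], scene[1]
print("red cube left of blue sphere:", red["cx"] < blue["cx"])  # True
```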


SpatialEval further demonstrates this: tasks like top-down map navigation or counting objects by region can stump VLMs, and models frequently ignore image pixels when textual hints suffice. The authors emphasize that VLMs lack true 3D understanding (they cannot reliably judge distance or size without explicit 3D cues) and attribute this to training data that lacks metric depth.

In practice, this means that even when a model identifies all the objects, it may fail to relate them in space or track their motion. Benchmarks reveal common failure modes such as confusing relative directions (left/right/up/down) and miscounting when objects are arranged unusually.

Key difficulties in spatial reasoning include:

- Loss of geometric detail when images are flattened into text descriptions.
- Confusion of relative directions (left/right/up/down) between objects.
- Miscounting when objects are arranged unusually or must be counted by region.
- Lack of true 3D understanding, so distance and size judgments are unreliable without explicit depth cues.
- Failure to track how object positions change across a sequence of frames.

A minimal probe for the direction-confusion failure is sketched below.
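This illustrative harness (our own sketch, not SpatialEval’s actual protocol; `ask_model` stands in for any VLM query function) generates left/right questions from ground-truth box centers and scores free-form answers by simple substring matching; an always-“yes” model shows what chance-level behavior looks like:

```python
# Illustrative spatial-relation probe: build "is A left of B?" questions from
# ground-truth box centers and score a model's text replies against them.

def make_questions(objects):
    questions = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            questions.append({
                "question": f"Is the {a['label']} left of the {b['label']}? Answer yes or no.",
                "answer": "yes" if a["cx"] < b["cx"] else "no",
            })
    return questions

def accuracy(questions, ask_model):
    """ask_model: any callable mapping a question string to the model's text reply."""
    hits = sum(q["answer"] in ask_model(q["question"]).lower() for q in questions)
    return hits / len(questions)

# Hypothetical ground-truth layout (pixel coordinates).
objects = [
    {"label": "red cube",    "cx": 120, "cy": 340},
    {"label": "blue sphere", "cx": 410, "cy": 335},
    {"label": "green cone",  "cx": 260, "cy": 150},
]

# Chance-level reference: a model that always answers "yes".
print(accuracy(make_questions(objects), ask_model=lambda q: "yes"))
# A real evaluation would pass the rendered image plus the question to the VLM instead.
```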

Compositional Reasoning (challenges and failure modes)

Compositional reasoning involves understanding how objects, attributes, and relations compose to form a scene. For example, identifying “a red square on a white circle” requires binding the color “red” to the square and “white” to the circle. Research shows VLMs are weak here: Mishra et al. observe that even GPT-4V and similar models often misinterpret simple compositions, such as confusing the subject and object of a verb.

Zeng et al. introduced the ARPGrounding benchmark and found that VLMs perform well on ordinary grounding tasks but show marked deficiencies in compositional reasoning when multiple attributes or relations combine. In other words, a model might recognize “dog” and “cat” in an image but fail to link “chasing” to the right pair.
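One common way such relation-binding failures are surfaced is with swapped caption pairs: the model should score the true caption higher than the version with subject and object exchanged. The sketch below uses an off-the-shelf CLIP checkpoint from Hugging Face as the image-text scorer purely for illustration; the image path and the choice of scorer are assumptions, not details of the cited benchmarks.

```python
# Caption-swap test for relation binding: does the model prefer the correct caption
# over its subject/object swap for the same image?
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_chasing_cat.jpg")              # hypothetical test image
captions = ["a dog chasing a cat", "a cat chasing a dog"]  # correct caption first

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image[0]           # similarity of the image to each caption
print("model prefers correct caption:", bool(logits[0] > logits[1]))
```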

Moreover, existing benchmarks have been found to be biased: a Princeton analysis (CompGPT) shows that about 60% of caption pairs in one dataset were trivial, meaning models could succeed through language priors rather than vision. Even after such biases are corrected, VLMs generally reach only moderate accuracy on compositional tests.
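A simple way to detect such language-prior shortcuts is a “blind” baseline: score each caption with a text-only language model and no image at all; if the blind model already prefers the correct caption, the pair can be solved without vision. The following is a hedged sketch of that idea (GPT-2 is used only as a convenient stand-in for a text-only scorer, and the caption pair is the running example from this section, not an item from any cited dataset):

```python
# "Blind" (text-only) check for language-prior bias: compare caption likelihoods
# under a language model that never sees the image.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def per_token_log_likelihood(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)   # loss = mean cross-entropy over predicted tokens
    return -out.loss.item()         # higher = more likely per token under the LM

pair = ("a dog chasing a cat", "a cat chasing a dog")  # correct caption first
blind_prefers_correct = per_token_log_likelihood(pair[0]) > per_token_log_likelihood(pair[1])
print("solvable from language priors alone:", blind_prefers_correct)
```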

Common compositional failures include:

- Attribute binding errors, e.g., attaching “red” to the wrong object in “a red square on a white circle”.
- Swapping the subject and object of a relation, e.g., reading “dog chasing cat” as “cat chasing dog”.
- Recognizing the individual objects but failing to link the relation (e.g., “chasing”) to the right pair.
- Relying on language priors instead of visual evidence, so biased caption pairs are answered without looking at the image.