arXiv:2508.13968v2 Announce Type: replace-cross
Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information (including captions, depth maps, and more) or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models can reliably identify right-side-up (0°) images, and certain models can identify upside-down (180°) images, but none can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
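
The abstract does not spell out how the voting setup works. One plausible reading is that the image is re-rotated by each candidate angle, the model is queried on each copy, and its judgments are aggregated. The sketch below is only an illustration of that idea under stated assumptions: `query_model` is a hypothetical placeholder for an MLLM API call, and the rotation-direction convention is an assumption, not taken from the paper.

```python
# Hypothetical sketch of a rotation-identification probe with voting.
# Not the paper's implementation; query_model is a placeholder.
from collections import Counter
from PIL import Image

ROTATIONS = [0, 90, 180, 270]


def query_model(image: Image.Image) -> int:
    """Placeholder: ask an MLLM by how many degrees the image appears
    rotated and parse the answer into one of ROTATIONS."""
    raise NotImplementedError("plug in an MLLM client here")


def vote_rotation(image: Image.Image) -> int:
    """Re-rotate the input by each candidate angle and credit an angle
    whenever the model judges that re-rotated copy to be upright (0°)."""
    votes = Counter()
    for angle in ROTATIONS:
        # PIL's rotate() is counter-clockwise; whether this undoes the
        # benchmark's rotation convention is an assumption here.
        candidate = image.rotate(angle, expand=True)
        if query_model(candidate) == 0:
            votes[angle] += 1
    # Fall back to 0° if no candidate copy was judged upright.
    return votes.most_common(1)[0][0] if votes else 0
```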
