arXiv:2603.04950v2 Announce Type: replace-cross
Abstract: Differential medical VQA models compare multiple images to identify clinically meaningful changes and rely on vision encoders to capture fine-grained visual differences that reflect radiologists’ comparative diagnostic workflows. However, vision encoders trained using standard contrastive or classification objectives often fail to capture the subtle variations needed to distinguish true disease progression from acquisition-related variability. To address this limitation, we introduce a location-aware pretraining framework that incorporates automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These tasks promote the learning of fine-grained, spatially grounded visual representations. When integrated with a language model, our approach achieves state-of-the-art performance on medical difference VQA by accurately identifying and reasoning about clinically relevant changes in chest X-ray images.