Divide and Reason: Joint Image and Language Decomposition for Compositional Reasoning

Madhav Kanda†¹, Dwip Dalal†¹, Zhenhailong Wang¹, Heng Ji¹, Unnat Jain²
¹University of Illinois Urbana–Champaign  ²University of California, Irvine
† Equal contribution

TLDR

Compositional vision–language questions require reasoning over multiple entities and relations. We present Divide and Reason, a framework that jointly decomposes the image and the language into aligned, tractable subproblems. Given a question, our method derives a structured plan and iteratively grounds visual primitives (objects, attributes, relations) while refining the textual spans that guide each step. This joint image–language decomposition improves perceptual grounding, enables robust multi-hop reasoning, and reduces spurious shortcuts, all without changing the underlying model architectures. We observe consistent gains in accuracy and faithfulness across diverse benchmarks.
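To make the plan-then-ground loop in the TLDR concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The scene is a hand-written list of regions rather than the output of a vision model, the plan is supplied by hand rather than derived from the question, and every name (`Region`, `Step`, `ground_step`, `run_plan`, `SCENE`, `RELATIONS`) is a hypothetical chosen for illustration. The sketch also does not model the refinement of textual spans; it only shows how each text span can be paired with one kind of visual primitive and resolved in sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a detected image region."""
    region_id: str
    label: str
    attributes: frozenset = frozenset()


@dataclass(frozen=True)
class Step:
    """One sub-problem: a text span and the primitive kind it refers to."""
    span: str
    kind: str  # "object", "attribute", or "relation"


def ground_step(step, candidates, scene, relations):
    """Resolve one step against the scene, narrowing or moving the candidates."""
    if step.kind == "object":
        # Select every region whose label matches the span.
        return [r for r in scene if r.label == step.span]
    if step.kind == "attribute":
        # Keep only current candidates that carry the attribute.
        return [r for r in candidates if step.span in r.attributes]
    if step.kind == "relation":
        # Hop from the current candidates to the regions they relate to.
        subject_ids = {r.region_id for r in candidates}
        target_ids = {
            obj for (subj, pred, obj) in relations
            if pred == step.span and subj in subject_ids
        }
        return [r for r in scene if r.region_id in target_ids]
    raise ValueError(f"unknown step kind: {step.kind}")


def run_plan(plan, scene, relations):
    """Apply the steps in order and record which regions each step selects."""
    candidates = list(scene)
    trace = []
    for step in plan:
        candidates = ground_step(step, candidates, scene, relations)
        trace.append((step.span, [r.region_id for r in candidates]))
    return candidates, trace


# Hand-written toy scene (an assumption, not data from the paper).
SCENE = [
    Region("r1", "cup", frozenset({"red"})),
    Region("r2", "cup", frozenset({"blue"})),
    Region("r3", "table", frozenset({"wooden"})),
    Region("r4", "shelf", frozenset({"white"})),
]
# Relations as (subject_id, predicate, object_id) triples.
RELATIONS = {("r1", "on", "r3"), ("r2", "on", "r4")}

# Hand-written plan for the question "What is the red cup on?"
PLAN = [Step("cup", "object"), Step("red", "attribute"), Step("on", "relation")]

answer, trace = run_plan(PLAN, SCENE, RELATIONS)
print([r.label for r in answer])
print(trace)
```

The recorded trace is the point of the toy: each text span is tied to the regions it selected, so an intermediate grounding error can be inspected step by step instead of being hidden inside a single end-to-end prediction.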