Divide and Reason: Joint Image and Language Decomposition for Compositional Reasoning

Dwip Dalal†¹, Madhav Kanda†¹, Zhenhailong Wang¹, Heng Ji¹, Unnat Jain²
¹University of Illinois Urbana–Champaign  ²University of California, Irvine
† Equal contribution

Method Overview

[Figure: method overview diagram for Divide and Reason]

TLDR

Compositional vision–language questions require reasoning over multiple entities and the relations between them. We present Divide and Reason, a framework that jointly decomposes the image and the language into aligned, tractable subproblems. Given a question, our method derives a structured plan, then iteratively grounds visual primitives (objects, attributes, relations) while refining the textual spans that guide each step. This joint image–language decomposition improves perceptual grounding, enables robust multi‑hop reasoning, and reduces spurious shortcuts, all without changing the underlying model architectures. We observe consistent gains in accuracy and faithfulness across diverse benchmarks.
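
To make the plan-then-ground loop concrete, here is a minimal toy sketch in Python. It is not the released implementation: the image is replaced by a hand-written scene (object ids, attributes, relation triples), the plan is written by hand rather than derived by a model, and every name (`Step`, `ground`, `run_plan`, `SCENE_OBJECTS`, `SCENE_RELATIONS`) is hypothetical and introduced only for illustration. It shows only the control flow: each step pairs a growing text span with one visual primitive and narrows the set of candidate objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str      # "object", "attribute", or "relation"
    span: str      # text span that guides this step
    target: tuple  # what to match in the toy scene


# ASSUMPTION: a toy stand-in for an image.
# object id -> (category, attributes), plus (subject, relation, object) triples.
SCENE_OBJECTS = {
    "cup1": ("cup", {"red"}),
    "cup2": ("cup", {"blue"}),
    "table1": ("table", {"wooden"}),
    "shelf1": ("shelf", set()),
}
SCENE_RELATIONS = {("cup1", "on", "table1"), ("cup2", "on", "shelf1")}


def ground(step, candidates):
    """Apply one visual primitive and return the surviving candidate ids."""
    if step.kind == "object":
        (category,) = step.target
        return {o for o, (cat, _) in SCENE_OBJECTS.items() if cat == category}
    if step.kind == "attribute":
        (attr,) = step.target
        return {o for o in candidates if attr in SCENE_OBJECTS[o][1]}
    if step.kind == "relation":
        rel, category = step.target
        return {
            s
            for (s, r, o) in SCENE_RELATIONS
            if s in candidates and r == rel and SCENE_OBJECTS[o][0] == category
        }
    raise ValueError(f"unknown step kind: {step.kind}")


def run_plan(plan):
    """Ground each step in order; keep a trace of (span, candidates)."""
    candidates = set()
    trace = []
    for step in plan:
        candidates = ground(step, candidates)
        trace.append((step.span, sorted(candidates)))
        if not candidates:  # nothing left to ground, so stop early
            break
    return bool(candidates), trace


# Hand-written plan for the question "Is the red cup on the table?"
plan = [
    Step("object", "cup", ("cup",)),
    Step("attribute", "red cup", ("red",)),
    Step("relation", "red cup on the table", ("on", "table")),
]
answer, trace = run_plan(plan)
```

The trace records which objects survive each step, which is one simple way to inspect whether an answer is supported by the intermediate grounding rather than by a shortcut.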