IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit V...