Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
arxiv.org·1d

Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Current approaches either rely on predefined external toolkits or generate images during thinking. Humans, by contrast, can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that they construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly…
