DeepEyesV2: Toward Agentic Multimodal Model
arxiv.org·6h



Abstract: Agentic multimodal models should not only comprehend text and images but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into their reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This observation motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to fu…
