4DP-QA: Scalable QA for 4D Perception in Vision Language Models
Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on mo...