PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement
arxiv.org·2d

Abstract: Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also a way to capture physical dynamics; and (3) collec…
