From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
arxiv.org·14h

Abstract: Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs’ ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 questio…
