Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
arxiv.org·5d

Abstract: Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which can also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending the model’s context length, they may miss useful information or require considerable computation. In fact, answering a given question requires only a small amount of crucial information. Therefore, we propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplif…
